[Users] frequent kernel crashes with recent OpenVZ kernel in CentOS6

Wed Jul 3 10:58:35 EDT 2013

Hello Vasily,

On 2013-07-03 15:10, Vasily Averin wrote:
> Dear Jonas,
> it's a hardware problem.
> A Machine Check Exception cannot be generated by software errors.

Thanks for the prompt reply. I had already suspected this. I will replace
more hardware before going deeper into debugging.
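
For the archive: the status word from the log can be decoded by hand, which
shows why the kernel treats this as fatal. Below is a minimal sketch in
Python. The bit positions are my reading of the Intel SDM layout for
IA32_MCi_STATUS and IA32_MCG_STATUS, and the function names
decode_mci_status() and decode_mcg_status() are made up for illustration;
they are not part of any existing tool.

```python
# Flag bit positions, assumed from the Intel SDM (machine-check architecture).
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # IA32_MCi_MISC holds additional information
    "ADDRV": 58,  # IA32_MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}
MCG_STATUS_FLAGS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and error codes."""
    fields = {name: bool((status >> bit) & 1)
              for name, bit in MCI_STATUS_FLAGS.items()}
    fields["mca_error_code"] = status & 0xFFFF            # bits 15:0
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return fields


def decode_mcg_status(mcg):
    """Decode the global status value printed after 'Machine Check Exception:'."""
    return {name: bool((mcg >> bit) & 1)
            for name, bit in MCG_STATUS_FLAGS.items()}


if __name__ == "__main__":
    # Values taken from the log lines for CPU 0 and CPU 12, Bank 2.
    for name, value in decode_mci_status(0xB200000000020005).items():
        print(name, value)
    print(decode_mcg_status(4))
```

With the value from the log, VAL, UC, EN and PCC come out set, which matches
the "Processor context corrupt" line, and ADDRV is clear, so no faulting
address was recorded. The low 16 bits are 0x0005, which, if I read the SDM's
simple error code table correctly, denotes an internal parity error.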

Kind regards,
  jonas

> On 07/03/2013 02:12 PM, Jonas Meurer wrote:
>> Hello,
>> 
>> I am experiencing frequent kernel crashes with the CentOS vzkernel 
>> provided by OpenVZ. After installing the corresponding vzkernel-devel 
>> and vzkernel-debuginfo packages, I now have a crash dump available 
>> for debugging.
>> 
>> These are my first steps at debugging a kernel crash, so please bear 
>> with me :)
>> 
>> The crash seems to happen in the function start_secondary() in 
>> arch/x86/kernel/smpboot.c. To be exact, in line 380, which is the 
>> closing brace of the function.
>> 
>> Here's the relevant log output:
>> 
>> [  777.533212] [Hardware Error]: CPU 12: Machine Check Exception: 4 Bank 2: b200000000020005
>> [  777.533248] [Hardware Error]: TSC 1c181a41de4
>> [  777.533272] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 1
>> [  777.533304] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b200000000020005
>> [  777.533335] [Hardware Error]: TSC 1c181a41d64
>> [  777.533358] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 0
>> [  777.533387] [Hardware Error]: Machine check: Processor context corrupt
>> [  777.533413] Kernel panic - not syncing: Fatal Machine check
>> [  777.533437] Pid: 0, comm: swapper veid: 0 Tainted: G   M       ---------------    2.6.32-042stab078.27.debug #1
>> [  777.533476] Call Trace:
>> [  777.533487]  <#MC>  [<ffffffff81556011>] ? panic+0xac/0x179
>> [  777.533523]  [<ffffffff81026aaf>] ? mce_panic+0x20f/0x230
>> [  777.533547]  [<ffffffff810280f8>] ? do_machine_check+0xa38/0xa80
>> [  777.533576]  [<ffffffff8155b2a1>] ? machine_check+0x21/0x30
>> [  777.533602]  [<ffffffff813145e6>] ? intel_idle+0xb6/0x190
>> [  777.533624]  <<EOE>>  [<ffffffff8155e39d>] ? __atomic_notifier_call_chain+0x9d/0xc0
>> [  777.533771]  [<ffffffff8145ce68>] ? menu_select+0x178/0x360
>> [  777.533850]  [<ffffffff8145bce7>] ? cpuidle_idle_call+0xa7/0x170
>> [  777.533934]  [<ffffffff8100a06b>] ? cpu_idle+0xbb/0x110
>> [  777.534012]  [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e
>> 
>> I attached the full output of disassembled memory starting with 
>> address ffffffff8154f783 ('dis -rl ffffffff8154f783') to this mail.
>> 
>> To be honest, I don't know whether the crash is related to the OpenVZ 
>> patches in vzkernel at all. If I got it right, it happens when a 
>> secondary processor is activated, i.e. when a CPU is booted in order 
>> to take a process, right?
>> 
>> As the crash happens on an OpenVZ kernel, I'm reporting it here first. 
>> Unfortunately it's not an option to try either vanilla upstream or 
>> CentOS mainline kernel. The server is a production system, and OpenVZ 
>> is critical to the provided services.
>> 
>> Maybe the crash is related to hardware issues? I already replaced both 
>> mainboard and CPUs, so at least broken chips are rather unlikely. 
>> Below follow some details about the hardware:
>> 
>> Tyan S7025 board
>> 2x Intel Xeon L5640 CPU
>> 48GB Kingston Registered ECC Memory
>> LSI MegaRAID SAS 2108 RAID Controller with 29 physical drives in three 
>> RAID10 arrays (with two Chenbro expander cards)
>> 
>> 
>> Issues with this particular board or these CPUs are rather unlikely 
>> given the replacements, but I cannot rule out hardware bugs in the 
>> series used.
>> 
>> 
>> I hope that you have further advice for me on how to proceed with the 
>> debugging. At the moment I'm rather lost :(
>> 
>> Kind regards,
>>  jonas
>> 
>> 