[Users] frequent kernel crashes with recent OpenVZ kernel in CentOS6

Fri Jul 5 10:54:27 EDT 2013

It might be due to temperature issues, or simply because nothing has touched that faulty memory range since then.
But overall, statistics suggest that once hardware has failed this way, it is more than 10x as likely to fail again as hardware on which it never happened.
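
For anyone who wants to read the status words in the quoted logs below by hand, here is a minimal sketch that splits an MCE bank status value into its architectural fields. It assumes the IA32_MCi_STATUS bit layout from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, model-specific code in 31:16, MCA error code in 15:0). The function name and dictionary keys are my own, not from any tool.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields.

    Bit positions are assumed from the Intel SDM machine-check chapter;
    the field names used as keys here are illustrative only.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),        # VAL: register holds a valid error
        "overflow": bit(62),     # OVER: an earlier error was overwritten
        "uncorrected": bit(61),  # UC: hardware could not correct the error
        "enabled": bit(60),      # EN: error reporting was enabled
        "miscv": bit(59),        # MISCV: the MISC register is valid
        "addrv": bit(58),        # ADDRV: the ADDR register is valid
        "pcc": bit(57),          # PCC: processor context corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # The two status words quoted in this thread.
    for raw in (0xB200000000020005, 0xB200000802000E0F):
        fields = decode_mci_status(raw)
        print(f"{raw:016x}")
        for name, value in fields.items():
            shown = hex(value) if not isinstance(value, bool) else value
            print(f"  {name}: {shown}")
```

Both status words have UC and PCC set, which matches the "Processor context corrupt" line in both logs. The low 16 bits are 0x0005 in Jonas's log and 0x0e0f in Aleksandar's. As far as I remember, the SDM lists 0x0005 as an internal parity error, but check that against the manual or a tool such as mcelog rather than trusting my memory.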

On Jul 5, 2013, at 16:25 , Aleksandar Ivanisevic <aleksandar at ivanisevic.de> wrote:

> 
> FWIW, I got a similar MCE 7 days ago with 2.6.32-042stab076.5. It
> was after 66 days of uptime and hasn't happened since so I'm not yet
> doing anything about it, but you might want to try downgrading the
> kernel a version or two to see what happens, that's what I would do.
> 
> Jun 27 16:07:37 [5041215.152135] Disabling lock debugging due to kernel taint 
> Jun 27 16:07:37 [5041215.147208] [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: b200000802000e0f 
> Jun 27 16:07:38 [5041215.147208] [Hardware Error]: RIP !INEXACT! 10:<ffffffff81014727> {mwait_idle+0x77/0xd0} 
> Jun 27 16:07:38 [5041215.147208] [Hardware Error]: TSC 2ca9d75e8b36c8  
> Jun 27 16:07:38 [5041215.147208] [Hardware Error]: PROCESSOR 0:1067a TIME 1372342055 SOCKET 1 APIC 7 
> Jun 27 16:07:38 [5041215.147208] [Hardware Error]: Some CPUs didn't answer in synchronization 
> Jun 27 16:07:38 [5041215.147208] [Hardware Error]: Machine check: Processor context corrupt 
> Jun 27 16:07:38 [5041215.147208] Kernel panic - not syncing: Fatal machine check on current CPU 
> Jun 27 16:07:38 [5041215.147208] Pid: 0, comm: swapper veid: 0 Tainted: G   M       ---------------    2.6.32-042stab076.5 #1 
> Jun 27 16:07:38 [5041215.147208] Call Trace: 
> Jun 27 16:07:38 [5041215.147208]  <#MC>  [<ffffffff814f0dbe>] ? panic+0xa0/0x168 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff8102280f>] ? mce_panic+0x21f/0x240 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff81023b73>] ? do_machine_check+0x833/0xa60 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff81014727>] ? mwait_idle+0x77/0xd0 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff814f4a0c>] ? machine_check+0x1c/0x30 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff81014727>] ? mwait_idle+0x77/0xd0 
> Jun 27 16:07:38 [5041215.147208]  <<EOE>>  [<ffffffff8100a026>] ? cpu_idle+0xb6/0x110 
> Jun 27 16:07:38 [5041215.147208]  [<ffffffff814eaa2e>] ? start_secondary+0x22a/0x26d 
> Jun 27 16:07:38 [5041215.147208] panic occurred, switching back to text console 
> Jun 27 16:07:38 [5041216.652619] Rebooting in 30 seconds..
> 
> 
> Vasily Averin <vvs at parallels.com> writes:
> 
>> Dear Jonas,
>> it's a hardware problem.
>> Machine Check Exception cannot be generated by software errors.
>> 
>> thank you,
>> 	Vasily Averin
>> 
>> On 07/03/2013 02:12 PM, Jonas Meurer wrote:
>>> Hello,
>>> 
>>> I experience frequent kernel crashes with the CentOS vzkernel provided by OpenVZ. After installing the corresponding vzkernel-devel and -debuginfo packages, I now have a crash coredump available for debugging.
>>> 
>>> These are my first steps at debugging a kernel crash, so please bear with me :)
>>> 
>>> The crash seems to happen in function start_secondary() of arch/x86/kernel/smpboot.c. To be exact, in line 380, which is the closing bracket of the function.
>>> 
>>> Here's the relevant log output:
>>> 
>>> [  777.533212] [Hardware Error]: CPU 12: Machine Check Exception: 4 Bank 2: b200000000020005
>>> [  777.533248] [Hardware Error]: TSC 1c181a41de4
>>> [  777.533272] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 1
>>> [  777.533304] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b200000000020005
>>> [  777.533335] [Hardware Error]: TSC 1c181a41d64
>>> [  777.533358] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 0
>>> [  777.533387] [Hardware Error]: Machine check: Processor context corrupt
>>> [  777.533413] Kernel panic - not syncing: Fatal Machine check
>>> [  777.533437] Pid: 0, comm: swapper veid: 0 Tainted: G   M       ---------------    2.6.32-042stab078.27.debug #1
>>> [  777.533476] Call Trace:
>>> [  777.533487]  <#MC>  [<ffffffff81556011>] ? panic+0xac/0x179
>>> [  777.533523]  [<ffffffff81026aaf>] ? mce_panic+0x20f/0x230
>>> [  777.533547]  [<ffffffff810280f8>] ? do_machine_check+0xa38/0xa80
>>> [  777.533576]  [<ffffffff8155b2a1>] ? machine_check+0x21/0x30
>>> [  777.533602]  [<ffffffff813145e6>] ? intel_idle+0xb6/0x190
>>> [  777.533624]  <<EOE>>  [<ffffffff8155e39d>] ? __atomic_notifier_call_chain+0x9d/0xc0
>>> [  777.533771]  [<ffffffff8145ce68>] ? menu_select+0x178/0x360
>>> [  777.533850]  [<ffffffff8145bce7>] ? cpuidle_idle_call+0xa7/0x170
>>> [  777.533934]  [<ffffffff8100a06b>] ? cpu_idle+0xbb/0x110
>>> [  777.534012]  [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e
>>> 
>>> I attached the full output of disassembled memory starting with address ffffffff8154f783 ('dis -rl ffffffff8154f783') to this mail.
>>> 
>>> To be honest, I don't know whether the crash is related to the OpenVZ patches in vzkernel at all. If I got it right, it happens when a secondary processor is activated, i.e. when a CPU is booted in order to take a process, right?
>>> 
>>> As the crash happens on an OpenVZ kernel, I'm reporting it here first. Unfortunately it's not an option to try either vanilla upstream or CentOS mainline kernel. The server is a production system, and OpenVZ is critical to the provided services.
>>> 
>>> Maybe the crash is related to hardware issues? I already replaced both mainboard and CPUs, so at least broken chips are rather unlikely. Below follow some details about the hardware:
>>> 
>>> Tyan S7025 board
>>> 2x Intel Xeon L5640 CPU
>>> 48GB Kingston Registered ECC Memory
>>> LSI MegaRAID SAS 2108 RAID Controller with 29 physical drives in three RAID10 arrays (with two Chenbro expander cards)
>>> 
>>> 
>>> Issues with this particular board or these CPUs are rather unlikely given the replacements, but I cannot exclude hardware bugs in the series used.
>>> 
>>> 
>>> I hope that you have further advice for me on how to proceed with the debugging. At the moment I'm rather lost :(
>>> 
>>> Kind regards,
>>> jonas
>>> 
>>> 
>>> _______________________________________________
>>> Users mailing list
>>> Users at openvz.org
>>> https://lists.openvz.org/mailman/listinfo/users
>>> 
> 
> -- 
> Ti si arogantan, prepotentan i peglaš vlastitu frustraciju. -- Ivan
> Tišljar, hr.comp.os.linux
> 