[Users] frequent kernel crashes with recent OpenVZ kernel in CentOS6

Wed Jul 3 11:58:15 EDT 2013

The best course of action is to file a bug at bugzilla.openvz.org, 
product: OpenVZ, component: kernel. This way the bug can be tracked.

Also, although you don't say so, it sounds as if an older kernel works 
for you. If so, please specify in the report which kernel version works 
and which one crashes.

On 07/03/2013 03:12 AM, Jonas Meurer wrote:
> Hello,
>
> I am experiencing frequent kernel crashes with the CentOS vzkernel 
> provided by OpenVZ. After installing the corresponding vzkernel-devel 
> and -debuginfo packages, I now have a crash dump available for debugging.
>
> These are my first steps at debugging a kernel crash, so please bear 
> with me :)
>
> The crash seems to happen in the function start_secondary() in 
> arch/x86/kernel/smpboot.c. To be exact, at line 380, which is the 
> closing brace of the function.
>
> Here's the relevant log output:
>
> [  777.533212] [Hardware Error]: CPU 12: Machine Check Exception: 4 Bank 2: b200000000020005
> [  777.533248] [Hardware Error]: TSC 1c181a41de4
> [  777.533272] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 1
> [  777.533304] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b200000000020005
> [  777.533335] [Hardware Error]: TSC 1c181a41d64
> [  777.533358] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 0
> [  777.533387] [Hardware Error]: Machine check: Processor context corrupt
> [  777.533413] Kernel panic - not syncing: Fatal Machine check
> [  777.533437] Pid: 0, comm: swapper veid: 0 Tainted: G   M ---------------    2.6.32-042stab078.27.debug #1
> [  777.533476] Call Trace:
> [  777.533487]  <#MC>  [<ffffffff81556011>] ? panic+0xac/0x179
> [  777.533523]  [<ffffffff81026aaf>] ? mce_panic+0x20f/0x230
> [  777.533547]  [<ffffffff810280f8>] ? do_machine_check+0xa38/0xa80
> [  777.533576]  [<ffffffff8155b2a1>] ? machine_check+0x21/0x30
> [  777.533602]  [<ffffffff813145e6>] ? intel_idle+0xb6/0x190
> [  777.533624]  <<EOE>>  [<ffffffff8155e39d>] ? __atomic_notifier_call_chain+0x9d/0xc0
> [  777.533771]  [<ffffffff8145ce68>] ? menu_select+0x178/0x360
> [  777.533850]  [<ffffffff8145bce7>] ? cpuidle_idle_call+0xa7/0x170
> [  777.533934]  [<ffffffff8100a06b>] ? cpu_idle+0xbb/0x110
> [  777.534012]  [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e
>
> I have attached the full disassembly output for address 
> ffffffff8154f783 ('dis -rl ffffffff8154f783') to this mail.
>
> To be honest, I don't know whether the crash is related to the OpenVZ 
> patches in vzkernel at all. If I understand it correctly, it happens 
> when a secondary processor is activated, i.e. when a CPU is booted in 
> order to take a process, right?
>
> As the crash happens on an OpenVZ kernel, I'm reporting it here first. 
> Unfortunately, trying either a vanilla upstream kernel or the stock 
> CentOS kernel is not an option. The server is a production system, and 
> OpenVZ is critical to the services it provides.
>
> Maybe the crash is related to hardware issues? I already replaced both 
> mainboard and CPUs, so at least broken chips are rather unlikely. 
> Below follow some details about the hardware:
>
> Tyan S7025 board
> 2x Intel Xeon L5640 CPU
> 48GB Kingston Registered ECC Memory
> LSI MegaRAID SAS 2108 RAID Controller with 29 physical drives in three 
> RAID10 arrays (with two Chenbro expander cards)
>
>
> Issues with this particular board or these particular CPUs are rather 
> unlikely given the replacements. But I cannot rule out hardware bugs 
> in the series used.
>
>
> I hope that you have further advice for me on how to proceed with the 
> debugging. At the moment I'm rather lost :(
>
> Kind regards,
>  jonas
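
For anyone reading the log above: the "Bank 2" value is the raw 
machine-check status register for that bank, and it can be decoded by 
hand. Below is a minimal sketch in Python, assuming the architectural 
IA32_MCi_STATUS bit layout described in Intel's Software Developer's 
Manual (flag bits 63 down to 57, model-specific code in bits 31:16, MCA 
error code in bits 15:0). The names STATUS_FLAGS and decode_mci_status 
are made up for this illustration and are not part of any kernel or 
OpenVZ tool.

```python
# Flag bits of IA32_MCi_STATUS, highest bit first.
# Layout is an assumption taken from the Intel SDM, not from this thread.
STATUS_FLAGS = [
    (63, "VAL", "register holds a valid error"),
    (62, "OVER", "an earlier error was overwritten"),
    (61, "UC", "uncorrected error"),
    (60, "EN", "error reporting enabled"),
    (59, "MISCV", "MCi_MISC register is valid"),
    (58, "ADDRV", "MCi_ADDR register is valid"),
    (57, "PCC", "processor context corrupt"),
]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error-code fields."""
    flags = [name for bit, name, _ in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# "Bank 2" value reported for both CPU 0 and CPU 12 in the log.
decoded = decode_mci_status(0xB200000000020005)
print(decoded["flags"])
print(hex(decoded["model_specific_code"]), hex(decoded["mca_error_code"]))
```

For this value the flags that come out set are VAL, UC, EN and PCC, 
which agrees with the "Processor context corrupt" line the kernel 
printed. ADDRV and MISCV are clear, so the bank recorded no address for 
the error. The low 16 bits are 0x0005. If I read the SDM's table of 
simple error codes correctly, that is an internal parity error, but 
please check this against the manual for this CPU model rather than 
taking my word for it.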