[Users] frequent kernel crashes with recent OpenVZ kernel in CentOS6

Wed Jul 3 06:12:43 EDT 2013

Hello,

I am experiencing frequent kernel crashes with the CentOS 6 vzkernel 
provided by OpenVZ. After installing the corresponding vzkernel-devel 
and vzkernel-debuginfo packages, I now have a crash dump available for 
debugging.

These are my first steps at debugging a kernel crash, so please bear 
with me :)

The crash seems to happen in the function start_secondary() in 
arch/x86/kernel/smpboot.c. To be exact, at line 380, which is the 
closing brace of the function.

Here's the relevant log output:

[  777.533212] [Hardware Error]: CPU 12: Machine Check Exception: 4 Bank 2: b200000000020005
[  777.533248] [Hardware Error]: TSC 1c181a41de4
[  777.533272] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 1
[  777.533304] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b200000000020005
[  777.533335] [Hardware Error]: TSC 1c181a41d64
[  777.533358] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 0
[  777.533387] [Hardware Error]: Machine check: Processor context corrupt
[  777.533413] Kernel panic - not syncing: Fatal Machine check
[  777.533437] Pid: 0, comm: swapper veid: 0 Tainted: G   M       ---------------    2.6.32-042stab078.27.debug #1
[  777.533476] Call Trace:
[  777.533487]  <#MC>  [<ffffffff81556011>] ? panic+0xac/0x179
[  777.533523]  [<ffffffff81026aaf>] ? mce_panic+0x20f/0x230
[  777.533547]  [<ffffffff810280f8>] ? do_machine_check+0xa38/0xa80
[  777.533576]  [<ffffffff8155b2a1>] ? machine_check+0x21/0x30
[  777.533602]  [<ffffffff813145e6>] ? intel_idle+0xb6/0x190
[  777.533624]  <<EOE>>  [<ffffffff8155e39d>] ? __atomic_notifier_call_chain+0x9d/0xc0
[  777.533771]  [<ffffffff8145ce68>] ? menu_select+0x178/0x360
[  777.533850]  [<ffffffff8145bce7>] ? cpuidle_idle_call+0xa7/0x170
[  777.533934]  [<ffffffff8100a06b>] ? cpu_idle+0xbb/0x110
[  777.534012]  [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e
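
As a side note on reading the raw values in this log, here is a minimal 
Python sketch that decodes them. It assumes the architectural 
IA32_MCG_STATUS and IA32_MCi_STATUS bit layout as described in the Intel 
SDM, that the number after "Machine Check Exception:" is the global 
MCG status, and that TIME is in seconds since the Unix epoch. The 
function and table names are made up for illustration.

```python
from datetime import datetime, timezone

# Assumed bit positions (Intel SDM, machine-check architecture chapter).
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
MCI_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}


def set_flags(value, flags):
    """Return the names of all flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(flags.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mce(mcgstatus, status):
    """Split the global status and one bank status into readable fields."""
    return {
        "mcg_flags": set_flags(mcgstatus, MCG_FLAGS),
        "bank_flags": set_flags(status, MCI_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def mce_time_utc(seconds):
    """Interpret the TIME field as Unix epoch seconds (assumption)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the log excerpt above.
    print(decode_mce(0x4, 0xB200000000020005))
    print(mce_time_utc(1372805901).isoformat())
```

With the values from the log, this reports MCIP for the global status 
and VAL, UC, EN and PCC for bank 2. PCC matches the "Processor context 
corrupt" message. The MCA error code comes out as 0x0005, which the 
SDM's table of simple error codes appears to list as an internal parity 
error; that reading is worth double-checking against the manual.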

I attached the full disassembly of start_secondary() up to and including 
address ffffffff8154f783 ('dis -rl ffffffff8154f783' in the crash 
utility) to this mail.
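
For reference, each frame in the call trace has the form 
address, then symbol+offset/size. The following Python sketch splits 
such a frame into its parts. The regular expression and the function 
name are hypothetical and only cover the single-line frame format seen 
in this log. As far as I understand, the '?' means that the address was 
found on the stack without being verified, and a frame address is a 
return address, i.e. it points just past the call instruction.

```python
import re

# Matches e.g. "[<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+\?\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frame(line):
    """Return the fields of one call-trace frame, or None if no frame is found."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    address = int(match.group(1), 16)
    offset = int(match.group(3), 16)
    size = int(match.group(4), 16)
    return {
        "symbol": match.group(2),
        "address": address,
        "offset": offset,          # bytes from the start of the function
        "size": size,              # total size of the function in bytes
        "start": address - offset  # address of the function's first byte
    }


if __name__ == "__main__":
    frame = parse_frame(
        "[  777.534012]  [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e"
    )
    print(frame["symbol"], hex(frame["start"]),
          hex(frame["offset"]), hex(frame["size"]))
```

For the last frame this gives a function start of ffffffff8154f4b8 and 
an offset of 0x2cb out of 0x30e bytes, i.e. a return address close to 
the end of start_secondary(), presumably right after the call into 
cpu_idle().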

To be honest, I don't know whether the crash is related to the OpenVZ 
patches in vzkernel at all. If I understand it correctly, it happens 
when a secondary processor is activated, i.e. when a CPU is brought up 
in order to take a process, right?

As the crash happens on an OpenVZ kernel, I'm reporting it here first. 
Unfortunately, trying either a vanilla upstream kernel or the stock 
CentOS kernel is not an option. The server is a production system, and 
OpenVZ is critical to the services it provides.

Maybe the crash is related to hardware issues? I have already replaced 
both the mainboard and the CPUs, so at least broken chips are rather 
unlikely. Some details about the hardware follow:

Tyan S7025 board
2x Intel Xeon L5640 CPU
48GB Kingston Registered ECC Memory
LSI MegaRAID SAS 2108 RAID Controller with 29 physical drives in three 
RAID10 arrays (with two Chenbro expander cards)

Given the replacements, a defect in this particular board or these CPUs 
is rather unlikely. But I cannot rule out a hardware bug that affects 
the whole series.
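
To look up errata for the CPU series, the family, model and stepping 
are needed. A minimal Python sketch for extracting them follows, under 
the assumption that the "PROCESSOR 0:206c0" field in the log is 
vendor:signature, where the signature is the EAX value of CPUID leaf 1. 
The function name is hypothetical; the field layout is the one I know 
from the CPUID documentation.

```python
def decode_cpuid_signature(signature):
    """Split a CPUID leaf 1 EAX value into family, model and stepping."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended family only counts when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only counts for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # Signature copied from the "PROCESSOR 0:206c0" field in the log.
    print(decode_cpuid_signature(0x206C0))
```

For 206c0 this yields family 6, model 0x2c (44) and stepping 0, which 
can then be compared against the specification update for the CPU.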

I hope that you have further advice for me on how to proceed with the 
debugging. At the moment I'm rather lost :(

Kind regards,
  jonas
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: OpenVZ_kernel_crash_mem_dis.txt
URL: <http://lists.openvz.org/pipermail/users/attachments/20130703/f565aa7a/attachment-0001.txt>