[Users] frequent kernel crashes with recent OpenVZ kernel in CentOS6
Vasily Averin
vvs at parallels.com
Wed Jul 3 09:10:57 EDT 2013
Dear Jonas,
it's a hardware problem.
A Machine Check Exception is raised by the CPU itself and cannot be generated by software errors.
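
For reference, the status word in the quoted log below (b200000000020005) can be unpacked by hand. The following is a minimal sketch in Python, assuming the IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual. The function name decode_mci_status and the dictionary keys are illustrative only and not part of any kernel tool.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (assumed layout per the Intel SDM).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this bank
    59: "MISCV",  # IA32_MCi_MISC holds additional information
    58: "ADDRV",  # IA32_MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit machine check status word into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,                # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,   # bits 31:16
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xB200000000020005)
    print(decoded["flags"])
    print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
```

For the value above this yields the flags VAL, UC, EN and PCC, an MCA error code of 0x5 and a model-specific code of 0x2. PCC is the bit behind the "Processor context corrupt" line in the log. In the SDM's simple error code table, 0x0005 appears to correspond to an internal parity error; that mapping is an assumption here and should be checked against the manual for this CPU model.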
thank you,
Vasily Averin
On 07/03/2013 02:12 PM, Jonas Meurer wrote:
> Hello,
>
> I experience frequent kernel crashes with the CentOS vzkernel provided by OpenVZ. After installing the corresponding vzkernel-devel and -debuginfo packages, I now have a crash coredump available for debugging.
>
> These are my first steps at debugging a kernel crash, so please bear with me :)
>
> The crash seems to happen in the function start_secondary() in arch/x86/kernel/smpboot.c, specifically at line 380, which is the closing brace of the function.
>
> Here's the relevant log output:
>
> [ 777.533212] [Hardware Error]: CPU 12: Machine Check Exception: 4 Bank 2: b200000000020005
> [ 777.533248] [Hardware Error]: TSC 1c181a41de4
> [ 777.533272] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 1
> [ 777.533304] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b200000000020005
> [ 777.533335] [Hardware Error]: TSC 1c181a41d64
> [ 777.533358] [Hardware Error]: PROCESSOR 0:206c0 TIME 1372805901 SOCKET 0 APIC 0
> [ 777.533387] [Hardware Error]: Machine check: Processor context corrupt
> [ 777.533413] Kernel panic - not syncing: Fatal Machine check
> [ 777.533437] Pid: 0, comm: swapper veid: 0 Tainted: G M --------------- 2.6.32-042stab078.27.debug #1
> [ 777.533476] Call Trace:
> [ 777.533487] <#MC> [<ffffffff81556011>] ? panic+0xac/0x179
> [ 777.533523] [<ffffffff81026aaf>] ? mce_panic+0x20f/0x230
> [ 777.533547] [<ffffffff810280f8>] ? do_machine_check+0xa38/0xa80
> [ 777.533576] [<ffffffff8155b2a1>] ? machine_check+0x21/0x30
> [ 777.533602] [<ffffffff813145e6>] ? intel_idle+0xb6/0x190
> [ 777.533624] <<EOE>> [<ffffffff8155e39d>] ? __atomic_notifier_call_chain+0x9d/0xc0
> [ 777.533771] [<ffffffff8145ce68>] ? menu_select+0x178/0x360
> [ 777.533850] [<ffffffff8145bce7>] ? cpuidle_idle_call+0xa7/0x170
> [ 777.533934] [<ffffffff8100a06b>] ? cpu_idle+0xbb/0x110
> [ 777.534012] [<ffffffff8154f783>] ? start_secondary+0x2cb/0x30e
>
> I attached the full output of disassembled memory starting with address ffffffff8154f783 ('dis -rl ffffffff8154f783') to this mail.
>
> To be honest, I don't know whether the crash is related to the OpenVZ patches in vzkernel at all. If I understand it correctly, it happens when a secondary processor is activated, i.e. when a CPU is booted in order to take a process, right?
>
> As the crash happens on an OpenVZ kernel, I'm reporting it here first. Unfortunately it's not an option to try either vanilla upstream or CentOS mainline kernel. The server is a production system, and OpenVZ is critical to the provided services.
>
> Maybe the crash is related to hardware issues? I already replaced both mainboard and CPUs, so at least broken chips are rather unlikely. Below follow some details about the hardware:
>
> Tyan S7025 board
> 2x Intel Xeon L5640 CPU
> 48GB Kingston Registered ECC Memory
> LSI MegaRAID SAS 2108 RAID Controller with 29 physical drives in three RAID10 arrays (with two Chenbro expander cards)
>
>
> Issues with this particular board or these CPUs are rather unlikely due to the replacements, but I cannot exclude hardware bugs in this series.
>
>
> I hope that you have further advice for me on how to proceed with the debugging. At the moment I'm rather lost :(
>
> Kind regards,
> jonas