[Users] consistent re-occurring kernel oops rebooting HN

Thu Jul 23 07:41:15 EDT 2009

Sergej Kandyla wrote:
> JR Richardson wrote:
>> Hi All,
>>
>> I'm running OpenVZ on Debian Etch with 2.6.18-openvz-13-1etch5-686
>> kernel.  I have 6 HN's, identical hardware/specs.  I run 16 VE's on
>> each HN, 4 in production and 2 standby HN's to migrate VE's to during
>> maintenance.
>>
>> Here is the issue: after 4 to 5 months of operation, 1 or 2 VE's max
>> out their kernel mem.  I stop the VE for a while (10 minutes or so)
>> to clear out the kernel mem for that VE, then restart it, but that
>> does not seem to work, so I migrate the VE to the standby HN.  I then
>> plan a maintenance, migrate all the VE's off the HN to the standby,
>> and proceed to reboot the production HN.
>>
>> I get the oops, then have to manually power off the node to get it
>> back.  This happens consistently across all production HN's but only
>> after the nodes have been up and running with a load for several
>> weeks.  If I migrate/reboot a production HN within, say, the first 30
>> days of operation, it does well; I only have an issue after an
>> extended period of operation.  The only indication that a HN is having
>> an issue is a VE maxing out its kernel mem and not releasing it.  The
>> standby HN's, with no load on them, can reboot without an issue, even
>> after 6 months.  I've tried to duplicate this in a lab environment but
>> just cannot load the HN's enough to cause the error.
>>
>> Any guidance with troubleshooting, fault isolation, a kernel version
>> upgrade, or any other direction will be much appreciated.
>
> Hi.
> I've written a little howto about kernel troubleshooting with crash and
> kdump:
> http://paix.org.ua/linux/crashdebug.html

Sorry, my howto is focused on CentOS 5/RHEL 5, but the debugging
procedure will be similar on Debian too.

You just need to install the kexec-tools and crash packages, plus the
kernel-debuginfo and kernel-debuginfo-common packages
for the corresponding kernel version.
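
kdump can only capture a dump if memory was reserved for the capture
kernel at boot with the crashkernel= parameter, so it is worth checking
for that before waiting for the next oops. A minimal Python sketch of
such a check follows; the function names are only illustrative and are
not part of any package mentioned here.

```python
import re
from typing import Optional


def crashkernel_param(cmdline: str) -> Optional[str]:
    """Return the value of crashkernel= from a kernel command line, or None."""
    # Match crashkernel= only as a separate parameter, not as part of another word.
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    return match.group(1) if match else None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On the HN, crashkernel_param(read_cmdline()) returning None would
suggest that no memory was reserved and that the boot loader entry
still needs the parameter.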

>
> The main idea is to save a kernel core dump and then analyse it with
> the crash tool.
> But I don't know where to download the kernel-debuginfo and
> kernel-debuginfo-common packages
> for the openvz kernels.
>
> Also this info may be helpful:
> http://wiki.openvz.org/When_you_have_an_oops
>
>
>>   Any other info
>> needed?  Should this be a bug report?
>>
>> Here is the error I get, same on each HN:
>>
>> Jul 21 21:29:01 astvs1 kernel: VE: 115: stopped
>> Jul 21 21:29:19 astvs1 kernel: VE: 116: stopped
>> Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2110]: (root) CMD
>> (/usr/share/vzctl/scripts/vpsreboot)
>> Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2112]: (root) CMD
>> (/usr/share/vzctl/scripts/vpsnetclean)
>> Jul 21 21:30:15 astvs1 shutdown[2222]: shutting down for system reboot
>> Jul 21 21:30:15 astvs1 init: Switching to runlevel: 6
>> Jul 21 21:30:17 astvs1 kernel: BUG: unable to handle kernel paging
>> request at virtual address 70000059
>> Jul 21 21:30:17 astvs1 kernel:  printing eip:
>> Jul 21 21:30:17 astvs1 kernel: c017632a
>> Jul 21 21:30:17 astvs1 kernel: *pde = 00000000
>> Jul 21 21:30:17 astvs1 kernel: Oops: 0000 [#2]
>> Jul 21 21:30:17 astvs1 kernel: SMP
>> Jul 21 21:30:17 astvs1 kernel: Modules linked in: vzethdev vznetdev
>> simfs vzrst ip_nat vzcpt ip_conntrack nfnetlink vzdquota vzmon vzdev
>> xt_tcpudp xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle
>> iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables
>> x_tables wctdm ipv6 dm_snapshot dm_mirror dm_mod zttranscode ztdummy
>> zaptel crc_ccitt loop serio_raw psmouse i2c_i801 i2c_core evdev
>> pcspkr rtc ext3 jbd mbcache sd_mod ata_piix libata scsi_mod ehci_hcd
>> generic piix ide_core uhci_hcd usbcore tg3 processor
>> Jul 21 21:30:17 astvs1 kernel: CPU:    0, VCPU: 100.0
>> Jul 21 21:30:17 astvs1 kernel: EIP:    0060:[<c017632a>]    Not 
>> tainted VLI
>> Jul 21 21:30:17 astvs1 kernel: EFLAGS: 00210202
>> (2.6.18-openvz-13-1etch5-686 #1)
>> Jul 21 21:30:17 astvs1 kernel: EIP is at iput+0x28/0x69
>> Jul 21 21:30:17 astvs1 kernel: eax: 70000045   ebx: d8a177dc   ecx:
>> d4331d74   edx: d4331d74
>> Jul 21 21:30:17 astvs1 kernel: esi: d8a177dc   edi: d41fc1a8   ebp:
>> d4331d7c   esp: f3935cb8
>> Jul 21 21:30:17 astvs1 kernel: ds: 007b   es: 007b   ss: 0068
>> Jul 21 21:30:17 astvs1 kernel: Process asterisk (pid: 18888, veid:
>> 100, ti=f3934000 task=cccda800 task.ti=f3934000)
>> Jul 21 21:30:17 astvs1 kernel: Stack: d41fc1a8 c017451b d41fc1a8
>> d4331d7c c0174676 d526b248 d4331d7c c019803b
>>
>> Jul 21 21:30:17 astvs1 kernel:        d526b248 d4331d74 d526b248
>> d526b240 d526b248 dffa0600 c01b1f30 00000000
>>
>> Jul 21 21:30:17 astvs1 kernel:        c020b201 00000000 00000000
>> d526b240 f393f300 00000008 00000000 c020b227
>>
>> Jul 21 21:30:17 astvs1 kernel:  Call Trace:
>> Jul 21 21:30:17 astvs1 kernel:  [<c017451b>] dentry_iput+0x68/0x83
>> Jul 21 21:30:17 astvs1 kernel:  [<c0174676>] dput_recursive+0xfb/0x113
>> Jul 21 21:30:17 astvs1 kernel:  [<c019803b>] 
>> sysfs_remove_dir+0x116/0x127
>> Jul 21 21:30:17 astvs1 kernel:  [<c01b1f30>] kobject_del+0x8/0x10
>> Jul 21 21:30:17 astvs1 kernel:  [<c020b201>] 
>> class_device_del+0x103/0x121
>> Jul 21 21:30:17 astvs1 kernel:  [<c020b227>] 
>> class_device_unregister+0x8/0x10
>> Jul 21 21:30:17 astvs1 kernel:  [<c01fb284>] vcs_remove_devfs+0x17/0x31
>> Jul 21 21:30:17 astvs1 kernel:  [<c01ffc4f>] con_close+0x49/0x5b
>> Jul 21 21:30:17 astvs1 kernel:  [<c01f480d>] release_dev+0x1b4/0x600
>> Jul 21 21:30:17 astvs1 kernel:  [<c013150d>] ub_page_uncharge+0x3b/0x46
>> Jul 21 21:30:17 astvs1 kernel:  [<c01f4c68>] tty_release+0xf/0x18
>> Jul 21 21:30:17 astvs1 kernel:  [<c0161970>] __fput+0x90/0x147
>> Jul 21 21:30:17 astvs1 kernel:  [<c015f561>] filp_close+0x4e/0x54
>> Jul 21 21:30:17 astvs1 kernel:  [<c011d05c>] put_files_struct+0x65/0xa7
>> Jul 21 21:30:17 astvs1 kernel:  [<c011e4bf>] do_exit+0x52f/0xb23
>> Jul 21 21:30:17 astvs1 kernel:  [<c01184b8>] 
>> fairsched_schedule+0x30a/0x5a0
>> Jul 21 21:30:17 astvs1 kernel:  [<c027d98f>] schedule+0x353/0xd29
>> Jul 21 21:30:17 astvs1 kernel:  [<c0125193>] 
>> __dequeue_signal+0x160/0x16b
>> Jul 21 21:30:17 astvs1 kernel:  [<c011eb2c>] sys_exit_group+0x0/0xd
>> Jul 21 21:30:17 astvs1 kernel:  [<c0127595>] 
>> get_signal_to_deliver+0x3c2/0x3e9
>> Jul 21 21:30:17 astvs1 kernel:  [<c0102068>] do_notify_resume+0xa3/0x609
>> Jul 21 21:30:17 astvs1 kernel:  [<c011eaa6>] do_exit+0xb16/0xb23
>> Jul 21 21:30:17 astvs1 kernel:  [<c0134676>] pb_free+0x13/0x1b
>> Jul 21 21:30:17 astvs1 kernel:  [<c0152055>] 
>> __handle_mm_fault+0x505/0x946
>> Jul 21 21:30:17 astvs1 kernel:  [<c018cbe7>] proc_flush_task+0x53/0x56
>> Jul 21 21:30:17 astvs1 kernel:  [<c011de5b>] do_wait+0x93b/0x9df
>> Jul 21 21:30:17 astvs1 kernel:  [<c0110bdf>] do_page_fault+0x186/0x46c
>> Jul 21 21:30:17 astvs1 kernel:  [<c01029ee>] work_notifysig+0x13/0x19
>> Jul 21 21:30:17 astvs1 kernel: Code: 5f 5d c3 85 c0 53 89 c3 74 60 8b
>> 80 a0 00 00 00 83 bb 5c 01 00 00 20 8b
>> 40 20 75 0b 0f 0b 66 b8 ac 04 b8 f7 c6 29 c0 85 c0 74 0b <8b> 50 14 85
>> d2 74 04 89 d8 ff d2 8d 43 24 ba 2c f8
>>  2c c0 e8 42
>> Jul 21 21:30:17 astvs1 kernel: EIP: [<c017632a>] iput+0x28/0x69 SS:ESP
>> 0068:f3935cb8
>> Jul 21 21:30:17 astvs1 kernel: Fixing recursive fault but reboot is 
>> needed!
>> Jul 21 21:39:22 astvs1 kernel: Removing netfilter NETLINK layer.
>> Jul 21 21:56:54 astvs1 -- MARK --
>> Jul 21 22:16:55 astvs1 -- MARK --
>> Jul 21 22:18:47 astvs1 syslogd 1.4.1#18: restart.
>>
>> Thanks.
>>
>> JR
>>   
>
>
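
When comparing the oops from several HNs, it can help to reduce each log
to the faulting function and the list of functions in the call trace.
The sketch below is one rough way to do that in Python for syslog text
like the excerpt quoted above, including lines that a mail client has
wrapped or prefixed with quote markers. The function name
summarize_oops and the regular expressions are assumptions made for
illustration and only cover the 2.6.18-era i386 format shown here.

```python
import re
from typing import Dict

# Matches entries such as "[<c017451b>] dentry_iput+0x68/0x83".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Strips leading mail quote markers ("> ", ">> ") and whitespace.
QUOTE_RE = re.compile(r"^[>\s]*")
# Strips the syslog prefix, e.g. "Jul 21 21:30:17 astvs1 kernel: ".
PREFIX_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s*")


def summarize_oops(text: str) -> Dict[str, object]:
    """Return the faulting function and call-trace function names."""
    parts = []
    for raw in text.splitlines():
        line = PREFIX_RE.sub("", QUOTE_RE.sub("", raw)).strip()
        if line:
            parts.append(line)
    # Join everything so that frames split across wrapped lines match again.
    flat = " ".join(parts)

    fault = re.search(r"EIP is at (\w+)\+", flat)
    frames = []
    start = flat.find("Call Trace:")
    if start != -1:
        # Stop at the "Code:" dump so the trailing EIP line is not counted.
        end = flat.find("Code:", start)
        section = flat[start:end] if end != -1 else flat[start:]
        frames = [name for _addr, name in FRAME_RE.findall(section)]
    return {
        "faulting_function": fault.group(1) if fault else None,
        "call_trace": frames,
    }
```

Feeding it the log above should report iput as the faulting function,
with dentry_iput and dput_recursive as the first entries of the call
trace, which makes it easy to confirm that every HN fails on the same
path.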

-- 
Best wishes, Sergej Kandyla
Always smile at life, and life will always smile back at you!