[Users] consistent re-occurring kernel oops rebooting HN

Sergej Kandyla sk.paix at gmail.com
Thu Jul 23 07:34:11 EDT 2009


JR Richardson wrote:
> Hi All,
>
> I'm running OpenVZ on Debian Etch with the 2.6.18-openvz-13-1etch5-686
> kernel.  I have 6 HNs with identical hardware and specs.  I run 16 VEs
> on each HN; 4 HNs are in production and 2 are standby HNs to migrate
> VEs to during maintenance.
>
> Here is the issue: after 4 to 5 months of operation, 1 or 2 VEs max
> out their kernel memory.  I stop the VE for a while (10 minutes or so)
> to clear its kernel memory and then restart it, but that does not
> seem to work, so I migrate the VE to the standby HN.  I then plan a
> maintenance window, migrate all the VEs off the HN to the standby,
> and reboot the production HN.
>
> During that reboot I get the oops, then have to manually power off the
> node to get it back.  This happens consistently across all production
> HNs, but only after the nodes have been up and running under load for
> several weeks.  If I migrate and reboot a production HN within, say,
> the first 30 days of operation, it seems to do well; the issue only
> appears after an extended period of operation.  The only indication
> that an HN is having a problem is that a VE maxes out its kernel memory
> and can't release it.  The standby HNs, with no load on them, can
> reboot without an issue, even after 6 months.  I've tried to duplicate
> this in a lab environment but just cannot load the HNs enough to cause
> the error.
>
> Any guidance on troubleshooting, fault isolation, or a kernel version
> upgrade would be much appreciated.

Hi.
I wrote a short howto on kernel troubleshooting with crash and kdump:
http://paix.org.ua/linux/crashdebug.html

The main idea is to save a kernel core dump and then analyse it with the
crash tool. However, I don't know where to download the kernel-debuginfo
and kernel-debuginfo-common packages for the OpenVZ kernels.
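
Until a full dump is available, the oops text in syslog already names the
faulting function and the call chain. A minimal Python sketch for pulling
those out of the log, assuming the line format shown in the report below
(the function names parse_frames and faulting_function are only
illustrative, and the sample is a few lines copied from that report):

```python
import re

# Matches frames such as "[<c017451b>] dentry_iput+0x68/0x83".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frames(log_text):
    """Return (address, symbol, offset, size) for every frame in log_text."""
    return [
        (int(addr, 16), sym, int(off, 16), int(size, 16))
        for addr, sym, off, size in FRAME_RE.findall(log_text)
    ]


def faulting_function(log_text):
    """Return (symbol, offset) from the 'EIP is at' line, or None if absent."""
    m = re.search(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x[0-9a-f]+", log_text)
    return (m.group(1), int(m.group(2), 16)) if m else None


# A few lines taken from the oops in the original report.
sample = """\
kernel: EIP is at iput+0x28/0x69
kernel:  Call Trace:
kernel:  [<c017451b>] dentry_iput+0x68/0x83
kernel:  [<c0174676>] dput_recursive+0xfb/0x113
kernel:  [<c019803b>] sysfs_remove_dir+0x116/0x127
"""

if __name__ == "__main__":
    print(faulting_function(sample))
    for addr, sym, off, size in parse_frames(sample):
        print(f"{addr:#010x} {sym}+{off:#x}/{size:#x}")
```

Comparing the extracted frames across the affected HNs is a cheap way to
confirm that every node dies in the same place before filing a bug.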

This page may also be helpful:
http://wiki.openvz.org/When_you_have_an_oops
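
Since the only early symptom reported is a VE running out of kernel memory,
it may also be worth logging kmemsize from /proc/user_beancounters over time
and attaching that to a bug report. A rough Python sketch, assuming the
usual column layout (uid, resource, held, maxheld, barrier, limit, failcnt);
the function names, the 0.9 threshold, and the numbers in SAMPLE are made up
for illustration and are not taken from the affected nodes:

```python
FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text):
    """Map each container id to {resource: {held, maxheld, ...}}."""
    result, current = {}, None
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] in ("Version:", "uid"):
            continue  # skip blank lines and the two header lines
        if parts[0].endswith(":"):
            # A line starting with "NNN:" opens a new container section.
            current = int(parts[0][:-1])
            result[current] = {}
            parts = parts[1:]
        if current is None or len(parts) != 6:
            continue
        result[current][parts[0]] = dict(zip(FIELDS, map(int, parts[1:])))
    return result


def near_limit(text, threshold=0.9):
    """List (veid, held, barrier, failcnt) for VEs close to the kmemsize barrier."""
    rows = []
    for veid, resources in sorted(parse_beancounters(text).items()):
        kmem = resources.get("kmemsize")
        if kmem is None:
            continue
        close = kmem["barrier"] and kmem["held"] >= threshold * kmem["barrier"]
        if close or kmem["failcnt"] > 0:
            rows.append((veid, kmem["held"], kmem["barrier"], kmem["failcnt"]))
    return rows


# Invented example data in the assumed /proc/user_beancounters layout.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      100:  kmemsize       10500000   11000000   11055923   11377049         42
            lockedpages           0          0        256        256          0
      115:  kmemsize        2718563    2943498   11055923   11377049          0
            lockedpages           0          0        256        256          0
"""

if __name__ == "__main__":
    for veid, held, barrier, failcnt in near_limit(SAMPLE):
        print(f"VE {veid}: kmemsize held={held} barrier={barrier} failcnt={failcnt}")
```

On a real HN the text would come from reading /proc/user_beancounters as
root instead of SAMPLE; run from cron, this would show how fast kmemsize
grows in the weeks before a VE gets stuck.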


> Any other info needed?  Should this be a bug report?
>
> Here is the error I get, same on each HN:
>
> Jul 21 21:29:01 astvs1 kernel: VE: 115: stopped
> Jul 21 21:29:19 astvs1 kernel: VE: 116: stopped
> Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2110]: (root) CMD
> (/usr/share/vzctl/scripts/vpsreboot)
> Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2112]: (root) CMD
> (/usr/share/vzctl/scripts/vpsnetclean)
> Jul 21 21:30:15 astvs1 shutdown[2222]: shutting down for system reboot
> Jul 21 21:30:15 astvs1 init: Switching to runlevel: 6
> Jul 21 21:30:17 astvs1 kernel: BUG: unable to handle kernel paging
> request at virtual address 70000059
> Jul 21 21:30:17 astvs1 kernel:  printing eip:
> Jul 21 21:30:17 astvs1 kernel: c017632a
> Jul 21 21:30:17 astvs1 kernel: *pde = 00000000
> Jul 21 21:30:17 astvs1 kernel: Oops: 0000 [#2]
> Jul 21 21:30:17 astvs1 kernel: SMP
> Jul 21 21:30:17 astvs1 kernel: Modules linked in: vzethdev vznetdev
> simfs vzrst ip_nat vzcpt ip_conntrack nfnetlink vzdquota vzmon vzdev
> xt_tcpudp xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle
> iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables
> x_tables wctdm ipv6 dm_snapshot dm_mirror dm_mod zttranscode ztdummy
> zaptel crc_ccitt loop serio_raw psmouse i2c_i801 i2c_core evdev pcspkr
> rtc ext3 jbd mbcache sd_mod ata_piix libata scsi_mod ehci_hcd generic
> piix ide_core uhci_hcd usbcore tg3 processor
> Jul 21 21:30:17 astvs1 kernel: CPU:    0, VCPU: 100.0
> Jul 21 21:30:17 astvs1 kernel: EIP:    0060:[<c017632a>]    Not tainted VLI
> Jul 21 21:30:17 astvs1 kernel: EFLAGS: 00210202
> (2.6.18-openvz-13-1etch5-686 #1)
> Jul 21 21:30:17 astvs1 kernel: EIP is at iput+0x28/0x69
> Jul 21 21:30:17 astvs1 kernel: eax: 70000045   ebx: d8a177dc   ecx:
> d4331d74   edx: d4331d74
> Jul 21 21:30:17 astvs1 kernel: esi: d8a177dc   edi: d41fc1a8   ebp:
> d4331d7c   esp: f3935cb8
> Jul 21 21:30:17 astvs1 kernel: ds: 007b   es: 007b   ss: 0068
> Jul 21 21:30:17 astvs1 kernel: Process asterisk (pid: 18888, veid:
> 100, ti=f3934000 task=cccda800 task.ti=f3934000)
> Jul 21 21:30:17 astvs1 kernel: Stack: d41fc1a8 c017451b d41fc1a8
> d4331d7c c0174676 d526b248 d4331d7c c019803b
>
> Jul 21 21:30:17 astvs1 kernel:        d526b248 d4331d74 d526b248
> d526b240 d526b248 dffa0600 c01b1f30 00000000
>
> Jul 21 21:30:17 astvs1 kernel:        c020b201 00000000 00000000
> d526b240 f393f300 00000008 00000000 c020b227
>
> Jul 21 21:30:17 astvs1 kernel:  Call Trace:
> Jul 21 21:30:17 astvs1 kernel:  [<c017451b>] dentry_iput+0x68/0x83
> Jul 21 21:30:17 astvs1 kernel:  [<c0174676>] dput_recursive+0xfb/0x113
> Jul 21 21:30:17 astvs1 kernel:  [<c019803b>] sysfs_remove_dir+0x116/0x127
> Jul 21 21:30:17 astvs1 kernel:  [<c01b1f30>] kobject_del+0x8/0x10
> Jul 21 21:30:17 astvs1 kernel:  [<c020b201>] class_device_del+0x103/0x121
> Jul 21 21:30:17 astvs1 kernel:  [<c020b227>] class_device_unregister+0x8/0x10
> Jul 21 21:30:17 astvs1 kernel:  [<c01fb284>] vcs_remove_devfs+0x17/0x31
> Jul 21 21:30:17 astvs1 kernel:  [<c01ffc4f>] con_close+0x49/0x5b
> Jul 21 21:30:17 astvs1 kernel:  [<c01f480d>] release_dev+0x1b4/0x600
> Jul 21 21:30:17 astvs1 kernel:  [<c013150d>] ub_page_uncharge+0x3b/0x46
> Jul 21 21:30:17 astvs1 kernel:  [<c01f4c68>] tty_release+0xf/0x18
> Jul 21 21:30:17 astvs1 kernel:  [<c0161970>] __fput+0x90/0x147
> Jul 21 21:30:17 astvs1 kernel:  [<c015f561>] filp_close+0x4e/0x54
> Jul 21 21:30:17 astvs1 kernel:  [<c011d05c>] put_files_struct+0x65/0xa7
> Jul 21 21:30:17 astvs1 kernel:  [<c011e4bf>] do_exit+0x52f/0xb23
> Jul 21 21:30:17 astvs1 kernel:  [<c01184b8>] fairsched_schedule+0x30a/0x5a0
> Jul 21 21:30:17 astvs1 kernel:  [<c027d98f>] schedule+0x353/0xd29
> Jul 21 21:30:17 astvs1 kernel:  [<c0125193>] __dequeue_signal+0x160/0x16b
> Jul 21 21:30:17 astvs1 kernel:  [<c011eb2c>] sys_exit_group+0x0/0xd
> Jul 21 21:30:17 astvs1 kernel:  [<c0127595>] get_signal_to_deliver+0x3c2/0x3e9
> Jul 21 21:30:17 astvs1 kernel:  [<c0102068>] do_notify_resume+0xa3/0x609
> Jul 21 21:30:17 astvs1 kernel:  [<c011eaa6>] do_exit+0xb16/0xb23
> Jul 21 21:30:17 astvs1 kernel:  [<c0134676>] pb_free+0x13/0x1b
> Jul 21 21:30:17 astvs1 kernel:  [<c0152055>] __handle_mm_fault+0x505/0x946
> Jul 21 21:30:17 astvs1 kernel:  [<c018cbe7>] proc_flush_task+0x53/0x56
> Jul 21 21:30:17 astvs1 kernel:  [<c011de5b>] do_wait+0x93b/0x9df
> Jul 21 21:30:17 astvs1 kernel:  [<c0110bdf>] do_page_fault+0x186/0x46c
> Jul 21 21:30:17 astvs1 kernel:  [<c01029ee>] work_notifysig+0x13/0x19
> Jul 21 21:30:17 astvs1 kernel: Code: 5f 5d c3 85 c0 53 89 c3 74 60 8b
> 80 a0 00 00 00 83 bb 5c 01 00 00 20 8b
> 40 20 75 0b 0f 0b 66 b8 ac 04 b8 f7 c6 29 c0 85 c0 74 0b <8b> 50 14 85
> d2 74 04 89 d8 ff d2 8d 43 24 ba 2c f8
>  2c c0 e8 42
> Jul 21 21:30:17 astvs1 kernel: EIP: [<c017632a>] iput+0x28/0x69 SS:ESP
> 0068:f3935cb8
> Jul 21 21:30:17 astvs1 kernel: Fixing recursive fault but reboot is needed!
> Jul 21 21:39:22 astvs1 kernel: Removing netfilter NETLINK layer.
> Jul 21 21:56:54 astvs1 -- MARK --
> Jul 21 22:16:55 astvs1 -- MARK --
> Jul 21 22:18:47 astvs1 syslogd 1.4.1#18: restart.
>
> Thanks.
>
> JR
>   


-- 
Best wishes, Sergej Kandyla



