[Users] consistent re-occurring kernel oops rebooting HN

JR Richardson jmr.richardson at gmail.com
Wed Jul 22 10:37:28 EDT 2009


Hi All,

I'm running OpenVZ on Debian Etch with the 2.6.18-openvz-13-1etch5-686
kernel.  I have 6 HNs with identical hardware and specs: 4 are in
production, running 16 VEs each, and 2 are standby HNs that I migrate
VEs to during maintenance.

Here is the issue: after 4 to 5 months of operation, 1 or 2 VEs max
out their kernel memory.  I stop the affected VE for a while (10
minutes or so) to clear out its kernel memory and then restart it, but
that does not seem to work, so I migrate the VE to a standby HN.  I
then plan a maintenance window, migrate all the remaining VEs off the
HN to the standby, and reboot the production HN.

On reboot I get the oops, and then have to manually power off the node
to get it back.  This happens consistently across all production HNs,
but only after the nodes have been up and running under load for
several weeks.  If I migrate the VEs and reboot a production HN within,
say, the first 30 days of operation, it reboots fine; the issue only
appears after an extended period of operation.  The only indication
that an HN is having an issue is that a VE maxes out its kernel memory
and can't release it.  The standby HNs, with no load on them, reboot
without an issue, even after 6 months.  I've tried to duplicate this in
a lab environment but just cannot load the HNs enough to cause the
error.
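
The "VE maxes out its kernel mem" symptom could be watched for with a
small script.  This is only a minimal sketch, assuming the usual
/proc/user_beancounters layout (uid, resource, held, maxheld, barrier,
limit, failcnt) and that "kernel mem" means the kmemsize counter.  The
function names, the 90% threshold, and the sample numbers are made up
for illustration and are not taken from the affected nodes.

```python
# Hypothetical sample in the usual /proc/user_beancounters layout.
# The numbers are invented for illustration only.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      115:  kmemsize       10900000   11000000   11055923   11377049         42
            lockedpages           0          0        256        256          0
      116:  kmemsize        2000000    3000000   11055923   11377049          0
            lockedpages           0          0        256        256          0
"""

FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text):
    """Parse beancounter text into {veid: {resource: {field: value}}}."""
    result = {}
    veid = None
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Version:" line, and the column header.
        if not parts or parts[0] in ("Version:", "uid"):
            continue
        # A first token like "115:" starts the section for a new VE.
        if parts[0].endswith(":"):
            veid = int(parts[0][:-1])
            parts = parts[1:]
        # Expect a resource name followed by five numeric columns.
        if veid is None or len(parts) != 6:
            continue
        name, values = parts[0], parts[1:]
        result.setdefault(veid, {})[name] = dict(zip(FIELDS, map(int, values)))
    return result


def kmem_pressure(counters, threshold=0.9):
    """Return VEIDs whose kmemsize has failures or is close to its barrier."""
    flagged = []
    for veid, resources in sorted(counters.items()):
        kmem = resources.get("kmemsize")
        if kmem is None or kmem["barrier"] == 0:
            continue
        if kmem["failcnt"] > 0 or kmem["held"] >= threshold * kmem["barrier"]:
            flagged.append(veid)
    return flagged


print(kmem_pressure(parse_beancounters(SAMPLE)))  # prints [115]
```

On a hardware node, the text would come from reading
/proc/user_beancounters as root instead of from SAMPLE.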

Any guidance on troubleshooting, fault isolation, or a kernel version
upgrade would be much appreciated.  Is any other info needed?  Should
this be a bug report?

Here is the error I get, same on each HN:

Jul 21 21:29:01 astvs1 kernel: VE: 115: stopped
Jul 21 21:29:19 astvs1 kernel: VE: 116: stopped
Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2110]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jul 21 21:30:01 astvs1 /USR/SBIN/CRON[2112]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jul 21 21:30:15 astvs1 shutdown[2222]: shutting down for system reboot
Jul 21 21:30:15 astvs1 init: Switching to runlevel: 6
Jul 21 21:30:17 astvs1 kernel: BUG: unable to handle kernel paging request at virtual address 70000059
Jul 21 21:30:17 astvs1 kernel:  printing eip:
Jul 21 21:30:17 astvs1 kernel: c017632a
Jul 21 21:30:17 astvs1 kernel: *pde = 00000000
Jul 21 21:30:17 astvs1 kernel: Oops: 0000 [#2]
Jul 21 21:30:17 astvs1 kernel: SMP
Jul 21 21:30:17 astvs1 kernel: Modules linked in: vzethdev vznetdev simfs vzrst ip_nat vzcpt ip_conntrack nfnetlink vzdquota vzmon vzdev xt_tcpudp xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables x_tables wctdm ipv6 dm_snapshot dm_mirror dm_mod zttranscode ztdummy zaptel crc_ccitt loop serio_raw psmouse i2c_i801 i2c_core evdev pcspkr rtc ext3 jbd mbcache sd_mod ata_piix libata scsi_mod ehci_hcd generic piix ide_core uhci_hcd usbcore tg3 processor
Jul 21 21:30:17 astvs1 kernel: CPU:    0, VCPU: 100.0
Jul 21 21:30:17 astvs1 kernel: EIP:    0060:[<c017632a>]    Not tainted VLI
Jul 21 21:30:17 astvs1 kernel: EFLAGS: 00210202   (2.6.18-openvz-13-1etch5-686 #1)
Jul 21 21:30:17 astvs1 kernel: EIP is at iput+0x28/0x69
Jul 21 21:30:17 astvs1 kernel: eax: 70000045   ebx: d8a177dc   ecx: d4331d74   edx: d4331d74
Jul 21 21:30:17 astvs1 kernel: esi: d8a177dc   edi: d41fc1a8   ebp: d4331d7c   esp: f3935cb8
Jul 21 21:30:17 astvs1 kernel: ds: 007b   es: 007b   ss: 0068
Jul 21 21:30:17 astvs1 kernel: Process asterisk (pid: 18888, veid: 100, ti=f3934000 task=cccda800 task.ti=f3934000)
Jul 21 21:30:17 astvs1 kernel: Stack: d41fc1a8 c017451b d41fc1a8 d4331d7c c0174676 d526b248 d4331d7c c019803b
Jul 21 21:30:17 astvs1 kernel:        d526b248 d4331d74 d526b248 d526b240 d526b248 dffa0600 c01b1f30 00000000
Jul 21 21:30:17 astvs1 kernel:        c020b201 00000000 00000000 d526b240 f393f300 00000008 00000000 c020b227
Jul 21 21:30:17 astvs1 kernel:  Call Trace:
Jul 21 21:30:17 astvs1 kernel:  [<c017451b>] dentry_iput+0x68/0x83
Jul 21 21:30:17 astvs1 kernel:  [<c0174676>] dput_recursive+0xfb/0x113
Jul 21 21:30:17 astvs1 kernel:  [<c019803b>] sysfs_remove_dir+0x116/0x127
Jul 21 21:30:17 astvs1 kernel:  [<c01b1f30>] kobject_del+0x8/0x10
Jul 21 21:30:17 astvs1 kernel:  [<c020b201>] class_device_del+0x103/0x121
Jul 21 21:30:17 astvs1 kernel:  [<c020b227>] class_device_unregister+0x8/0x10
Jul 21 21:30:17 astvs1 kernel:  [<c01fb284>] vcs_remove_devfs+0x17/0x31
Jul 21 21:30:17 astvs1 kernel:  [<c01ffc4f>] con_close+0x49/0x5b
Jul 21 21:30:17 astvs1 kernel:  [<c01f480d>] release_dev+0x1b4/0x600
Jul 21 21:30:17 astvs1 kernel:  [<c013150d>] ub_page_uncharge+0x3b/0x46
Jul 21 21:30:17 astvs1 kernel:  [<c01f4c68>] tty_release+0xf/0x18
Jul 21 21:30:17 astvs1 kernel:  [<c0161970>] __fput+0x90/0x147
Jul 21 21:30:17 astvs1 kernel:  [<c015f561>] filp_close+0x4e/0x54
Jul 21 21:30:17 astvs1 kernel:  [<c011d05c>] put_files_struct+0x65/0xa7
Jul 21 21:30:17 astvs1 kernel:  [<c011e4bf>] do_exit+0x52f/0xb23
Jul 21 21:30:17 astvs1 kernel:  [<c01184b8>] fairsched_schedule+0x30a/0x5a0
Jul 21 21:30:17 astvs1 kernel:  [<c027d98f>] schedule+0x353/0xd29
Jul 21 21:30:17 astvs1 kernel:  [<c0125193>] __dequeue_signal+0x160/0x16b
Jul 21 21:30:17 astvs1 kernel:  [<c011eb2c>] sys_exit_group+0x0/0xd
Jul 21 21:30:17 astvs1 kernel:  [<c0127595>] get_signal_to_deliver+0x3c2/0x3e9
Jul 21 21:30:17 astvs1 kernel:  [<c0102068>] do_notify_resume+0xa3/0x609
Jul 21 21:30:17 astvs1 kernel:  [<c011eaa6>] do_exit+0xb16/0xb23
Jul 21 21:30:17 astvs1 kernel:  [<c0134676>] pb_free+0x13/0x1b
Jul 21 21:30:17 astvs1 kernel:  [<c0152055>] __handle_mm_fault+0x505/0x946
Jul 21 21:30:17 astvs1 kernel:  [<c018cbe7>] proc_flush_task+0x53/0x56
Jul 21 21:30:17 astvs1 kernel:  [<c011de5b>] do_wait+0x93b/0x9df
Jul 21 21:30:17 astvs1 kernel:  [<c0110bdf>] do_page_fault+0x186/0x46c
Jul 21 21:30:17 astvs1 kernel:  [<c01029ee>] work_notifysig+0x13/0x19
Jul 21 21:30:17 astvs1 kernel: Code: 5f 5d c3 85 c0 53 89 c3 74 60 8b 80 a0 00 00 00 83 bb 5c 01 00 00 20 8b 40 20 75 0b 0f 0b 66 b8 ac 04 b8 f7 c6 29 c0 85 c0 74 0b <8b> 50 14 85 d2 74 04 89 d8 ff d2 8d 43 24 ba 2c f8 2c c0 e8 42
Jul 21 21:30:17 astvs1 kernel: EIP: [<c017632a>] iput+0x28/0x69 SS:ESP 0068:f3935cb8
Jul 21 21:30:17 astvs1 kernel: Fixing recursive fault but reboot is needed!
Jul 21 21:39:22 astvs1 kernel: Removing netfilter NETLINK layer.
Jul 21 21:56:54 astvs1 -- MARK --
Jul 21 22:16:55 astvs1 -- MARK --
Jul 21 22:18:47 astvs1 syslogd 1.4.1#18: restart.
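
For comparing the oops across HNs, the call chain can be pulled out of
the syslog text.  This is a minimal sketch, assuming trace entries in
the "[<address>] function+0xoffset/0xsize" form shown above.  The
function name extract_frames is hypothetical, and the sample is only a
short excerpt of the log, with the Code line cut short.

```python
import re

# Matches trace entries such as "[<c017451b>] dentry_iput+0x68/0x83".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

# Short excerpt of the log above, used as sample input.
SAMPLE_LOG = """\
Jul 21 21:30:17 astvs1 kernel:  Call Trace:
Jul 21 21:30:17 astvs1 kernel:  [<c017451b>] dentry_iput+0x68/0x83
Jul 21 21:30:17 astvs1 kernel:  [<c0174676>] dput_recursive+0xfb/0x113
Jul 21 21:30:17 astvs1 kernel:  [<c019803b>] sysfs_remove_dir+0x116/0x127
Jul 21 21:30:17 astvs1 kernel: Code: 5f 5d c3
"""


def extract_frames(log_text):
    """Return (function, offset) pairs from the first call trace in the log."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            # The first line that is not a trace entry ends the trace.
            break
        frames.append((match.group(2), int(match.group(3), 16)))
    return frames


for function, offset in extract_frames(SAMPLE_LOG):
    print(f"{function}+{offset:#x}")
```

If the function chain is the same on every HN, that is worth stating
in a bug report.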

Thanks.

JR
-- 
JR Richardson
Engineering for the Masses

