[Users] kernel panic on 2.6.18-92.1.13.el5.028stab059.6PAE + aoe.ko

Mark Sutton mark at fubra.com
Fri Feb 6 11:53:28 EST 2009


Hello

We've been running the ovz kernel for a while, but we're experiencing  
random kernel panics on a moderately loaded server.

The server runs CentOS 5.2 with the 2.6.18-92.1.13.el5.028stab059.6PAE
kernel and the ATA-over-Ethernet (aoe) driver, version aoe6-69, from
http://support.coraid.com/support/linux/

The server in question ran memtest for over 48 hours without reporting
any errors, so I don't suspect faulty memory.

I have several servers running the 'normal' CentOS kernel (no OpenVZ)
with the aoe driver accessing storage heavily, and these have been
stable for months. As soon as the OpenVZ kernel is booted, the machine
starts crashing.

I suspect there is a problem between openvz kernel patches and the aoe  
driver, but attempts to track down the problem with the aoe driver  
maintainer have been unsuccessful so far...

Here are his last comments in case it helps any:

> When I looked at your last trace, it appeared that the list of free
> pages was the one containing the list corruption that triggered the
> BUG inside of list_del.  Backtraces are often incomplete, and here is
> an expanded version of what I think happened between
> get_page_from_freelist and the BUG in list_del.
>
>  mm/page_alloc.c:	get_page_from_freelist calls buffered_rmqueue
>
>  mm/page_alloc.c:	buffered_rmqueue calls list_del(&page->lru);
>
>  lib/list_debug.c:	list_del finds that ...
> 			page->lru->prev->next == page->lru, but
> 			page->lru->next->prev != page->lru
>
> It could mean that something is keeping a pointer to a page on the
> free list after it has been freed.  That's why I thought the patches
> related to the page->_count would be relevant: the aoe driver has to
> manipulate the count so that the put_page done by the network layer
> doesn't free the page (associated with an AoE read or write) before
> the block layer is done with it.
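
To make the check described above concrete, here is a minimal userspace sketch in C of the kind of neighbour-consistency test that the list debugging code performs before linking or unlinking an entry. It is only an illustration modeled on the behaviour described in the quote and on the message in the trace; it is not the kernel source. The names checked_list_add, checked_list_del and stale_pointer_is_detected are hypothetical, and the functions return false where the kernel would hit BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link new_entry between prev and next, but only if the two
 * neighbours still point at each other. */
static bool checked_list_add(struct list_head *new_entry,
                             struct list_head *prev,
                             struct list_head *next)
{
    assert(new_entry && prev && next);
    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be %p, but was %p\n",
                (void *)prev, (void *)next->prev);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be %p, but was %p\n",
                (void *)next, (void *)prev->next);
        return false;
    }
    next->prev = new_entry;
    new_entry->next = next;
    new_entry->prev = prev;
    prev->next = new_entry;
    return true;
}

/* Unlink entry, but only if both neighbours still point back at it. */
static bool checked_list_del(struct list_head *entry)
{
    assert(entry);
    if (entry->prev->next != entry || entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption on entry %p\n", (void *)entry);
        return false;
    }
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = entry->prev = NULL;
    return true;
}

/* Simulate a stale pointer: page_a sits on the free list, and a buggy
 * holder of an old pointer re-links it onto another list. The next
 * checked add at the head of the free list notices the mismatch. */
static bool stale_pointer_is_detected(void)
{
    struct list_head free_list, other_list, page_a, page_b;

    list_init(&free_list);
    list_init(&other_list);
    if (!checked_list_add(&page_a, &free_list, free_list.next))
        return false;

    /* Unchecked writes through the stale pointer. */
    page_a.next = &other_list;
    page_a.prev = &other_list;
    other_list.next = &page_a;
    other_list.prev = &page_a;

    /* free_list.next is still &page_a, but page_a.prev no longer
     * points back at free_list, so this add must be refused. */
    return !checked_list_add(&page_b, &free_list, free_list.next);
}
```

Calling stale_pointer_is_detected() from a small main() returns true and prints a message of the same shape as the first line of the trace. The point of the sketch is that the check fires at the next innocent list operation, which may be far away from the code that actually wrote through the stale pointer.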


I found http://bugs.centos.org/view.php?id=3192, which describes a
similar error in the CentOS kernel; for that user the problem appeared
to go away in -92.1.17. I therefore upgraded to
2.6.18-92.1.18.el5.028stab060.2PAE yesterday, but the server has
crashed two more times since.

I have lots and lots of crash traces, but they all look pretty much  
the same so I will just post one below. If anyone has any ideas what  
might be going on here I'd be very interested. If any more info is  
needed, just let me know.

There are some notes here about the setup process I used for this  
server: http://code.fubra.com/wiki/AoeOpenvzTesting

Thanks!

Mark

 >>TRACE FOLLOWS>>
list_add corruption. next->prev should be c06cdb0c, but was c14c6170
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:26!
invalid opcode: 0000 [#1]
SMP
last sysfs file:
Modules linked in: nfs(U) lockd(U) nfs_acl(U) vzethdev(U) vznetdev(U)  
simfs(U) vzrst(U) ip_nat(U) vzcpt(U) ip_conntrack(U) nfnetlink(U)  
ipip(U) tunnel4(U) tun(U) vzmon(U) xt_tcpudp(U) xt_length(U)  
ipt_ttl(U) xt_tcpmss(U) ipt_TCPMSS(U) iptable_mangle(U)  
iptable_filter(U) xt_multiport(U) xt_limit(U) ipt_tos(U) ipt_REJECT(U)  
ip_tables(U) x_tables(U) aoe(U) vzdquota(U) autofs4(U) hidp(U)  
rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) 8021q(U) ipv6(U)  
xfrm_nalgo(U) crypto_api(U) vzdev(U) dm_multipath(U) video(U) sbs(U)  
backlight(U) i2c_ec(U) container(U) button(U) battery(U) asus_acpi(U)  
ac(U) parport_pc(U) lp(U) parport(U) e7xxx_edac(U) edac_mc(U) e100(U)  
i2c_i801(U) serio_raw(U) eepro100(U) i2c_core(U) mii(U) e1000(U)  
pcspkr(U) e1000e(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U)  
ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U)  
ohci_hcd(U) uhci_hcd(U)
CPU:    0, VCPU: 0.0
EIP:    0060:[<c04db0be>]    Tainted:  P      VLI
EFLAGS: 00010046   (2.6.18-92.1.18.el5.028stab060.2PAE #1 028stab060)
EIP is at __list_add+0x1c/0x58
eax: 00000048   ebx: c14a3c28   ecx: f517dea0   edx: c063572e
esi: c06cdb0c   edi: c14ead1c   ebp: c06cdb0c   esp: f517de9c
ds: 007b   es: 007b   ss: 0068
Process aoe_ktio (pid: 8451, veid: 0, ti=f517c000 task=f7c3a670 task.ti=f517c000)
Stack: c063572e c06cdb0c c14c6170 c14ead04 c06cdb00 00000046 c0457b85 c06cda80
       f7f4afc0 e1c863a0 c14ead04 00000000 c0456560 e0c64380 e1c863a0 e0c64300
       c045dd33 00000000 f7f4afc0 00000000 e0c64300 00001000 c045de6e 00000000
Call Trace:
[<c0457b85>] free_hot_cold_page+0x110/0x13a
[<c0456560>] mempool_free+0x5f/0x63
[<c045dd33>] bounce_end_io+0x51/0x80
[<c045de6e>] bounce_end_io_write+0x0/0x1b
[<c045de84>] bounce_end_io_write+0x16/0x1b
[<c0476c05>] bio_endio+0x50/0x55
[<c04c9e20>] __end_that_request_first+0x185/0x47c
[<f904f542>] aoe_end_request+0x46/0x78 [aoe]
[<f9050def>] ktio+0x3dc/0x475 [aoe]
[<f9050730>] kthread+0x87/0xdf [aoe]
[<c041acd2>] default_wake_function+0x0/0xc
[<f90506a9>] kthread+0x0/0xdf [aoe]
[<c04349db>] kthread+0xc0/0xed
[<c043491b>] kthread+0x0/0xed
[<c0607fcb>] kernel_thread_helper+0x7/0x10
=======================
Code: c7 43 04 00 02 20 00 c7 03 00 01 10 00 5b c3 57 89 c7 56 89 d6  
53 8b 41 04 89 cb 39 d0 74 1a 50 52 68 2e 57 63 c0 e8 a3 7f f4 ff <0f>  
0b 66 b8 1a 00 b8 e0 56 63 c0 83 c4 0c 8b 06 39 d8 74 1a 50
EIP: [<c04db0be>] __list_add+0x1c/0x58 SS:ESP 0068:f517de9c
Kernel panic - not syncing: Fatal exception
BUG: warning at arch/i386/kernel/smp.c:543/smp_call_function() (Tainted:  P     )
[<c04115bc>] stop_this_cpu+0x0/0x33
[<c04112ef>] smp_call_function+0x57/0xc3
[<c0423079>] printk+0x18/0x8e
[<c041136e>] smp_send_stop+0x13/0x1c
[<c04222cf>] panic+0x4c/0x165
[<c04056dc>] die+0x249/0x260
[<c0405e16>] do_invalid_op+0x0/0x9d
[<c0405ea7>] do_invalid_op+0x91/0x9d
[<c04db0be>] __list_add+0x1c/0x58
[<c0422833>] wake_up_klogd+0x2b/0x2d
[<c042287c>] release_console_sem+0x47/0x4a
[<c046e07b>] free_block+0xd0/0xef
[<c046e1a5>] cache_flusharray+0x76/0xa2
[<c0607dfb>] error_code+0x4f/0x54
[<c04db0be>] __list_add+0x1c/0x58
[<c0457b85>] free_hot_cold_page+0x110/0x13a
[<c0456560>] mempool_free+0x5f/0x63
[<c045dd33>] bounce_end_io+0x51/0x80
[<c045de6e>] bounce_end_io_write+0x0/0x1b
[<c045de84>] bounce_end_io_write+0x16/0x1b
[<c0476c05>] bio_endio+0x50/0x55
[<c04c9e20>] __end_that_request_first+0x185/0x47c
[<f904f542>] aoe_end_request+0x46/0x78 [aoe]
[<f9050def>] ktio+0x3dc/0x475 [aoe]
[<f9050730>] kthread+0x87/0xdf [aoe]
[<c041acd2>] default_wake_function+0x0/0xc
[<f90506a9>] kthread+0x0/0xdf [aoe]
[<c04349db>] kthread+0xc0/0xed
[<c043491b>] kthread+0x0/0xed
[<c0607fcb>] kernel_thread_helper+0x7/0x10
=======================
