[Users] System hangs

Kirill Korotaev dev at sw.ru
Mon Feb 26 05:13:21 EST 2007


Arnold,

> I have two openvz servers, which both seem to like to hang 'in the 
> morning'. I've seen the problem with both the suse kernel 
> vmlinux-2.6.16.21-2.2-smp yesterday, and the stable 
> vmlinux-2.6.9-023stab040.1 today.
> 
> This time, I had some 'top's open, which report a load over 80. I can 
> SSH connect to the system, but both local and remote logins hang. 

> Interestingly, the VZs running on the machine still work, I can run 
> commands in them and they report no uptime.
Sorry, what do you mean by this?
Your system hangs, but the VEs still work and you can log in to them? Or
something else?

> I run the vzs on reiserfs/ext3 partitions, mounted over AoE. I have the 
> feeling the kernel might actually be hanging over NFS (I use NFS to 
> share configuration and administrative files for openvz, but not for the 
> VZs themselves: running VZ on NFS mounts didn't work), but restarting 
> the NFS server doesn't help anything. I rebooted one of the hanging 
> servers, and it could access the NFS just fine afterwards, so NFS itself 
> seems to be up.
So did the NFS server actually hang, or did you reboot it just in case?
AFAIK NFS clients do not always survive an NFS server reboot :/
How do you mount your NFS share? With soft mounts?

> syslog still worked, and I grabbed the following callstacks using sysrq 
> - I noticed a lot of cron processes hanging with this trace:
> 
> Feb 23 11:31:36 web2 kernel: cron          S 0000807940a0 
> 000001011ae0c050     0  3018   6456  3022    3019  3014 (NOTLB)
> Feb 23 11:31:36 web2 kernel:  0000010119c23df8 0000000000000006 
> 000001013f674f00 ffffffffa012706b
> Feb 23 11:31:36 web2 kernel:  0000000000000000 ffffffff8017c62b 
> ffffffff8054bc80 0000000000000000
> Feb 23 11:31:36 web2 kernel:  000001011ae0c050 0000807940a0edd0
> Feb 23 11:31:36 web2 kernel: Call Trace: [<ffffffffa012706b>] 
> :simfs:sim_systemcall+0x6b/0x280
> Feb 23 11:31:36 web2 kernel:  [<ffffffff8017c62b>] do_wp_page+0x44b/0x4c0
> Feb 23 11:31:36 web2 kernel:  [<ffffffff8019ceb0>] pipe_wait+0xa0/0xf0
> Feb 23 11:31:36 web2 kernel:  [<ffffffff8013b8a0>] 
> autoremove_wake_function+0x0/0x30
This call trace looks incomplete. Anyway, it looks like cron is simply
sleeping on the read end of a pipe, i.e. waiting for its child
to write something to the pipe.

Can you press Alt-SysRq-T and Alt-SysRq-P and provide their full output?
Also, did you compile the kernel yourself, or is it the binary one from
openvz.org?

> None of the VZs should be running crontab as far as I know, so this 
> should be the crontab of the underlying system. I'm not sure if it 
> should even be in a simfs function?
VEs can run crons, it's fine.

> I think these are the crons that invoke vpsnetclean and vpsreboot (which 
> also occur a lot in the process list), so this probably explains the >80 
> load.
Which load are you talking about? The load average shown by top?
The load average doesn't account for processes in S state, so your cron
processes don't influence loadavg. It counts only processes in R and D state.
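
To see which processes actually contribute to the load average, you can
count processes per state from /proc. The following is only a rough sketch
in Python, not an OpenVZ tool: the function names are made up for
illustration, and it counts processes rather than individual threads, so
the numbers only approximate what the kernel uses for loadavg.

```python
import os
from collections import Counter


def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes per state letter (R, S, D, ...) under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited while we were reading it
    return counts


if __name__ == "__main__":
    counts = count_states()
    print("R (runnable):", counts["R"])
    print("D (uninterruptible sleep):", counts["D"])
    print("S (interruptible sleep, not in loadavg):", counts["S"])
```

A large D count together with a small R count would fit a load average
driven by tasks blocked on I/O rather than by CPU usage.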

> 
> The stack trace of vpsreboot:
> Feb 23 11:31:44 web2 kernel: vpsreboot     D 00008ad76e6a 
> 0000010117e6e3d0     0  4316   4315                     (NOTLB)
> Feb 23 11:31:44 web2 kernel:  0000010117db9928 0000000000000006 
> 0000000000000003 ffffffff8016f624
> Feb 23 11:31:44 web2 kernel:  000001000000f380 0000000000000202 
> ffffffff8054bc80 0000000000000000
> Feb 23 11:31:44 web2 kernel:  0000010117e6e3d0 00008ad76e6abd1c
> Feb 23 11:31:44 web2 kernel: Call Trace: [<ffffffff8016f624>] 
> __alloc_collect_stats+0x54/0xc0
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b2ec1>] 
> :sunrpc:rpc_sleep_on+0x41/0x70
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b3bd0>] 
> :sunrpc:__rpc_execute+0x1f0/0x3c0
> Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
> autoremove_wake_function+0x0/0x30
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b36c7>] 
> :sunrpc:rpc_init_task+0x157/0x1f0
> Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
> autoremove_wake_function+0x0/0x30
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00ae8d2>] 
> :sunrpc:rpc_call_sync+0x82/0xc0
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fa41e>] 
> :nfs:nfs3_rpc_wrapper+0x2e/0x90
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fabe9>] 
> :nfs:nfs3_proc_access+0x109/0x180
And here is such a process: it is in D state, sleeping in NFS code.
So it looks like an NFS bug: the client didn't recover
after the NFS server reboot.


> and vpsnetclean:
> Feb 23 11:31:44 web2 kernel: vpsnetclean   D 00008ad76e6a 
> 0000010117e5ccf0     0  4318   4317                     (NOTLB)
> Feb 23 11:31:44 web2 kernel:  0000010117ed5928 0000000000000006 
> 00000101312e67a8 ffffffff8016f624
> Feb 23 11:31:44 web2 kernel:  000002000000f380 0000000000000001 
> ffffffff8054bc80 0000000000000000
> Feb 23 11:31:44 web2 kernel:  0000010117e5ccf0 00008ad76e6a901c
> Feb 23 11:31:44 web2 kernel: Call Trace: [<ffffffff8016f624>] 
> __alloc_collect_stats+0x54/0xc0
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b2ec1>] 
> :sunrpc:rpc_sleep_on+0x41/0x70
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b3bd0>] 
> :sunrpc:__rpc_execute+0x1f0/0x3c0
> Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
> autoremove_wake_function+0x0/0x30
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b36c7>] 
> :sunrpc:rpc_init_task+0x157/0x1f0
> Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
> autoremove_wake_function+0x0/0x30
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00ae8d2>] 
> :sunrpc:rpc_call_sync+0x82/0xc0
> Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fa41e>] 
> :nfs:nfs3_rpc_wrapper+0x2e/0x90
the same.

> Any idea what I can do to investigate this further? Could putting 
> /etc/vz and /etc/sysconfig/vz-scripts on NFS be the source of the problems ?
Looks like it is :/ You can try mounting NFS with the soft or intr option.
With soft, NFS operations fail with an error after the retry timeout
instead of hanging forever; with intr, processes blocked on NFS can at
least be interrupted by signals.
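
To check how the NFS shares are currently mounted, you can look at the
options column in /proc/mounts. The following is a minimal sketch in
Python; the function name and the sample server and path names are
hypothetical, and it assumes the kernel lists "soft" among the mount
options for soft mounts.

```python
def hard_nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems mounted without 'soft'.

    mounts_text is the contents of /proc/mounts, one mount per line:
    device, mount point, filesystem type, comma-separated options, ...
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue
        if "soft" not in options.split(","):
            result.append(mount_point)
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point in hard_nfs_mounts(f.read()):
            print("hard NFS mount:", mount_point)
```

Any mount point it prints can block processes indefinitely while the NFS
server is unreachable.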

Thanks,
Kirill


