[Users] System hangs

Arnold Hendriks a.hendriks at b-lex.nl
Fri Feb 23 06:10:57 EST 2007


I have two OpenVZ servers, which both seem to like to hang 'in the 
morning'. I saw the problem with the SUSE kernel 
vmlinux-2.6.16.21-2.2-smp yesterday, and with the stable 
vmlinux-2.6.9-023stab040.1 today.

This time, I had some 'top's open, which reported a load of over 80. I 
can open an SSH connection to the system, but both local and remote 
logins hang. Interestingly, the VZs running on the machine still work: 
I can run commands in them and they report no uptime.

I run the vzs on reiserfs/ext3 partitions, mounted over AoE. I have the 
feeling the kernel might actually be hanging over NFS (I use NFS to 
share configuration and administrative files for openvz, but not for the 
VZs themselves: running VZ on NFS mounts didn't work), but restarting 
the NFS server doesn't help anything. I rebooted one of the hanging 
servers, and it could access the NFS just fine afterwards, so NFS itself 
seems to be up.
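
One way to check whether a mount still answers, without tying up yet 
another shell, is to run the stat call in a helper thread and give up 
after a timeout. The following is only a minimal sketch: the function 
name probe_path and the default timeout are hypothetical, and a thread 
stuck on an unresponsive NFS mount may linger until the call returns.

```python
import os
import sys
import threading


def probe_path(path, timeout=5.0):
    """Return "ok", "error" or "hung" for an os.stat() on `path`.

    "hung" means the call did not return within `timeout` seconds,
    which is what an unresponsive NFS mount would look like.
    """
    outcome = {}

    def worker():
        try:
            os.stat(path)
            outcome["status"] = "ok"
        except OSError:
            # The filesystem answered, but with an error (e.g. ENOENT).
            outcome["status"] = "error"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return "hung"
    return outcome["status"]


if __name__ == "__main__":
    # Usage: python probe.py /etc/vz /etc/sysconfig/vz-scripts
    for p in sys.argv[1:]:
        print(p, probe_path(p))
```

Running it against /etc/vz and /etc/sysconfig/vz-scripts while the host 
is in the hung state would show whether those paths are what blocks.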

syslog still worked, and I grabbed the following call stacks using 
sysrq. I noticed a lot of cron processes hanging with this trace:

Feb 23 11:31:36 web2 kernel: cron          S 0000807940a0 
000001011ae0c050     0  3018   6456  3022    3019  3014 (NOTLB)
Feb 23 11:31:36 web2 kernel:  0000010119c23df8 0000000000000006 
000001013f674f00 ffffffffa012706b
Feb 23 11:31:36 web2 kernel:  0000000000000000 ffffffff8017c62b 
ffffffff8054bc80 0000000000000000
Feb 23 11:31:36 web2 kernel:  000001011ae0c050 0000807940a0edd0
Feb 23 11:31:36 web2 kernel: Call Trace: [<ffffffffa012706b>] 
:simfs:sim_systemcall+0x6b/0x280
Feb 23 11:31:36 web2 kernel:  [<ffffffff8017c62b>] do_wp_page+0x44b/0x4c0
Feb 23 11:31:36 web2 kernel:  [<ffffffff8019ceb0>] pipe_wait+0xa0/0xf0
Feb 23 11:31:36 web2 kernel:  [<ffffffff8013b8a0>] 
autoremove_wake_function+0x0/0x30

....

None of the VZs should be running crontab as far as I know, so this 
should be the crontab of the underlying system. I'm not sure whether it 
should even be in a simfs function.

I think these are the crons that invoke vpsnetclean and vpsreboot (which 
also occur a lot in the process list), so this probably explains the 
load of over 80.
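
On Linux, tasks in uninterruptible sleep (state D) count toward the load 
average, while tasks in interruptible sleep (state S) do not. To see 
which processes are actually driving the load, the sysrq task dump can 
be tallied by name and state. This is a rough sketch that assumes the 
syslog format shown in this message; the function name 
count_task_states is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the first line of a task entry in a sysrq task dump, e.g.
#   "... kernel: vpsreboot     D 00008ad76e6a ..."
# Stack-dump and "Call Trace:" lines do not match this pattern.
TASK_LINE = re.compile(r"kernel:\s+(\S+)\s+([RSDTZX])\s+[0-9a-f]+")


def count_task_states(lines):
    """Count (task name, state) pairs found in sysrq task dump lines."""
    counts = Counter()
    for line in lines:
        match = TASK_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally.py < /var/log/messages
    for (name, state), n in count_task_states(sys.stdin).most_common():
        print(f"{n:5d}  {state}  {name}")
```

A large count of D entries for vpsreboot and vpsnetclean, rather than S 
entries for cron, would point at the NFS calls as the source of the 
load.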

The stack trace of vpsreboot:
Feb 23 11:31:44 web2 kernel: vpsreboot     D 00008ad76e6a 
0000010117e6e3d0     0  4316   4315                     (NOTLB)
Feb 23 11:31:44 web2 kernel:  0000010117db9928 0000000000000006 
0000000000000003 ffffffff8016f624
Feb 23 11:31:44 web2 kernel:  000001000000f380 0000000000000202 
ffffffff8054bc80 0000000000000000
Feb 23 11:31:44 web2 kernel:  0000010117e6e3d0 00008ad76e6abd1c
Feb 23 11:31:44 web2 kernel: Call Trace: [<ffffffff8016f624>] 
__alloc_collect_stats+0x54/0xc0
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b2ec1>] 
:sunrpc:rpc_sleep_on+0x41/0x70
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b3bd0>] 
:sunrpc:__rpc_execute+0x1f0/0x3c0
Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
autoremove_wake_function+0x0/0x30
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b36c7>] 
:sunrpc:rpc_init_task+0x157/0x1f0
Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
autoremove_wake_function+0x0/0x30
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00ae8d2>] 
:sunrpc:rpc_call_sync+0x82/0xc0
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fa41e>] 
:nfs:nfs3_rpc_wrapper+0x2e/0x90
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fabe9>] 
:nfs:nfs3_proc_access+0x109/0x180

and vpsnetclean:
Feb 23 11:31:44 web2 kernel: vpsnetclean   D 00008ad76e6a 
0000010117e5ccf0     0  4318   4317                     (NOTLB)
Feb 23 11:31:44 web2 kernel:  0000010117ed5928 0000000000000006 
00000101312e67a8 ffffffff8016f624
Feb 23 11:31:44 web2 kernel:  000002000000f380 0000000000000001 
ffffffff8054bc80 0000000000000000
Feb 23 11:31:44 web2 kernel:  0000010117e5ccf0 00008ad76e6a901c
Feb 23 11:31:44 web2 kernel: Call Trace: [<ffffffff8016f624>] 
__alloc_collect_stats+0x54/0xc0
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b2ec1>] 
:sunrpc:rpc_sleep_on+0x41/0x70
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b3bd0>] 
:sunrpc:__rpc_execute+0x1f0/0x3c0
Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
autoremove_wake_function+0x0/0x30
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00b36c7>] 
:sunrpc:rpc_init_task+0x157/0x1f0
Feb 23 11:31:44 web2 kernel:  [<ffffffff8013b8a0>] 
autoremove_wake_function+0x0/0x30
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00ae8d2>] 
:sunrpc:rpc_call_sync+0x82/0xc0
Feb 23 11:31:44 web2 kernel:  [<ffffffffa00fa41e>] 
:nfs:nfs3_rpc_wrapper+0x2e/0x90

Any idea what I can do to investigate this further? Could putting 
/etc/vz and /etc/sysconfig/vz-scripts on NFS be the source of the 
problems?
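
To confirm which filesystem actually backs a given path, the mount 
table in /proc/mounts (device, mount point, type, options) can be 
matched against it. This is a small sketch, not an OpenVZ tool: the 
helper name backing_mount is hypothetical, and it ignores the octal 
escapes /proc/mounts uses for spaces in mount points.

```python
import os


def backing_mount(path, mounts_text):
    """Return (mount point, fs type) of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins; for equal mount points, the later entry wins.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        inside = (
            mountpoint == "/"
            or path == mountpoint
            or path.startswith(mountpoint.rstrip("/") + "/")
        )
        if inside and (best is None or len(mountpoint) >= len(best[0])):
            best = (mountpoint, fstype)
    return best


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        table = handle.read()
    for p in ("/etc/vz", "/etc/sysconfig/vz-scripts"):
        print(p, backing_mount(p, table))
```

If either path reports an nfs type, every tool that reads its 
configuration from there depends on the NFS server answering.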



