[Users] Need help with hanging servers

Solar Designer solar at openwall.com
Wed Jul 7 20:26:41 EDT 2010


On Tue, Jul 06, 2010 at 10:33:36AM -0500, Brian Moon wrote:
> On no regular schedule, the two production servers will hang. And it is 
> a weird hang. They still respond to ping. And TCP connections answer 
> (connect) but don't respond.

Most of the time (say, 70%), this indicates a disk subsystem problem
(could be anything disk-related: driver, controller, cable, disk).
If the kernel can't (re-)read a portion of program code (say, a
previously discarded or not yet loaded memory page for /usr/sbin/sshd)
from disk, the process is likely to block on the attempted read for a
long time.  The same goes for previously swapped-out data pages (but
those are arguably less common than discarded code pages).

Less commonly (say, 20%), this is also seen after certain kernel-mode
faults ("Oops") - if a process or a thread dies on an unexpected
kernel-mode fault, but with a lock on a resource still held (so the lock
is then never released, causing other processes to bump into it).

I "reserved" another 10% for all other possible causes. ;-)

None of the above is OpenVZ-specific.

> There is nothing in syslog on the host server or any containers.

If it's a disk issue, and you only have one RAID array with both the
root fs and the logs on it, then logging will likely not work when the
issue is triggered - which is why you won't see anything in the logs.
Ideally, you'd run "dmesg", but for that you need to be able to run a
command.

> There is nothing on the console.

This is not specific enough. ;-)  Is the console screen entirely blank
or does it show, say, a login prompt?  If it's blank, then does it get
unblanked on a keypress?  (I am assuming that you're not running any
sort of GUI on the server.)

I recommend that you deactivate the kernel's built-in screensaver by
adding:

echo -e '\033[9;0]'

to /etc/rc.d/rc.local, and also issuing this command on the running
system with output redirected to /dev/console and/or /dev/tty0.  Then
the console will display the last messages even if the kernel is locked
up so badly that a keypress would not unblank the screen.
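
For the running system, the redirection could look roughly like the
sketch below.  The function name is made up for illustration, and
printf is used instead of "echo -e" only because it behaves the same
in every shell; the escape sequence itself is the one given above.

```shell
# Hypothetical helper: write ESC [ 9 ; 0 ] (blank timeout 0 = never blank)
# to a console device.  $1 is the target, /dev/console by default.
disable_console_blanking() {
    printf '\033[9;0]' > "${1:-/dev/console}"
}

# Typical use, as root:
#   disable_console_blanking /dev/console
#   disable_console_blanking /dev/tty0
```

The same printf line (with the redirection) can go into
/etc/rc.d/rc.local so that the setting survives a reboot.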

In another message, you mentioned you were using a serial console.  That's
great.  Were you referring to it when you said that there was nothing on
the console?

If you suspect that the console might not be working well enough (e.g.,
not being "quick enough" to capture the last messages before the kernel
locks up too badly to continue logging even to the console), you could
also try netconsole (it sends kernel messages over UDP, which a remote
syslogd can receive).  In our experience, netconsole usually eliminates
the need for serial consoles.
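
A minimal sketch of loading netconsole follows.  The module takes a
single "netconsole=" parameter of the form
src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac.  The helper name and all
addresses below are placeholders (assumptions, not taken from your
setup); substitute your own interface, log host IP, and the MAC
address of the log host or of the gateway leading to it.

```shell
# Hypothetical helper: build the value of the netconsole= module parameter.
# Arguments: src-port src-ip dev tgt-port tgt-ip tgt-mac
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example (placeholder addresses; target port 514 assumes a remote syslogd
# that accepts UDP messages):
#   modprobe netconsole \
#     "netconsole=$(netconsole_param 6665 192.0.2.10 eth0 514 192.0.2.20 00:11:22:33:44:55)"
```

Any UDP listener on the target port will do on the receiving side, as
long as it writes what it receives to a disk other than the one you
suspect.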

> It sounds like a resource issue.

No, it does not.  You seem to have plenty of RAM, and I assume that you
have reasonable privvmpages and kmemsize limits set up, right?  If so,
it is unlikely that a container would unintentionally cause resource
starvation this bad.
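
If you want to rule resource limits out rather than take my word for
it, one rough way is to look for non-zero fail counters in
/proc/user_beancounters while the system is still up (the counters do
not survive a reboot).  The sketch below assumes the usual layout
where failcnt is the last column and the container ID appears as
"NNN:" on the first line of each group; the function name is made up.

```shell
# Hypothetical helper: print "CTID resource failcnt" for every beancounter
# whose failcnt is greater than zero.  $1 is the file to read
# (default /proc/user_beancounters, readable by root only).
failed_beancounters() {
    awk '
        # First line of each group starts with "CTID:"; remember the ID.
        $1 ~ /^[0-9]+:$/ { ctid = $1; sub(/:$/, "", ctid) }
        # Data lines end in a numeric failcnt; the resource name is 5 fields before it.
        NF >= 6 && $NF ~ /^[0-9]+$/ && $NF + 0 > 0 { print ctid, $(NF-5), $NF }
    ' "${1:-/proc/user_beancounters}"
}

# Typical use, as root on the host:
#   failed_beancounters
```

No output means no container has hit a limit since boot.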

> Linux atl-vz1 2.6.18-028stab056 #1 SMP Tue Jun 30 07:50:32 EDT 2009 
> x86_64 Intel(R) Xeon(R) CPU E5420 @ 2.50GHz GenuineIntel GNU/Linux

You really ought to upgrade to a more recent rhel5 branch kernel,
although we've been successfully running both older and newer OpenVZ
rhel5 kernels on Dell 2950s without running into any issues.  We've
always done custom builds (our own CONFIG_* settings), though.

> # vzlist -o ctid,kmemsize,kmemsize.l -s kmemsize

These limits are low enough (for an x86_64 system), no problem here.

I hope this helps.

<plug>
You may also consider outsourcing your sysadmin issues to us.  We're
quite used to installing and managing OpenVZ-based servers remotely,
which we've been doing for years.
</plug>

Alexander

