[Users] OpenVZ on EL6 - weird network issue

Sat Dec 27 12:05:45 PST 2014

Hi all,

My apologies if this might be the wrong list to ask such questions, but
we're running out of ideas with a weird and sporadic network/routing
issue on several OpenVZ hosts.

Quick rundown: We have been using OpenVZ since around 2006 with great
success, and over the years we have deployed hundreds of OpenVZ servers
for clients, typically small to medium sized ISPs.

However, there is a particular set of three OpenVZ nodes (in the same
data center) that gives us network related grief without end.

This client has nodes and VPS's on private IP's and does NAT. There is a
switch and a pfSense firewall in front of them.

All nodes are pretty powerful (24 cores, Intel(R) Xeon(R) CPU E5-2630L
v2 @ 2.40GHz, ~200GB of RAM) and run a fully yum-updated Scientific
Linux 6.6 (64-bit) with the latest OpenVZ kernels and VZ RPMs from
OpenVZ. Small correction: One node was rolled back to
2.6.32-042stab092.2, but it showed the same problems again afterwards.

The nodes use a bridged network setup (br0), and all VPS's exclusively
use venet-style network devices. The OS's inside the VPS's are diverse
and range from EL5, EL6 and Fedora (various) to OpenSuSE.

Symptoms:
=========

All nodes and VPS's sporadically become unreachable from the outside.
From within the private network (from another box, for example) one can
still SSH in. Pings to public IPs then still work from the nodes, but no
longer work from inside the VPS's.

This used to happen maybe once a year. Now it happens once or twice a
day, and usually once one node acts up like this, the others start to
act that way within an hour or two as well. Today all three failed
within 2-3 hours, and yesterday two of them failed at roughly the same
time.

We did a lot of troubleshooting, doc reading and googling, but we are
pretty much at our wits' end and are looking for help. So far, in case
of a failure the only remedy seems to be a reboot, which is (naturally)
by now rocking the boat a lot more than tolerable. Restarting the
network, dropping and re-adding routes manually, restarting the service
"vz" or individual VPS's, or any combination of these simply doesn't
restore connectivity.

In case of failure neither the routing table nor the ARP table on the
nodes seems to change, and we're running no iptables rules on the nodes.
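
A before/after comparison of those two tables could be scripted roughly
as below. This is a minimal Python 3 sketch, not the procedure that was
used on these nodes: the helper names and the snapshot file argument are
hypothetical, and it assumes the iproute2 "ip" command is available. It
strips the neighbour state keywords (REACHABLE, STALE, ...) so that
normal ARP cache churn does not show up as a difference.

```python
import difflib
import json
import subprocess
import sys

# Tables to snapshot; both are standard iproute2 invocations.
COMMANDS = {
    "routes": ["ip", "route", "show"],
    "neighbours": ["ip", "neigh", "show"],
}

# Neighbour cache states that flip constantly and would only add noise.
NUD_STATES = {"PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
              "INCOMPLETE", "DELAY", "PROBE", "FAILED"}


def normalize(line):
    """Drop volatile neighbour-state keywords from one output line."""
    return " ".join(tok for tok in line.split() if tok not in NUD_STATES)


def capture(cmd):
    """Run a command and return its normalized output lines, sorted."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    lines = (normalize(l) for l in result.stdout.splitlines())
    return sorted(l for l in lines if l)


def snapshot():
    """Capture every table listed in COMMANDS."""
    return {name: capture(cmd) for name, cmd in COMMANDS.items()}


def diff_snapshots(before, after):
    """Return unified-diff lines for every table that differs."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name, []), after.get(name, [])
        if old != new:
            changes.extend(difflib.unified_diff(
                old, new, fromfile=name + " (healthy)",
                tofile=name + " (failed)", lineterm=""))
    return changes


if __name__ == "__main__":
    # Usage: script.py save FILE     (while the node is healthy)
    #        script.py compare FILE  (while the node is failing)
    mode, path = sys.argv[1], sys.argv[2]
    if mode == "save":
        with open(path, "w") as fh:
            json.dump(snapshot(), fh)
    else:
        with open(path) as fh:
            healthy = json.load(fh)
        print("\n".join(diff_snapshots(healthy, snapshot())) or "no change")
```

Run with "save" while a node is healthy and with "compare" from the
private network once it has failed; "no change" would match what we
observed by hand.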

It could be that the problem is related to external factors, too. Or
maybe we made a boo-boo with our configuration either on the nodes or
with the network architecture in general.

We would *really* appreciate some feedback and assistance to solve this
issue.

I compiled a snapshot of the configuration and some diagnostic output at
this URL:

http://d2.smd.net/.210/host210-OpenVZ-nw-issues.txt

Any ideas or pointers? Many thanks in advance.

-- 
With best regards

Michael Stauber
mstauber at blueonyx.it