[Users] OpenVZ on EL6 - weird network issue

Michael Stauber mstauber at blueonyx.it
Sun Jan 4 17:53:18 PST 2015


Hi Scott,

> I'd like to hear more once you get it figured out.

We had two more outages in between. I also learned that it's not just
the three OpenVZ nodes sitting in that class C network: there are also
about 50 other (physical) Linux servers in it. So there is a certain
level of noise and congestion in that subnet, which on average carries
around 16 Mbit/s of traffic.

The pfSense in front of it used to be clustered, but clustering had
been disabled for troubleshooting. Still, the remaining pfSense flooded
the network with VRRP announcements, which made up the majority of the
broadcast-related traffic in that class C. That might (or might not)
play a role here. You'd only see those VRRP requests (or this volume of
them) in a datacenter with a similar setup.
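
For anyone who wants to check for the same thing: VRRP (and pfSense's
CARP, which shares its protocol number) is IP protocol 112, so a quick
count along these lines should show the announcement rate. The
interface name is just an example:

  # count VRRP/CARP announcements seen in 60 seconds
  timeout 60 tcpdump -ni eth0 'ip proto 112' | wc -l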

Further segmentation of that network seems prudent, but can't be done at
the moment as it would be too disruptive. The client is considering it,
though.

Forcing the OpenVZ nodes to send hourly arpsends looked like it had
improved the situation, but there were still two recorded outages. And
as before, two of the three nodes would usually lose network
connectivity at nearly the same time - just a few minutes apart. Not
always the same nodes, but always two of the three.
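
In case someone wants to replicate that: the hourly refresh boiled down
to a gratuitous ARP for each address on the uplink interface. Roughly
this as a cron script (interface and the address discovery are a
sketch, adjust to your setup):

  #!/bin/sh
  # /etc/cron.hourly/arp-refresh - re-announce our IPs to the subnet
  IFACE=br0
  for IP in $(ip -4 -o addr show dev "$IFACE" | awk '{split($4,a,"/"); print a[1]}'); do
    arpsend -U -i "$IP" "$IFACE"
  done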

We had set up some monitoring that dumps the ARP table and the routes
every minute and diffs them against the previous run. On the nodes that
diff didn't indicate any changes at the time the failures happened. The
ARP table on the pfSense also remained the same.
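
The monitoring itself was nothing fancy, roughly this from a per-minute
cron job (paths are examples):

  #!/bin/sh
  # snapshot ARP table and routes, log any change since the last run
  DIR=/var/tmp/netmon
  mkdir -p "$DIR"
  { arp -an; ip route show; } > "$DIR/now"
  if [ -f "$DIR/last" ] && ! diff -u "$DIR/last" "$DIR/now" > "$DIR/diff"; then
    { date; cat "$DIR/diff"; } >> "$DIR/changes.log"
  fi
  mv -f "$DIR/now" "$DIR/last"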

We ran a ping from inside a VPS to the outside world and from the
outside to a VPS, with a tcpdump on both br0 and venet0 on the node and
another tcpdump on venet0 inside the VPS. If I recall correctly, the
ICMP packets arrived at the node's br0 and were routed to venet0, where
the VPS could see them. But on the way back the ICMP response got lost
between venet0 and br0.
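
For the record, the capture setup was roughly this (the VPS IP is a
placeholder), run side by side so you can watch where the reply
disappears:

  # on the node, one terminal each:
  tcpdump -ni br0    icmp and host 192.0.2.10
  tcpdump -ni venet0 icmp and host 192.0.2.10
  # inside the VPS:
  tcpdump -ni venet0 icmp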

We were kinda running out of options there (and out of patience as far
as the client's clients were concerned). So we ditched the bridges
entirely and reconfigured all nodes to use eth0 directly. As we don't
use KVM (at least at the moment), we can make do without the bridges.
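
In ifcfg terms that change is simple: delete ifcfg-br0, remove the
BRIDGE= line from ifcfg-eth0 and put the IP configuration back onto
eth0. Something like this (addresses are examples, not our real ones):

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  DEVICE=eth0
  ONBOOT=yes
  BOOTPROTO=static
  IPADDR=192.0.2.5
  NETMASK=255.255.255.0
  GATEWAY=192.0.2.1
  # BRIDGE=br0   <- removed

Followed by a "service network restart". The venet traffic doesn't need
a bridge; it simply follows the node's routing table.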

Open vSwitch sounds like a *very* interesting alternative, but we
haven't yet found the time to experiment with it. So we went with the
"devil we know", which is plain eth0. :p

It'll take a few days to see if that solves our issues, but so far it's
looking good. /knocking on wood

Many thanks to all who offered suggestions. Much appreciated!

-- 
With best regards

Michael Stauber

