[CRIU] HAProxy + CRIU

Pavel Emelyanov xemul at virtuozzo.com
Wed Apr 13 05:39:44 PDT 2016


On 04/08/2016 10:35 PM, Fox, Kevin M wrote:
> A lot of clients don't reconnect or retry on failure beyond what raw TCP provides. They also cache DNS entries too long, so switching IPs around in DNS entries to adjust load balancers is a real pain, if it's possible at all.
> 
> So, that makes load balancers kind of important to always have running. And keepalived doesn't help much other than providing a quick-but-dirty recovery: connections break.
> 
> But for scalability of some services, you may want to run a lot of load balancers.
> 
> So, say you have a pool of HTTP object storage servers (Swift, for example):
> n[0-5]. Each is attached at 40 Gbit.
> 
> You could put a load balancer in front, on separate machines, but you'd have to have a big cluster with a lot of extra bandwidth for it not to be a bottleneck.
> 
> A better option would be to run haproxy on each node, n[0-5], and have each prefer transferring to its local node.
> 
> The problem then is that taking a node out, say n1, is painful. You need to get the DNS entry updated to remove its IP (which doesn't work if your organization doesn't let you do that quickly), then wait out the minimum DNS timeout on common OSes (Windows, I think, is still 30 minutes), then make sure all the traffic drains.
> 
> This solution is an attempt to eliminate the need to play with DNS. DNS would always carry the VIPs for the n[0-5] load balancers.
> 
> When, say, n1 is to be pulled out, its haproxy is live-migrated to another of the machines, say n3. The connections inside that haproxy still point back at n1, so they are not lost.
> 
> You can then tweak the config of the migrated haproxy to point at n3 and reload it. All new connections to the n1 proxy then go to the n3 server. Once all connections between n3 and n1 have drained, n1 can safely be upgraded, have hardware replaced, be rebooted, etc., and be brought back online.

:)

> Once maintenance of n1 is complete, the procedure can be reversed to bring back full-bandwidth access to the resources. You can cycle through all the nodes to upgrade them all.
> 
> So that covers live-upgrading the host a haproxy is running on. haproxy's reload actually spawns a new process that takes the newer connections (see the sketch below), so if you enter the container, upgrade, and then reload, you safely upgrade that piece of the software too.
> 
> So the tool provides a way to upgrade the whole software stack live.
> 
> This only works if haproxy doesn't lose connections, though, so it really relies on connection repair in CRIU being solid. I'm hoping it is, which is one reason I'm posting here. Curious if there are any known gotchas.
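
On the reload trick mentioned above: the property it relies on is that the old
and the new haproxy process can both own the listening port during the
switchover, so the old one keeps serving its established connections while the
new one picks up fresh ones. This is not haproxy's actual code, just a minimal
sketch of one way Linux provides that property, SO_REUSEPORT (since 3.9); the
helper name is mine:

/*
 * Both the old and the new process set SO_REUSEPORT before bind(), so the
 * second bind() to the same address succeeds instead of failing with
 * EADDRINUSE, and the kernel hands new connections to whichever listeners
 * are open.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15 /* Linux value, in case libc headers predate it */
#endif

static int listen_reuseport(unsigned short port)
{
    struct sockaddr_in addr;
    int sk = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    if (sk < 0)
        return -1;

    /* Must be set before bind() in every process sharing the port. */
    if (setsockopt(sk, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sk, (const struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sk, SOMAXCONN) < 0)
        return -1;

    return sk; /* the old process close()s its copy once drained */
}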

Well, we have several gotchas with TCP, see https://criu.org/TCP_repair_TODO
The biggest one is that we support only LISTEN, CLOSED and ESTABLISHED sockets. If your
socket is in SYN_SENT while opening a connection, or in FIN_WAIT or the like while closing
one, criu will refuse to work with it.
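
Under the hood this is the TCP repair API that went into Linux 3.5 (the
TCP_REPAIR socket option and friends). Here is a minimal sketch of the dump
side, assuming CAP_NET_ADMIN, with error handling mostly omitted; the helper
name is mine and the constants are guarded in case libc headers predate them:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR       19
#define TCP_REPAIR_QUEUE 20
#define TCP_QUEUE_SEQ    21
#endif
#ifndef TCP_RECV_QUEUE
#define TCP_RECV_QUEUE 1
#define TCP_SEND_QUEUE 2
#endif

/* Freeze an ESTABLISHED socket and read out the send/recv sequence
 * numbers a checkpoint needs. States mid-handshake or mid-teardown
 * have in-flight segments this API cannot capture, hence the
 * LISTEN/CLOSED/ESTABLISHED restriction above. */
static int dump_seqs(int sk, unsigned *snd_seq, unsigned *rcv_seq)
{
    int yes = 1, no = 0, q;
    socklen_t len = sizeof(unsigned);

    /* In repair mode the socket emits no packets; the peer sees silence. */
    if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &yes, sizeof(yes)) < 0)
        return -1;

    q = TCP_SEND_QUEUE;
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
    getsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len);

    q = TCP_RECV_QUEUE;
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
    getsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len);

    /* A real dump also saves queued data and TCP options, then close()s
     * while still in repair mode so no FIN or RST goes on the wire;
     * here we just drop back to normal operation. */
    return setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &no, sizeof(no));
}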

> I think this could be a disruptive technology in the network-scalability world, and it is a real-world showcase of CRIU's power.

Yes, this sounds promising. I'm now working on pulling the TCP code out of criu
into a separate library, so that one can checkpoint and restore an individual TCP socket.
This ability could probably be used to make your scenario faster...
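
For a taste of what such a library would wrap, here is the restore half of the
same kernel API, again only a sketch: the function name and the reduced set of
restored state are mine for illustration, a real restore also re-injects queued
data and TCP options (TCP_REPAIR_OPTIONS), and all error handling is omitted.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR       19
#define TCP_REPAIR_QUEUE 20
#define TCP_QUEUE_SEQ    21
#endif
#ifndef TCP_RECV_QUEUE
#define TCP_RECV_QUEUE 1
#define TCP_SEND_QUEUE 2
#endif

/* Recreate an ESTABLISHED socket from saved state without any
 * handshake: in repair mode, connect() just installs the state
 * and sends nothing. */
static int restore_established(const struct sockaddr_in *self,
                               const struct sockaddr_in *peer,
                               unsigned snd_seq, unsigned rcv_seq)
{
    int sk = socket(AF_INET, SOCK_STREAM, 0);
    int yes = 1, no = 0, q;

    /* Repair mode must be entered while the socket is still CLOSED. */
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &yes, sizeof(yes));

    /* Pre-load the sequence numbers saved at dump time. */
    q = TCP_SEND_QUEUE;
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
    setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq));

    q = TCP_RECV_QUEUE;
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
    setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq));

    /* Re-bind the original source address, then "connect": no SYN is
     * sent, the socket simply becomes ESTABLISHED. */
    bind(sk, (const struct sockaddr *)self, sizeof(*self));
    connect(sk, (const struct sockaddr *)peer, sizeof(*peer));

    /* Leaving repair mode puts the socket back on the wire. */
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &no, sizeof(no));
    return sk;
}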

> ----------------------------
> 
> As for the diskless migration, I think I may implement it eventually, but the container is only 8MB in size, and the process
> is similarly sized. It's all about network connections with haproxy, not much memory usage. So it would have been a fair number of extra
> steps without much benefit for the first pass.

Yup, for such a small one it makes sense.

-- Pavel


