[CRIU] criu - sockets-failover investigation

Mon Nov 26 12:48:28 EST 2012

Hi Pavel,

I shared your reply re: tcp_time_stamp with the group investigating
sockets-failover and it was very positively received.

Requesting your comments/ideas on the following wrt the
sockets-failover investigation.

(I still owe you answers to your earlier questions on the overall
sockets-failover design...I'll provide that to you soon.)

---

Here are the other concerns we have about Connection Repair that are not
already in the TODO list: http://criu.org/TCP_repair_TODO

advmss - this is the MSS value advertised in the outgoing SYN or SYN-ACK
when the connection is established.  It is initialized from the route
MTU if available or to the larger of the device MTU and sysctl
route.min_adv_mss, but never larger than the TCP_MAXSEG socket option.
It is used for various window and memory calculations as the maximum
size of an incoming packet. Connection Repair does not save the value,
so if the route or device MTU is different in the migration target
system, the calculations might not be correct. This may be acceptable if
the remote side has PMTU discovery, since the remote will adjust to the
new MTU.  If the remote side does not use PMTU discovery, however, it
will continue to use the previous MTU and the calculations will be
permanently wrong.  The local side may be able to infer which is the
case from the DF flag of incoming packets, but Connection Repair needs
to save the original value.
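
To make the concern concrete, here is a rough sketch of the selection
logic as described above. It is a toy model, not the kernel code: the
function name, the 40-byte header overhead (IPv4 plus TCP, no options)
and the example MTU and min_adv_mss values are assumptions for
illustration only.

```python
# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
IPV4_TCP_HEADER_BYTES = 40


def advertised_mss(device_mtu, min_adv_mss, route_mtu=None, user_mss=None):
    """Toy model of how advmss is chosen (hypothetical helper)."""
    if route_mtu is not None:
        # Route MTU takes precedence when it is available.
        mss = route_mtu - IPV4_TCP_HEADER_BYTES
    else:
        # Otherwise fall back to the device MTU, bounded below by the sysctl.
        mss = max(device_mtu - IPV4_TCP_HEADER_BYTES, min_adv_mss)
    if user_mss is not None:
        # Never larger than the TCP_MAXSEG socket option, if one was set.
        mss = min(mss, user_mss)
    return mss


# Same connection, recomputed on two hosts with different device MTUs.
on_original = advertised_mss(device_mtu=9000, min_adv_mss=256)
on_target = advertised_mss(device_mtu=1500, min_adv_mss=256)
print(on_original, on_target)
```

If the value is not carried across, the target silently ends up with
the second number while the remote may still be sending segments sized
for the first.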

rcv_wnd - this is the window advertised in outgoing packets.  It is
initially calculated for the SYN or SYN-ACK packet and recalculated when
received data is freed, so it might increase while the connection
exists.  RFC 793 strongly discourages shrinking it.  Connection Repair
does not save the value, so it is calculated anew in the migration
target system with no guarantee that it won't be less than that used by
the original system. 
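
A minimal sketch of the check we have in mind, assuming the saved
rcv_nxt and rcv_wnd are available at restore time. The helper names are
hypothetical and this is not how the kernel structures the code; it only
shows the comparison of the window's right edge in 32-bit sequence
space.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2^32


def right_edge(rcv_nxt, rcv_wnd):
    """Sequence number just past the advertised window."""
    return (rcv_nxt + rcv_wnd) % SEQ_MOD


def window_shrinks(rcv_nxt, saved_wnd, new_wnd):
    """True if the new right edge lies to the left of the saved one."""
    old = right_edge(rcv_nxt, saved_wnd)
    new = right_edge(rcv_nxt, new_wnd)
    # Interpret the difference as a signed 32-bit value to handle wrap.
    diff = (new - old) % SEQ_MOD
    return diff >= (1 << 31)


def restored_window(saved_wnd, recalculated_wnd):
    """Pick a window for the target that never undercuts the saved one."""
    return max(saved_wnd, recalculated_wnd)
```

With the saved value in hand, the target could advertise
restored_window(saved, recalculated) rather than the recalculated value
alone.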

SOCK_BINDADDR_LOCK, SOCK_BINDPORT_LOCK - these flags are set if an
explicit address or port is specified in bind() and are used during
disconnect to tell when to free the address and/or port.  When a
connection is migrated by Connection Repair, bind() always has an
explicit address and port specified, so the original setting is lost.
I’m not sure what the disconnect code is trying to do, but it probably
should be investigated.

socket options - there is a bevy of socket options that Connection
Repair does not migrate, for example:  SO_REUSEADDR, SO_RCVLOWAT,
SO_OOBINLINE, SO_KEEPALIVE, SO_LINGER, SO_DONTROUTE, IP_TTL, IP_TOS,
TCP_NODELAY, TCP_KEEPCNT, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
IPV6_MTU, IPV6_UNICAST_HOPS, IPV6_V6ONLY. These are just the ones we
need; we can save and restore them separately ourselves,
but if Connection Repair is to become a general solution for Linux, it
should migrate all the socket options Linux supports.
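
For reference, saving and restoring the simple integer-valued options
from user space can be sketched roughly as below. This is only an
illustration of what we would otherwise do ourselves: the helper names
are hypothetical, only a subset of the options listed above is covered,
and struct-valued options such as SO_LINGER would need separate
handling.

```python
import socket

# (level, option) pairs for some of the integer-valued options listed above.
INT_OPTIONS = {
    "SO_REUSEADDR": (socket.SOL_SOCKET, socket.SO_REUSEADDR),
    "SO_KEEPALIVE": (socket.SOL_SOCKET, socket.SO_KEEPALIVE),
    "SO_OOBINLINE": (socket.SOL_SOCKET, socket.SO_OOBINLINE),
    "IP_TTL": (socket.IPPROTO_IP, socket.IP_TTL),
    "IP_TOS": (socket.IPPROTO_IP, socket.IP_TOS),
    "TCP_NODELAY": (socket.IPPROTO_TCP, socket.TCP_NODELAY),
    "TCP_KEEPCNT": (socket.IPPROTO_TCP, socket.TCP_KEEPCNT),
    "TCP_KEEPIDLE": (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE),
    "TCP_KEEPINTVL": (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL),
}


def save_options(sock):
    """Read each option with getsockopt() and return a name -> value dict."""
    return {
        name: sock.getsockopt(level, opt)
        for name, (level, opt) in INT_OPTIONS.items()
    }


def restore_options(sock, saved):
    """Apply previously saved values to another socket with setsockopt()."""
    for name, value in saved.items():
        level, opt = INT_OPTIONS[name]
        sock.setsockopt(level, opt, value)
```

The saved dictionary would travel alongside the rest of the connection
state and be applied to the new socket on the target before traffic
resumes.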

mc_ttl - this is the multicast TTL and although the value is never used
for TCP connections, it is saved for incoming connections and can be
queried by getsockopt(). Connection Repair somewhat understandably does
not save the value, but the implementation probably should.

Incoming traffic restart - Connection Repair does not seem to do
anything to kick off traffic flow from the remote side after migration.
The remote will start when its retransmission timer expires, but that
timer may have backed off to over a minute, which is kind of a long time
to wait for traffic to flow again. By explicitly sending 3 packets with
unchanged acknowledgments after migration, fast retransmit can be
triggered to get incoming traffic going sooner.
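
The idea can be illustrated with a toy model of the remote sender's
duplicate-ACK counter. This is a simplification for discussion, not
real stack code: the class and attribute names are made up, sequence
wrap is ignored, and the threshold of 3 is the conventional
fast-retransmit value.

```python
DUPACK_THRESHOLD = 3  # conventional fast-retransmit threshold


class ToySender:
    """Tracks only what is needed to show the duplicate-ACK trigger."""

    def __init__(self, snd_una):
        self.snd_una = snd_una        # oldest unacknowledged sequence number
        self.dupacks = 0              # duplicate ACKs seen for snd_una
        self.fast_retransmits = []    # sequence numbers resent early

    def on_ack(self, ack):
        if ack > self.snd_una:
            # New data acknowledged: advance and reset the counter.
            self.snd_una = ack
            self.dupacks = 0
        elif ack == self.snd_una:
            # Unchanged acknowledgment: count it as a duplicate.
            self.dupacks += 1
            if self.dupacks == DUPACK_THRESHOLD:
                # Resend the missing segment without waiting for the timer.
                self.fast_retransmits.append(self.snd_una)


remote = ToySender(snd_una=1000)
for _ in range(3):
    remote.on_ack(1000)  # three unchanged ACKs sent by the restored side
print(remote.fast_retransmits)
```

In this model the third unchanged acknowledgment makes the remote
resend immediately, which is the effect we are after on the restored
connection.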

---

Thanking you in advance for your time and effort.
-DilipD.

On Tue, 2012-11-20 at 22:11 +0400, Pavel Emelyanov wrote:
> On 11/20/2012 09:45 PM, Dilip Daya wrote:
> > Hi Pavel,
> > 
> > On Tue, 2012-11-20 at 20:55 +0400, Pavel Emelyanov wrote:
> >> On 11/20/2012 08:21 PM, Dilip Daya wrote:
> >>>
> >>> Re: tcp_time_stamp when restoring TCP connections
> >>>
> >>> tcp_time_stamp:
> >>> This is the low-order 32 bits of the jiffies counter and is used to
> >>> generate the timestamp in the TCP timestamp option.  The timestamp
> >>> option is included in almost every outgoing TCP packet and is used for
> >>> two purposes.
> >>>
> >>> For PAWS, it acts something like a high-order extension of the packet
> >>> sequence number.  If the receiver sees a timestamp value less than what
> >>> it previously got on the connection (modulo 2^31), it assumes the
> >>> sequence number is from a previous wrap and discards the packet.  The
> >>> timestamp generator must therefore never decrease.
> >>>
> >>> The jiffies counters on separate systems can be any value when restore
> >>> occurs, so the restore system has a good chance of generating timestamps
> >>> less than that of the original system and the peer will start dropping
> >>> packets.  This problem was discussed in the paper "TCP Connection
> >>> Passing" by Werner Almesberger, which is the basis of previous
> >>> connection migration code for Linux.  The solution given in that paper
> >>> is to save an offset for each connection that is applied to
> >>> tcp_time_stamp when generating the TCP timestamp option.  The offset is
> >>> calculated to ensure the timestamps never decrease.
> >>>
> >>> Connection Repair does not appear to do anything special for timestamp
> >>> generation and I'm wondering how it gets away with that.
> >>>
> >>> => So looking at Linux kernel code snippets, does the following take
> >>>    care of this issue?
> >>
> >> You're right, the tcp repair feature is not complete yet. Currently we're
> >> concentrated on making it work when doing c/r without changing the kernel.
> >> If you can participate in this, things will get fixed faster :)
> > 
> > Sure I'm ready to participate, but will tread softly with the help of
> > your guidance.
> > 
> > Let me very briefly explain the motivation that stemmed from your great
> > work on c/r TCP connections: I'm investigating a kernel-level
> > framework for live TCP connection failover at sockets-layer and hope to
> > make minimal kernel changes that are acceptable to c/r project as well
> > as include any changes to crtool.
> 
> Can you elaborate a little bit more on this? Do you try to implement a
> system that keeps a full mirror of a socket on some other host and switch
> to it in case of the original node crash? By full mirror I mean a socket
> which has the state (as seen by the connection peer) exactly the same as
> of the original one. This is slightly different from what c/r does, but
> it makes perfect sense to come up with the unified solution for both.
> 
> > Thanks.
> > -DilipD.
> > 
> >>
> >> I think I should have a page on wiki that lists all the issues we haven't
> >> yet fixed with tcp repair.
> > 
> > That will be great!
> 
> Already did: http://criu.org/TCP_repair_TODO
> This lists all the issues I'm aware of with the tcp repair.
> 
> > Thanks.
> > -DilipD.
> > 
> >>
> >> Thanks,
> >> Pavel
> > 
> > .
> > 
> 
>