[CRIU] criu - sockets-failover investigation

Tue Nov 27 10:11:07 EST 2012

On 11/26/2012 09:48 PM, Dilip Daya wrote:
> Hi Pavel,
> 
> I shared your reply re: tcp_time_stamp with the group investigating
> sockets-failover and it was very positively received.
> 
> Requesting your comments/ideas on the following wrt
> sockets-failover investigation.
> 
> (I still owe you answers to your earlier questions on the overall
> sockets-failover design...I'll provide those to you soon.)
> 
> ---
> 
> Here are the other concerns we have about Connection Repair that are not
> already in the TODO list: http://criu.org/TCP_repair_TODO

There are 3 classes of things that should be saved/restored with a TCP socket.

1. Socket options like linger, rcv/snd buffers, etc. These have nothing to do
   with the TCP repair mode itself, they are just socket options to keep an
   eye on.
2. TCP socket state that can be automatically resurrected by the TCP state
   machine. E.g. the window value changes constantly over the connection's
   lifetime, reflecting the peer's current ability to receive more data. It
   is safe to restore it with some "default" value, letting the subsequent
   connection flow eventually "tune" it to the desired one. These values can
   be ignored by the TCP repair code.

Within class 2, I'd define a subclass 2a :) which is

2a. Things that can be auto-tuned by the connection flow, but for which it's
    better to have getters and setters for performance reasons.

3. TCP socket state that canNOT be automatically resurrected by the TCP state
   machine. E.g. the SACK/TSTAMP flags are negotiated at connection time
   and will not change. Thus we have to save and restore them. I agree that
   the existing TCP repair implementation may not cover class 3 completely.

That said, see my comments regarding the values you mentioned below.

> advmss - this is the MSS value advertised in the outgoing SYN or SYN-ACK
> when the connection is established.  It is initialized from the route
> MTU if available or to the larger of the device MTU and sysctl
> route.min_adv_mss, but never larger than the TCP_MAXSEG socket option.
> It is used for various window and memory calculations as the maximum
> size of an incoming packet. Connection Repair does not save the value,
> so if the route or device MTU is different in the migration target
> system, the calculations might not be correct. This may be acceptable if
> the remote side has PMTU discovery, since the remote will adjust to the
> new MTU.  If the remote side does not use PMTU discovery, however, it
> will continue to use the previous MTU and the calculations will be
> permanently wrong.  The local side may be able to infer which is the
> case from the DF flag of incoming packets, but Connection Repair needs
> to save the original value.

From what I see, tp->advmss reflects how large a packet the route between
src and dst allows to pass. Thus, when we live-migrate a socket it's better
to set tp->advmss according to the new route.

> rcv_wnd - this is the window advertised in outgoing packets.  It is
> initially calculated for the SYN or SYN-ACK packet and recalculated when
> received data is freed, so it might increase while the connection
> exists.  RFC 793 requires that it never decrease.  Connection Repair
> does not save the value, so it is calculated anew in the migration
> target system with no guarantee that it won't be less than that used by
> the original system. 

I don't think I fully agree with that. From what I see in the code,
tcp_transmit_skb calls tcp_select_window, which in turn may change the
tp->rcv_wnd value. Thus, this value is not constant and seems to belong
to class 2.
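
As a rough illustration of the "never decrease" concern, the sketch below
models a window selection step that refuses to shrink a window that was
already offered. The names mirror the tcp_sock fields only for readability;
this is a simplified model under stated assumptions (no window scaling, no
sequence number wrap), not the kernel code.

```python
def receive_window(rcv_wup, rcv_wnd, rcv_nxt):
    """Part of the last advertised window that is still open.

    rcv_wup: value of rcv_nxt when the window was last advertised
    rcv_wnd: window size that was advertised at that time
    rcv_nxt: next sequence number expected now
    """
    return max(rcv_wup + rcv_wnd - rcv_nxt, 0)


def select_window(candidate, rcv_wup, rcv_wnd, rcv_nxt):
    """Pick the window to advertise without moving the right edge left.

    `candidate` is what the free buffer space alone would allow. If it is
    smaller than what the peer was already promised, keep the promise.
    """
    return max(candidate, receive_window(rcv_wup, rcv_wnd, rcv_nxt))


if __name__ == "__main__":
    # 5000 bytes were offered at seq 0 and 2000 have arrived since, so
    # 3000 bytes of the old offer are still outstanding.
    print(select_window(1000, rcv_wup=0, rcv_wnd=5000, rcv_nxt=2000))
```

In this model the guard only works if the previously offered values are
known; a socket that starts from fresh values has nothing to compare the
candidate against.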

> SOCK_BINDADDR_LOCK, SOCK_BINDPORT_LOCK - these flags are set if an
> explicit address or port is specified in bind() and are used during
> disconnect to tell when to free the address and/or port.  When a
> connection is migrated by Connection Repair, bind() always has an
> explicit address and port specified, so the original setting is lost.
> I’m not sure what the disconnect code is trying to do, but it probably
> should be investigated.

I agree, but these bits are not TCP-specific and should be considered
outside the TCP-repair mode, i.e. belong to class 1.

> socket options - there is a bevy of socket options that Connection
> Repair does not migrate, for example:  SO_REUSEADDR, SO_RCVLOWAT,
> SO_OOBINLINE, SO_KEEPALIVE, SO_LINGER, SO_DONTROUTE, IP_TTL, IP_TOS,
> TCP_NODELAY, TCP_KEEPCNT, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
> IPV6_MTU, IPV6_UNICAST_HOPS, IPV6_V6ONLY. These are just the ones we
> need; we can save and restore them separately ourselves,
> but if Connection Repair is to become a general solution for Linux, it
> should migrate all the socket options Linux supports.

These are purely class 1 and should be read and set independently. You can
look at how crtools handles some of these options.
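
A minimal sketch of what "get and set independently" can look like from
user space, using plain getsockopt()/setsockopt() on Linux. The helper names
dump_options and restore_options and the option list are illustrative
assumptions, not what crtools actually does, and the list is far from
complete.

```python
import socket
import struct

# Integer-valued options to carry over (illustrative subset only).
INT_OPTIONS = [
    (socket.SOL_SOCKET, socket.SO_REUSEADDR),
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE),
    (socket.SOL_SOCKET, socket.SO_OOBINLINE),
    (socket.IPPROTO_IP, socket.IP_TTL),
    (socket.IPPROTO_IP, socket.IP_TOS),
    (socket.IPPROTO_TCP, socket.TCP_NODELAY),
]

# SO_LINGER takes a struct linger { int l_onoff; int l_linger; }.
LINGER_FMT = "ii"


def dump_options(sock):
    """Read the selected options from a socket into a plain dict."""
    saved = {opt: sock.getsockopt(opt[0], opt[1]) for opt in INT_OPTIONS}
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FMT))
    saved["linger"] = struct.unpack(LINGER_FMT, raw)
    return saved


def restore_options(sock, saved):
    """Apply options previously returned by dump_options to another socket."""
    for opt in INT_OPTIONS:
        sock.setsockopt(opt[0], opt[1], saved[opt])
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FMT, *saved["linger"]))
```

Buffer sizes (SO_RCVBUF/SO_SNDBUF) are left out on purpose: the value read
back from the kernel is not the value that was set, so they need separate
handling.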

> mc_ttl - this is the multicast TTL and although the value is never used
> for TCP connections, it is saved for incoming connections and can be
> queried by getsockopt(). Connection Repair somewhat understandably does
> not save the value, but the implementation probably should.

Multicast has not been considered by tcp-repair and criu, so I cannot advise
anything about this yet. But a quick grep over the code makes me think this
is also a class 1 thing.

> Incoming traffic restart - Connection Repair does not seem to do
> anything to kick-off traffic flow from the remote side after migration.

Can you elaborate on this? What is the kick-off traffic?

> The remote will start when its retransmission timer expires, but that
> timer may have backed off to over a minute, which is kind of a long time
> to wait for traffic to flow again. By explicitly sending 3 packets with
> unchanged acknowledgments after migration, fast retransmit can be
> triggered to get incoming traffic going sooner.

If I understand you correctly, this is a 2a thing. I.e., TCP will eventually
bring both ends up to date, but it will take time. I agree that something
should be done about this, but I didn't plan to start on it before the core
TCP repair functionality is working. If you can help complete TCP repair,
that will be just great!
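
To make the "3 packets with unchanged acknowledgments" idea concrete, here
is a toy model of the remote sender's duplicate-ACK counter. The function
name and the threshold constant are illustrative; the threshold of 3 is the
usual fast retransmit value (RFC 5681). The model ignores sequence number
wrap and the other conditions a real stack checks before counting an ACK as
a duplicate.

```python
DUPACK_THRESHOLD = 3  # duplicate ACKs needed to trigger fast retransmit


def acks_until_fast_retransmit(ack_stream, snd_una):
    """Return the 1-based index of the ACK that triggers fast retransmit.

    ack_stream: acknowledgment numbers arriving at the remote sender
    snd_una:    oldest unacknowledged sequence number at the sender
    Returns None if the threshold is never reached.
    """
    dupacks = 0
    for index, ack in enumerate(ack_stream, start=1):
        if ack > snd_una:
            # New data acknowledged: advance and reset the counter.
            snd_una = ack
            dupacks = 0
        elif ack == snd_una:
            dupacks += 1
            if dupacks == DUPACK_THRESHOLD:
                return index
    return None


if __name__ == "__main__":
    # Three unchanged ACKs sent right after migration.
    print(acks_until_fast_retransmit([1000, 1000, 1000], snd_una=1000))
```

In this model the sender retransmits as soon as the third unchanged ACK
arrives, instead of waiting for a backed-off retransmission timer.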

> ---
> 
> Thanking you in advance for your time and effort.
> -DilipD.

Thanks,
Pavel

> 
> On Tue, 2012-11-20 at 22:11 +0400, Pavel Emelyanov wrote:
>> On 11/20/2012 09:45 PM, Dilip Daya wrote:
>>> Hi Pavel,
>>>
>>> On Tue, 2012-11-20 at 20:55 +0400, Pavel Emelyanov wrote:
>>>> On 11/20/2012 08:21 PM, Dilip Daya wrote:
>>>>>
>>>>> Re: tcp_time_stamp when restoring TCP connections
>>>>>
>>>>> tcp_time_stamp:
>>>>> This is the low-order 32 bits of the jiffies counter and is used to
>>>>> generate the timestamp in the TCP timestamp option.  The timestamp
>>>>> option is included in almost every outgoing TCP packet and is used for
>>>>> two purposes.
>>>>>
>>>>> For PAWS, it acts something like a high-order extension of the packet
>>>>> sequence number.  If the receiver sees a timestamp value less than what
>>>>> it previously got on the connection (modulo 2^31), it assumes the
>>>>> sequence number is from a previous wrap and discards the packet.  The
>>>>> timestamp generator must therefore never decrease.
>>>>>
>>>>> The jiffies counters on separate systems can be any value when restore
>>>>> occurs, so the restore system has a good chance of generating timestamps
>>>>> less than those of the original system and the receiver will start dropping
>>>>> packets.  This problem was discussed in the paper "TCP Connection
>>>>> Passing" by Werner Almesberger, which is the basis of previous
>>>>> connection migration code for Linux.  The solution given in that paper
>>>>> is to save an offset for each connection that is applied to
>>>>> tcp_time_stamp when generating the TCP timestamp option.  The offset is
>>>>> calculated to ensure the timestamps never decrease.
>>>>>
>>>>> Connection Repair does not appear to do anything special for timestamp
>>>>> generation and I'm wondering how it gets away with that.
>>>>>
>>>>> => So looking at Linux kernel code snippets, does the following take
>>>>>    care of this issue?
>>>>
>>>> You're right, the tcp repair feature is not complete yet. Currently we're
>>>> concentrating on making it work when doing c/r without changing the kernel.
>>>> If you can participate in this, things will get fixed faster :)
>>>
>>> Sure I'm ready to participate, but will tread softly with the help of
>>> your guidance.
>>>
>>> Let me very briefly explain the motivation that stemmed from your great
>>> work on c/r TCP connections: I'm investigating a kernel-level
>>> framework for live TCP connection failover at the sockets layer and hope
>>> to make minimal kernel changes that are acceptable to the c/r project, as
>>> well as include any changes to crtools.
>>
>> Can you elaborate a little bit more on this? Are you trying to implement a
>> system that keeps a full mirror of a socket on some other host and switches
>> to it if the original node crashes? By full mirror I mean a socket
>> which has the state (as seen by the connection peer) exactly the same as
>> that of the original one. This is slightly different from what c/r does,
>> but it makes perfect sense to come up with a unified solution for both.
>>
>>> Thanks.
>>> -DilipD.
>>>
>>>>
>>>> I think I should have a page on the wiki that lists all the issues we
>>>> haven't yet fixed with tcp repair.
>>>
>>> That will be great!
>>
>> Already did: http://criu.org/TCP_repair_TODO
>> This lists all the issues I'm aware of with the tcp repair.
>>
>>> Thanks.
>>> -DilipD.
>>>
>>>>
>>>> Thanks,
>>>> Pavel
>>>
>>> .
>>>
>>
>>
> 
> .
>