[CRIU] [Users] socket will take at least 0.5 seconds to recovery after docker restore done

Tue Jul 21 08:25:35 PDT 2015

Thanks, Saied!
The delay happens after the restore. The _existing_ socket(s) do not work,
but at the same time pinging the container IP is OK.

So I am unsure whether the delay is caused by Docker or by CRIU, because it
seems to happen only in Docker migration with CRIU.
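
One way to narrow this down would be to timestamp traffic on the existing
connection across the restore. Below is a minimal Python sketch under stated
assumptions: `probe` and `longest_gap` are hypothetical helper names (not part
of Docker or CRIU), and the peer is assumed to be a simple echo server running
outside the container.

```python
import socket
import time


def probe(sock, count, interval=0.01):
    """Send one byte at a time over an already-connected socket and
    record when each echo comes back (monotonic seconds)."""
    stamps = []
    for _ in range(count):
        sock.sendall(b"x")
        if not sock.recv(1):  # peer closed the connection
            break
        stamps.append(time.monotonic())
        time.sleep(interval)
    return stamps


def longest_gap(stamps):
    """Largest interval between two consecutive replies (0.0 if fewer than two)."""
    return max((b - a for a, b in zip(stamps, stamps[1:])), default=0.0)
```

If the probe runs inside the container while the checkpoint and restore happen,
the longest gap includes the freeze time itself, so it would need to be compared
with the same run outside Docker to isolate the extra stall.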

Saied Kazemi <saied at google.com> wrote on Tue, Jul 21, 2015 at 00:29:

> Hi Yanbao,
>
> In my opinion, running in its own network namespace and using veth devices
> should not cause a delay.  The delay is perhaps related to moving the veth
> device into the bridge (docker0) and setting up iptables rules to forward
> traffic to the container.  If you're interested, you can actually verify
> this by adding simple log output to both Docker and CRIU and looking at
> time stamps.
>
> As I mentioned previously, the network code has been refactored into
> libnetwork.  Therefore, it doesn't make sense to spend time on Docker 1.5
> networking code.  As soon as time allows, I will look into Docker 1.8 and
> libnetwork but I am afraid it won't be soon.
>
> Sorry that I cannot help more at this time.
>
> --Saied
>
>
> On Mon, Jul 20, 2015 at 12:24 AM, Yanbao Cui <yygcui at gmail.com> wrote:
>
>> Hi Saied,
>>
>> Another test shows that if I only run a process with a socket connection,
>> the connection recovers quickly after C/R.
>>
>> If I run it in Docker, the recovered connection hangs for at least 0.5
>> seconds.
>>
>> The difference is that Docker uses separate namespaces and a veth pair for
>> networking.
>>
>> Could you help check it on the Docker side?
>>
>> Yanbao Cui <yygcui at gmail.com> wrote on Sat, Jul 18, 2015 at 08:50:
>>
>>> Yeah, but I didn't use Docker 1.5 before, so I don't know whether it works there.
>>>
>>> Actually, I think the network is restored by CRIU, rather than a new one
>>> being created by Docker.
>>>
>>> As I understand it, we only rebuild the container object in the Docker daemon;
>>> the entire container is restored by CRIU, including processes, network, etc.
>>>
>>> So I think the problem is NOT caused by Docker. As I described in my
>>> previous mails, after restore the Docker container works well, except that
>>> the _existing_ restored socket hangs for at least 0.5 seconds.
>>>
>>> On Sat, Jul 18, 2015 at 12:30 AM, Saied Kazemi <saied at google.com> wrote:
>>>
>>>> On Fri, Jul 17, 2015 at 8:50 AM, Yanbao Cui <yygcui at gmail.com> wrote:
>>>>
>>>> I use the latter one. And we integrated C/R functionality into Docker
>>>>> based on https://github.com/SaiedKazemi/docker/wiki
>>>>>
>>>>
>>>>  So, you rebased Docker 1.5 code to 1.6?  Did you see the issue in
>>>> Docker 1.5?
>>>>
>>>>
>>>> And I found there is another one based on Docker 1.7
>>>>>
>>>>
>>>> Up until Docker 1.5, the network code was both in libcontainer and in
>>>> the Docker engine.  Now all network logic is in libnetwork, so there's no
>>>> point spending time on older versions.  Unfortunately I haven't had time
>>>> yet to familiarize myself with the new code.  Going forward, I suggest that
>>>> you use Docker 1.7.  It's a rebase of 1.5 to the head and is under
>>>> active development by Ross Boucher (rboucher at gmail.com) and other
>>>> community members as I am sure you know.
>>>>
>>>>
>>>>
>>>>> Did you guys test it and focus on the time consumed?
>>>>>
>>>>
>>>> No, my focus was on getting the network to restore
>>>> successfully.  I didn't make any time measurements.
>>>>
>>>> --Saied
>>>>
>>>>
>>>>
>>>>> On Fri, Jul 17, 2015 at 10:45 PM, Saied Kazemi <saied at google.com>
>>>>> wrote:
>>>>>
>>>>>> Are you doing external checkpoint restore, calling CRIU directly to
>>>>>> dump and restore the container, or are you using native "docker checkpoint"
>>>>>> and "docker restore" commands?  If latter, did you integrate C/R
>>>>>> functionality into Docker yourself?
>>>>>>
>>>>>> --Saied
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 17, 2015 at 5:56 AM, Yanbao Cui <yygcui at gmail.com> wrote:
>>>>>>
>>>>>>> I use Docker 1.6.0 and 1.6.2; both have this problem.
>>>>>>>
>>>>>>> The needed files are shared via NFS.
>>>>>>>
>>>>>>> On Fri, Jul 17, 2015 at 11:15 AM, Saied Kazemi <saied at google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Which Docker version are you using to checkpoint and restore your
>>>>>>>> containers?  Also, for migration, are you manually copying the container to
>>>>>>>> a target machine?
>>>>>>>>
>>>>>>>> --Saied
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 14, 2015 at 7:36 AM, Yanbao Cui <yygcui at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Correcting my reply:
>>>>>>>>>
>>>>>>>>> _existing_ migrated connections hang.
>>>>>>>>>
>>>>>>>>> A new connection (here I mean a new socket or a new process, not a
>>>>>>>>> manual reconnection) is OK.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yanbao Cui <yygcui at gmail.com> wrote on Tue, Jul 14, 2015 at 22:07:
>>>>>>>>>
>>>>>>>>>> _existing_ migrated connections hang.
>>>>>>>>>>
>>>>>>>>>> New connection is OK
>>>>>>>>>>
>>>>>>>>>> Pavel Emelyanov <xemul at parallels.com> wrote on Tue, Jul 14, 2015 at 21:59:
>>>>>>>>>>
>>>>>>>>>>> On 07/14/2015 04:43 PM, Yanbao Cui wrote:
>>>>>>>>>>> > The server is always running and waiting. It seems the client,
>>>>>>>>>>> > which is in the container, cannot send data out after being restored.
>>>>>>>>>>> >
>>>>>>>>>>> > For TCP, yes, the client tries to reconnect manually.
>>>>>>>>>>>
>>>>>>>>>>> You mean that after restore new connect()-s hang for a while?
>>>>>>>>>>> Why do these connect()-s happen?
>>>>>>>>>>> Or _existing_ migrated connections hang?
>>>>>>>>>>>
>>>>>>>>>>> > The delay happens after a successful restore, although the
>>>>>>>>>>> > network is recovered.
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Pavel Emelyanov <xemul at parallels.com> wrote on Tue, Jul 14, 2015 at 21:31:
>>>>>>>>>>> >
>>>>>>>>>>> >     On 07/14/2015 04:15 PM, Yanbao Cui wrote:
>>>>>>>>>>> >     > Sorry for the mistake.
>>>>>>>>>>> >     > For UDP, I mean the server can receive packets from the
>>>>>>>>>>> >     > client again.
>>>>>>>>>>> >
>>>>>>>>>>> >     So where's the 0.5 seconds delay? Server sleeps and
>>>>>>>>>>> >     doesn't wake up, packets do not reach the server or
>>>>>>>>>>> >     something else?
>>>>>>>>>>> >
>>>>>>>>>>> >     > Actually, I have analyzed the tcpdump output. In my case,
>>>>>>>>>>> >     > the client tries to reconnect to the server, but does not
>>>>>>>>>>> >     > receive a SYN+ACK, so it retransmits after 1 second
>>>>>>>>>>> >     > according to the client's rule, and then tries again.
>>>>>>>>>>> >
>>>>>>>>>>> >     During migration we don't reconnect TCP (with regular SYN,
>>>>>>>>>>> >     SYNACK, ACK sequence),
>>>>>>>>>>> >     do you reconnect them manually?
>>>>>>>>>>> >
>>>>>>>>>>> >     -- Pavel
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>> Cui Yanbao | 崔言宝
>>>>>>> --
>>>>>>> 龍生玖天，豈能安於凡塵！
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>