[CRIU] [Users] socket will take at least 0.5 seconds to recovery after docker restore done

Saied Kazemi saied at google.com
Mon Jul 20 09:29:24 PDT 2015


Hi Yanbao,

In my opinion, running in its own network namespace and using veth devices
should not cause a delay.  The delay is perhaps related to moving the veth
device into the bridge (docker0) and setting up iptables rules to forward
traffic to the container.  If you're interested, you can actually verify
this by adding simple log output to both Docker and CRIU and looking at
time stamps.
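
The time stamps can also be collected from the application side. The following is a minimal sketch, not part of Docker or CRIU: the function names (`probe`, `longest_stall`, `run_demo`, `echo`) are hypothetical, and the demo uses a local socket pair instead of a real client/server link. In a real test, `probe` would be given the TCP socket that was connected before the checkpoint. A ping that straddles the restore should then show up as one unusually long round trip, and its wall-clock send time can be compared with the Docker and CRIU log time stamps.

```python
import socket
import threading
import time


def echo(sock):
    """Echo every byte back until the peer closes the connection."""
    with sock:
        while True:
            data = sock.recv(64)
            if not data:
                break
            sock.sendall(data)


def probe(sock, count=20, interval=0.01):
    """Send `count` one-byte pings on an already connected socket.

    Returns a list of (send_time, round_trip_seconds) tuples. Wall-clock
    time is used so samples can be lined up with log time stamps.
    """
    samples = []
    for _ in range(count):
        start = time.time()
        sock.sendall(b"x")
        if not sock.recv(1):
            break  # peer closed the connection
        samples.append((start, time.time() - start))
        time.sleep(interval)
    return samples


def longest_stall(samples):
    """Return the largest round-trip time observed, or 0.0 if there are none."""
    return max((rtt for _, rtt in samples), default=0.0)


def run_demo(count=5):
    """Run the probe against a local echo thread (stand-in for the server)."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo, args=(server,), daemon=True)
    worker.start()
    try:
        return probe(client, count=count)
    finally:
        client.close()
        worker.join(timeout=2)


if __name__ == "__main__":
    results = run_demo()
    print("pings:", len(results), "longest stall (s):", longest_stall(results))
```

Running the probe once in a plain process and once in a container would show whether the stall appears only in the Docker case.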

As I mentioned previously, the network code has been refactored into
libnetwork.  Therefore, it doesn't make sense to spend time on Docker 1.5
networking code.  As soon as time allows, I will look into Docker 1.8 and
libnetwork but I am afraid it won't be soon.

Sorry that I cannot help more at this time.

--Saied


On Mon, Jul 20, 2015 at 12:24 AM, Yanbao Cui <yygcui at gmail.com> wrote:

> Hi Saied,
>
> Another test shows that if I run only a process with a socket connection,
> the connection recovers quickly after C/R.
>
> If I run it in Docker, the restored connection hangs for at least 0.5
> seconds.
>
> The difference is that Docker uses separate namespaces and a veth pair for
> networking.
>
> Could you help check this on the Docker side?
>
> On Sat, Jul 18, 2015 at 08:50, Yanbao Cui <yygcui at gmail.com> wrote:
>
>> Yeah, but I didn't use Docker 1.5 before, so I don't know whether it is OK
>> there.
>>
>> Actually, I think the network is restored by CRIU, rather than a new one
>> being created by Docker.
>>
>> As I understand it, we only rebuild the container object in the Docker
>> daemon; the entire container, including processes, network, etc., is
>> restored by CRIU.
>>
>> So I think the problem is NOT caused by Docker. As I described in my
>> previous mails, after restore the Docker container works well, except that
>> the _existing_ restored socket hangs for at least 0.5 seconds.
>>
>> On Sat, Jul 18, 2015 at 12:30 AM, Saied Kazemi <saied at google.com> wrote:
>>
>>> On Fri, Jul 17, 2015 at 8:50 AM, Yanbao Cui <yygcui at gmail.com> wrote:
>>>
>>>> I use the latter one, and we integrated C/R functionality into Docker
>>>> based on https://github.com/SaiedKazemi/docker/wiki
>>>>
>>>
>>>  So, you rebased Docker 1.5 code to 1.6?  Did you see the issue in
>>> Docker 1.5?
>>>
>>>
>>>> I also found that there is another one based on Docker 1.7.
>>>>
>>>
>>> Up until Docker 1.5, the network code was both in libcontainer and in
>>> the Docker engine.  Now all network logic is in libnetwork, so there's no
>>> point spending time on older versions.  Unfortunately I haven't had time
>>> yet to familiarize myself with the new code.  Going forward, I suggest that
>>> you use the one based on Docker 1.7.  It's a rebase of 1.5 to the head and is under
>>> active development by Ross Boucher (rboucher at gmail.com) and other
>>> community members as I am sure you know.
>>>
>>>
>>>
>>>> Did you test it and measure the time consumed?
>>>>
>>>
>>> No, my focus was on getting the network to restore successfully.
>>> I didn't make any time measurements.
>>>
>>> --Saied
>>>
>>>
>>>
>>>> On Fri, Jul 17, 2015 at 10:45 PM, Saied Kazemi <saied at google.com>
>>>> wrote:
>>>>
>>>>> Are you doing external checkpoint restore, calling CRIU directly to
>>>>> dump and restore the container, or are you using native "docker checkpoint"
>>>>> and "docker restore" commands?  If the latter, did you integrate C/R
>>>>> functionality into Docker yourself?
>>>>>
>>>>> --Saied
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2015 at 5:56 AM, Yanbao Cui <yygcui at gmail.com> wrote:
>>>>>
>>>>>> I use Docker 1.6.0 and 1.6.2; both have this problem.
>>>>>>
>>>>>> The needed files are shared via NFS.
>>>>>>
>>>>>> On Fri, Jul 17, 2015 at 11:15 AM, Saied Kazemi <saied at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Which Docker version are you using to checkpoint and restore your
>>>>>>> containers?  Also, for migration, are you manually copying the container to
>>>>>>> a target machine?
>>>>>>>
>>>>>>> --Saied
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 14, 2015 at 7:36 AM, Yanbao Cui <yygcui at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Correcting my reply:
>>>>>>>>
>>>>>>>> _existing_ migrated connections hang.
>>>>>>>>
>>>>>>>> New connections (by which I mean a new socket or a new process, not
>>>>>>>> a manual reconnection) are OK.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 14, 2015 at 22:07, Yanbao Cui <yygcui at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> _existing_ migrated connections hang.
>>>>>>>>>
>>>>>>>>> New connection is OK
>>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2015 at 21:59, Pavel Emelyanov <xemul at parallels.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 07/14/2015 04:43 PM, Yanbao Cui wrote:
>>>>>>>>>> > The server is always running and waiting. It seems the client, which is
>>>>>>>>>> > in the container, cannot send data out after being restored.
>>>>>>>>>> >
>>>>>>>>>> > For TCP, yeah, the client tries to reconnect manually.
>>>>>>>>>>
>>>>>>>>>> You mean that after restore new connect()-s hang for a while? Why
>>>>>>>>>> do these connect()-s happen?
>>>>>>>>>> Or _existing_ migrated connections hang?
>>>>>>>>>>
>>>>>>>>>> > The delay happens after the restore succeeds, even though the network
>>>>>>>>>> > has been recovered.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Jul 14, 2015 at 21:31, Pavel Emelyanov <xemul at parallels.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> >     On 07/14/2015 04:15 PM, Yanbao Cui wrote:
>>>>>>>>>> >     > Sorry for the mistake.
>>>>>>>>>> >     > For UDP, I mean the server can receive packets from the client again.
>>>>>>>>>> >
>>>>>>>>>> >     So where's the 0.5 seconds delay? Server sleeps and doesn't wake up,
>>>>>>>>>> >     packets do not reach the server or something else?
>>>>>>>>>> >
>>>>>>>>>> >     > Actually, I have analyzed the tcpdump output. In my case, the client
>>>>>>>>>> >     > tries to reconnect to the server but does not receive a SYN+ACK, so it
>>>>>>>>>> >     > retransmits after 1 second according to the client's rule, and then
>>>>>>>>>> >     > tries again.
>>>>>>>>>> >
>>>>>>>>>> >     During migration we don't reconnect TCP (with regular SYN, SYNACK, ACK
>>>>>>>>>> >     sequence), do you reconnect them manually?
>>>>>>>>>> >
>>>>>>>>>> >     -- Pavel
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> CRIU mailing list
>>>>>>>> CRIU at openvz.org
>>>>>>>> https://lists.openvz.org/mailman/listinfo/criu
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>> Cui Yanbao | 崔言宝
>>>>>> --
>>>>>> 龍生玖天,豈能安於凡塵!
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>