<div dir="ltr">My branch contains only a few minor changes from what was already there, mostly changes to hard-coded values. I don't remember exactly how the process works, but I think the only "solution" to the race is to pause the container before the checkpoint and filesystem operations, and then unpause it on resume. I'm not sure whether the CRIU bug that prevented checkpointing frozen processes still exists. I also know that Docker doesn't currently support incremental checkpointing, so this solution may be somewhat slow in real-world use (though maybe p.haul isn't actually using `docker checkpoint`?).</div><br><div class="gmail_quote"><div dir="ltr">On Wed, Feb 22, 2017 at 2:49 AM Pavel Emelyanov <<a href="mailto:xemul@virtuozzo.com">xemul@virtuozzo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 02/20/2017 10:47 PM, Lele Ma wrote:<br class="gmail_msg">
> On Mon, Feb 20, 2017 at 2:37 PM, Pavel Emelyanov <<a href="mailto:xemul@virtuozzo.com" class="gmail_msg" target="_blank">xemul@virtuozzo.com</a>> wrote:<br class="gmail_msg">
>> On 02/20/2017 10:23 PM, Lele Ma wrote:<br class="gmail_msg">
>>><br class="gmail_msg">
>>> On Mon, Feb 20, 2017 at 1:16 PM, Pavel Emelyanov <<a href="mailto:xemul@virtuozzo.com" class="gmail_msg" target="_blank">xemul@virtuozzo.com</a>> wrote:<br class="gmail_msg">
>>><br class="gmail_msg">
>>> On 02/19/2017 10:50 PM, Lele Ma wrote:<br class="gmail_msg">
>>> > Hi All,<br class="gmail_msg">
>>> ><br class="gmail_msg">
>>> > I am testing container live migration with this GitHub repo <<a href="https://github.com/boucher/docker/tree/v1.10_2-16-16-experimental" rel="noreferrer" class="gmail_msg" target="_blank">https://github.com/boucher/docker/tree/v1.10_2-16-16-experimental</a>> for docker-1.10-dev. I found that the container is not restored exactly where it was checkpointed. For example:<br class="gmail_msg">
>>> ><br class="gmail_msg">
>>> > The container I run<br class="gmail_msg">
>>> > docker run -d busybox /bin/sh -c 'echo > /foo; max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo; i=$(expr $i + 1 ); sleep 0.0001; done'<br class="gmail_msg">
>>> ><br class="gmail_msg">
>>> > After migrating with p.haul, I got this /foo on the target node:<br class="gmail_msg">
>>> > .....<br class="gmail_msg">
>>> > Sun Feb 19 03:23:13 UTC 2017<br class="gmail_msg">
>>> > 1487474593<br class="gmail_msg">
>>> > i=4247<br class="gmail_msg">
>>> > Sun Feb 19 03:23:13 UTC 2017<br class="gmail_msg">
>>> > 1487474593<br class="gmail_msg">
>>> > i=4248 -----> before migration<br class="gmail_msg">
>>> > i=7545 -----> after migration (it is supposed to be i=4249)<br class="gmail_msg">
>>> > Sun Feb 19 03:23:20 UTC 2017<br class="gmail_msg">
>>> > 1487474600<br class="gmail_msg">
>>> > i=7546<br class="gmail_msg">
>>> > Sun Feb 19 03:23:20 UTC 2017<br class="gmail_msg">
>>> > 1487474600<br class="gmail_msg">
>>> > i=7547<br class="gmail_msg">
>>> > ......<br class="gmail_msg">
>>> > The printed numbers jump from 'i=4248' to 'i=7545' instead of increasing by one. It seems that part of<br class="gmail_msg">
>>> > the container's execution state is lost, but I am not sure where it goes wrong. However, when I<br class="gmail_msg">
>>> > checkpoint and restore the container locally, the numbers increase continuously with no such jump.<br class="gmail_msg">
>>><br class="gmail_msg">
>>> Where do you get these numbers from? Docker console or some file on disk?<br class="gmail_msg">
>>><br class="gmail_msg">
>>><br class="gmail_msg">
>>> It's from the file '/foo' inside container. ( The container is running /bin/sh -c 'echo > /foo;<br class="gmail_msg">
>>> max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo;<br class="gmail_msg">
>>> i=$(expr $i + 1 ); sleep 0.0001; done' )<br class="gmail_msg">
>><br class="gmail_msg">
>> Then this is likely a race between images sync and filesystem sync.<br class="gmail_msg">
>> You can check your /foo file on the source node right after container<br class="gmail_msg">
>> migration, it should contain the missing numbers :)<br class="gmail_msg">
>><br class="gmail_msg">
>> Which p.haul do you use, btw?<br class="gmail_msg">
><br class="gmail_msg">
> Thank you. But how can we avoid the race?<br class="gmail_msg">
<br class="gmail_msg">
Somewhere the final rsync is missing. But it's just a guess, I'd suggest<br class="gmail_msg">
that we first check whether it's really the case. Can you check the /foo<br class="gmail_msg">
files on both source and destination nodes?<br class="gmail_msg">
<br class="gmail_msg">
> I am using this repo from<br class="gmail_msg">
> Ross Boucher: <a href="https://github.com/boucher/p.haul/tree/docker-1.10" rel="noreferrer" class="gmail_msg" target="_blank">https://github.com/boucher/p.haul/tree/docker-1.10</a><br class="gmail_msg">
<br class="gmail_msg">
Ah :) That's Ross' fork. Let's ask Ross to join us in this discussion.<br class="gmail_msg">
<br class="gmail_msg">
-- Pavel<br class="gmail_msg">
<br class="gmail_msg">
</blockquote></div>
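<div dir="ltr">To check Pavel's guess, it may help to scan each copy of /foo for jumps in the counter rather than reading it by eye. The following is only a rough sketch: it assumes the file format shown in the quoted output (lines starting with `i=N` mixed with date and epoch lines), and `find_gaps` is a name made up for illustration, not part of p.haul or Docker.</div>

```python
import re

# Matches the counter lines written by the test loop, e.g. "i=4248".
# Trailing text on the same line (such as a hand-written annotation) is ignored.
COUNTER = re.compile(r"^i=(\d+)")


def find_gaps(lines):
    """Return (previous, current) pairs where the counter did not advance by one."""
    gaps = []
    prev = None
    for line in lines:
        match = COUNTER.match(line.strip())
        if match is None:
            continue  # date and epoch-timestamp lines carry no counter
        cur = int(match.group(1))
        if prev is not None and cur != prev + 1:
            gaps.append((prev, cur))
        prev = cur
    return gaps


# Excerpt in the same shape as the output reported in the thread.
sample = [
    "Sun Feb 19 03:23:13 UTC 2017",
    "1487474593",
    "i=4247",
    "Sun Feb 19 03:23:13 UTC 2017",
    "1487474593",
    "i=4248",
    "i=7545",
    "Sun Feb 19 03:23:20 UTC 2017",
    "1487474600",
    "i=7546",
]

print(find_gaps(sample))  # [(4248, 7545)]
```

<div dir="ltr">Running the same function over the lines of /foo from the source node and from the destination node should show whether the source copy holds the entries missing on the destination, which is what a race between the image sync and the filesystem sync would predict.</div>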