[CRIU] [P.haul + Docker] Missing Running States During Live Migration of Container

Tue Feb 28 07:13:43 PST 2017

On 02/22/2017 10:16 PM, Ross Boucher wrote:
> My branch contains just a few minor changes from what was already there, mostly changes to hard-coded
> values. I don't really remember exactly how the process works, but I think the only "solution" to the
> race problem is going to be to pause the container before performing the checkpoint and filesystem
> operations, and then unpause it on resume. I'm not sure whether the CRIU bug with not being able to checkpoint
> frozen processes still exists, and I also know that Docker doesn't currently support the incremental
> checkpointing stuff, so this solution may be somewhat slow in real-world use (though maybe p.haul isn't
> actually using `docker checkpoint`?)

It doesn't :( The newer plan, actually, is to get go phaul into a working state
and build "native" live migration in Docker using this Go code.

-- Pavel

> On Wed, Feb 22, 2017 at 2:49 AM Pavel Emelyanov <xemul at virtuozzo.com> wrote:
> 
>     On 02/20/2017 10:47 PM, Lele Ma wrote:
>     > On Mon, Feb 20, 2017 at 2:37 PM, Pavel Emelyanov <xemul at virtuozzo.com> wrote:
>     >> On 02/20/2017 10:23 PM, Lele Ma wrote:
>     >>>
>     >>> On Mon, Feb 20, 2017 at 1:16 PM, Pavel Emelyanov <xemul at virtuozzo.com> wrote:
>     >>>
>     >>>     On 02/19/2017 10:50 PM, Lele Ma wrote:
>     >>>     > Hi All,
>     >>>     >
>     >>>     > I am testing container live migration with this GitHub repo <https://github.com/boucher/docker/tree/v1.10_2-16-16-experimental> for docker-1.10-dev. I found that the container is not restored exactly where it was checkpointed. For example:
>     >>>     >
>     >>>     > The container I run
>     >>>     >      docker run  -d busybox  /bin/sh -c 'echo > /foo; max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo; i=$(expr $i + 1 ); sleep 0.0001; done'
>     >>>     >
>     >>>     > After migrating with p.haul, I got this /foo on the target node:
>     >>>     > .....
>     >>>     > Sun Feb 19 03:23:13 UTC 2017
>     >>>     > 1487474593
>     >>>     > i=4247
>     >>>     > Sun Feb 19 03:23:13 UTC 2017
>     >>>     > 1487474593
>     >>>     > i=4248                       -----> before migration
>     >>>     > i=7545                       -----> after migration (it is supposed to be i=4249)
>     >>>     > Sun Feb 19 03:23:20 UTC 2017
>     >>>     > 1487474600
>     >>>     > i=7546
>     >>>     > Sun Feb 19 03:23:20 UTC 2017
>     >>>     > 1487474600
>     >>>     > i=7547
>     >>>     > ......
>     >>>     > The printed numbers jump from 'i=4248' to 'i=7545' instead of increasing by one. It seems that part of the
>     >>>     > container's computation state is skipped, but I am not sure where it goes wrong. However, when I
>     >>>     > checkpoint and restore the container locally, the numbers increase continuously with no such jump.
>     >>>
>     >>>     Where do you get these numbers from? Docker console or some file on disk?
>     >>>
>     >>>
>     >>> It's from the file '/foo' inside container. ( The container is running /bin/sh -c 'echo > /foo;
>     >>> max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo;
>     >>> i=$(expr $i + 1 ); sleep 0.0001; done' )
>     >>
>     >> Then this is likely a race between images sync and filesystem sync.
>     >> You can check your /foo file on the source node right after container
>     >> migration, it should contain the missing numbers :)
>     >>
>     >> What p.haul do you use, btw?
>     >
>     > Thank you. But how can we avoid the race?
> 
>     Somewhere the final rsync is missing. But it's just a guess, I'd suggest
>     that we first check whether it's really the case. Can you check the /foo
>     files on both source and destination nodes?
> 
>     > I am using this repo from
>     > Ross Boucher: https://github.com/boucher/p.haul/tree/docker-1.10
> 
>     Ah :) That's Ross' fork. Let's ask Ross to join us in this discussion.
> 
>     -- Pavel
>
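
For readers following the thread, here is a toy model of the race guessed at
above (filesystem synced first, memory image dumped later). It is only a
sketch under stated assumptions: the Container class, migrate(), has_gap()
and the pause_first / final_sync switches are hypothetical names made up for
illustration, and none of this is p.haul, CRIU or Docker code. It shows why
/foo on the destination can jump while the source copy still holds the
missing lines, and how either pausing first or a final filesystem sync would
close the gap in this model.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy stand-in for the busybox loop: memory state `i`, file state `foo`."""
    i: int = 0
    foo: list = field(default_factory=list)
    frozen: bool = False

    def run(self, steps):
        # Each step appends "i=<n>" to /foo and increments the counter.
        # A frozen (paused) container does nothing.
        if self.frozen:
            return
        for _ in range(steps):
            self.foo.append(f"i={self.i}")
            self.i += 1


def migrate(src, steps_between_syncs, pause_first=False, final_sync=False):
    """Hypothetical migration: copy the filesystem, dump memory later, restore."""
    if pause_first:
        src.frozen = True            # pause before any copying starts
    fs_copy = list(src.foo)          # filesystem sync (think: rsync of /foo)
    src.run(steps_between_syncs)     # source keeps writing unless frozen
    src.frozen = True                # the checkpoint stops the process...
    mem_copy = src.i                 # ...and dumps its in-memory counter
    if final_sync:
        fs_copy = list(src.foo)      # final filesystem sync after the dump
    dst = Container(i=mem_copy, foo=fs_copy)
    dst.run(3)                       # restored container resumes on the target
    return dst


def has_gap(foo):
    """True if the logged counter values are not consecutive."""
    nums = [int(line.split("=")[1]) for line in foo]
    return any(b != a + 1 for a, b in zip(nums, nums[1:]))
```

In this model, migrate(src, 4) after src.run(5) leaves a jump in the
destination log, while the source's foo still contains the skipped entries.
Passing pause_first=True or final_sync=True yields a consecutive log.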