[CRIU] [P.haul + Docker] Missing Running States During Live Migration of Container

Ross Boucher rboucher at gmail.com
Wed Feb 22 11:16:03 PST 2017


My branch contains just a few minor changes from what was already there,
mostly changes to hard-coded values. I don't remember exactly how the
process works, but I think the only "solution" to the race problem is to
pause the container before performing the checkpoint and filesystem
operations, and then unpause it on resume. I'm not sure whether the CRIU
bug with not being able to checkpoint frozen processes still exists. I also
know that Docker doesn't currently support incremental checkpointing, so
this solution may be somewhat slow in real-world use (though maybe p.haul
isn't actually using `docker checkpoint`?).
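
As a rough sketch of that ordering (not taken from p.haul or from the fork),
the idea is: pause, checkpoint, sync the filesystem, then resume. Only
`docker pause`, `docker unpause` and `rsync -a` are real commands here. The
helpers `do_checkpoint` and `do_restore_and_unpause` are hypothetical
placeholders, because the actual checkpoint/restore invocation depends on the
Docker and p.haul versions in use, and whether CRIU can checkpoint a paused
container at all is exactly the open question above.

```shell
#!/bin/sh
# Sketch only. DOCKER and RSYNC can be overridden (e.g. with a stub) to
# dry-run the ordering without touching a real container.
DOCKER="${DOCKER:-docker}"
RSYNC="${RSYNC:-rsync}"

# Hypothetical placeholder: the real checkpoint step is version-specific.
do_checkpoint() {
    echo "checkpoint $1 (placeholder)"
}

# Hypothetical placeholder: restore on the destination, then unpause there.
do_restore_and_unpause() {
    echo "restore and unpause $1 on destination (placeholder)"
}

# migrate_paused CONTAINER SRC_ROOTFS DST_ROOTFS
migrate_paused() {
    ctr="$1"; src_rootfs="$2"; dst_rootfs="$3"

    # 1. Freeze the container so nothing inside it can write to its files.
    "$DOCKER" pause "$ctr" || return 1

    # 2. Checkpoint and copy the filesystem while nothing can change.
    if do_checkpoint "$ctr" && "$RSYNC" -a "$src_rootfs/" "$dst_rootfs/"; then
        # 3. Resume on the destination.
        do_restore_and_unpause "$ctr"
    else
        # A step failed: let the source container keep running.
        "$DOCKER" unpause "$ctr"
        return 1
    fi
}
```

The point of the ordering is that the memory image and the files are both
captured while the container is frozen, so neither can get ahead of the
other. The cost is that the container is stopped for the whole copy.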

On Wed, Feb 22, 2017 at 2:49 AM Pavel Emelyanov <xemul at virtuozzo.com> wrote:

> On 02/20/2017 10:47 PM, Lele Ma wrote:
> > On Mon, Feb 20, 2017 at 2:37 PM, Pavel Emelyanov <xemul at virtuozzo.com> wrote:
> >> On 02/20/2017 10:23 PM, Lele Ma wrote:
> >>>
> >>> On Mon, Feb 20, 2017 at 1:16 PM, Pavel Emelyanov <xemul at virtuozzo.com> wrote:
> >>>
> >>>     On 02/19/2017 10:50 PM, Lele Ma wrote:
> >>>     > Hi All,
> >>>     >
> >>>     > I am testing container live migration with this GitHub repo
> >>>     > <https://github.com/boucher/docker/tree/v1.10_2-16-16-experimental>
> >>>     > for docker-1.10-dev. I found that the container is not restored exactly
> >>>     > where it was checkpointed. For example:
> >>>     >
> >>>     > The container I run:
> >>>     >      docker run -d busybox /bin/sh -c 'echo > /foo; max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo; i=$(expr $i + 1 ); sleep 0.0001; done'
> >>>     >
> >>>     > After migrating with p.haul, I got this /foo on the target node:
> >>>     > .....
> >>>     > Sun Feb 19 03:23:13 UTC 2017
> >>>     > 1487474593
> >>>     > i=4247
> >>>     > Sun Feb 19 03:23:13 UTC 2017
> >>>     > 1487474593
> >>>     > i=4248                       -----> before migration
> >>>     > i=7545                       -----> after migration (it is supposed to be i=4249)
> >>>     > Sun Feb 19 03:23:20 UTC 2017
> >>>     > 1487474600
> >>>     > i=7546
> >>>     > Sun Feb 19 03:23:20 UTC 2017
> >>>     > 1487474600
> >>>     > i=7547
> >>>     > ......
> >>>     > The printed numbers jump from 'i=4248' to 'i=7545' instead of increasing by one.
> >>>     > It seems that some of the Docker container's computation state is ignored, but I
> >>>     > am not sure where it goes wrong. However, when I checkpoint and restore the
> >>>     > container locally, the numbers increase continuously with no such jump.
> >>>
> >>>     Where do you get these numbers from? Docker console or some file on disk?
> >>>
> >>>
> >>> It's from the file '/foo' inside the container. (The container is running
> >>> /bin/sh -c 'echo > /foo; max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >> /foo; echo "i=$i" >> /foo; i=$(expr $i + 1 ); sleep 0.0001; done')
> >>
> >> Then this is likely a race between images sync and filesystem sync.
> >> You can check your /foo file on the source node right after container
> >> migration, it should contain the missing numbers :)
> >>
> >> What p.haul do you use, btw?
> >
> > Thank you. But how can we avoid the race?
>
> Somewhere the final rsync is missing. But it's just a guess, I'd suggest
> that we first check whether it's really the case. Can you check the /foo
> files on both source and destination nodes?
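
One way to run the check suggested above is to scan a copy of /foo for
places where the `i=N` counter skips. This is a minimal sketch; `find_gaps`
is a name made up for illustration, and it assumes the file has the layout
shown earlier in the thread (one `i=N` entry per line, with the date lines
in between).

```shell
#!/bin/sh
# find_gaps FILE: report every point where the i=N counter in FILE
# does not increase by exactly one. Lines that are not "i=<number>"
# (the date and timestamp lines) are ignored.
find_gaps() {
    awk -F= '/^i=[0-9]+$/ {
        cur = $2 + 0
        if (seen && cur != prev + 1)
            printf "gap: i=%d -> i=%d (%d missing)\n", prev, cur, cur - prev - 1
        prev = cur
        seen = 1
    }' "$1"
}
```

Run against the destination's /foo from the report, this would print
`gap: i=4248 -> i=7545 (3296 missing)`. If the same check on the source
node's /foo prints nothing, the missing entries were written on the source
after the filesystem was copied, which is what the race explanation predicts.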
>
> > I am using this repo from
> > Ross Boucher: https://github.com/boucher/p.haul/tree/docker-1.10
>
> Ah :) That's Ross' fork. Let's ask Ross to join us in this discussion.
>
> -- Pavel
>
>