[CRIU] [P.haul + Docker] Missing Running States During Live Migration of Container

Lele Ma lelema.cn at gmail.com
Sun Mar 5 14:48:25 PST 2017


On Wed, Feb 22, 2017 at 2:49 AM, Pavel Emelyanov <xemul at virtuozzo.com>
wrote:

> On 02/20/2017 10:47 PM, Lele Ma wrote:
> > On Mon, Feb 20, 2017 at 2:37 PM, Pavel Emelyanov <xemul at virtuozzo.com>
> wrote:
> >> On 02/20/2017 10:23 PM, Lele Ma wrote:
> >>>
> >>> On Mon, Feb 20, 2017 at 1:16 PM, Pavel Emelyanov <xemul at virtuozzo.com> wrote:
> >>>
> >>>     On 02/19/2017 10:50 PM, Lele Ma wrote:
> >>>     > Hi All,
> >>>     >
> >>>     > I am testing container live migration with this GitHub repo for
> docker-1.10-dev:
> https://github.com/boucher/docker/tree/v1.10_2-16-16-experimental
> I found that the container is not restored exactly where it was
> checkpointed. For example:
> >>>     >
> >>>     > The container I run
> >>>     >      docker run  -d busybox  /bin/sh -c 'echo > /foo;
> max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >>
> /foo; echo "i=$i" >> /foo; i=$(expr $i + 1 ); sleep 0.0001; done'
> >>>     >
> >>>     > After migrating with p.haul, I got this /foo on the target node:
> >>>     > .....
> >>>     > Sun Feb 19 03:23:13 UTC 2017
> >>>     > 1487474593
> >>>     > i=4247
> >>>     > Sun Feb 19 03:23:13 UTC 2017
> >>>     > 1487474593
> >>>     > i=4248                       -----> before migration
> >>>     > i=7545                       -----> after migration (it is
> supposed to be i=4249)
> >>>     > Sun Feb 19 03:23:20 UTC 2017
> >>>     > 1487474600
> >>>     > i=7546
> >>>     > Sun Feb 19 03:23:20 UTC 2017
> >>>     > 1487474600
> >>>     > i=7547
> >>>     > ......
> >>>     > The printed numbers jump from 'i=4248' to 'i=7545' instead of
> increasing by one. It seems that some of the running state of the
> >>>     > Docker container is lost, but I am not sure where it goes
> wrong. However, when I
> >>>     > checkpoint and restore the container locally, the numbers
> increase continuously with no such jump.
> >>>
> >>>     Where do you get these numbers from? Docker console or some file
> on disk?
> >>>
> >>>
> >>> It's from the file '/foo' inside container. ( The container is running
> /bin/sh -c 'echo > /foo;
> >>> max=1000000; i=0; while [ $i -lt $max ] ; do date >> /foo; date +%s >>
> /foo; echo "i=$i" >> /foo;
> >>> i=$(expr $i + 1 ); sleep 0.0001; done' )
> >>
> >> Then this is likely a race between images sync and filesystem sync.
> >> You can check your /foo file on the source node right after container
> >> migration, it should contain the missing numbers :)
> >>
> >> What p.haul do you use, btw?
> >
> > Thank you. But how can we avoid the race?
>
> Somewhere the final rsync is missing. But it's just a guess, I'd suggest
> that we first check whether it's really the case. Can you check the /foo
> files on both source and destination nodes?
>
>
Thank you for your help! If the migrated container is restored on the
source node, the numbers in /foo are fine: each increases by 1 with no
jump. But when it is restored on the target node, we see the jump.
So does this mean that an rsync is missing somewhere? How can I find
which one is missing?
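
To make the comparison between the two nodes less manual, something like
the following could be run against a copy of /foo from each node. It is
only a rough sketch: find_jumps() is a helper name made up for this
example (it is not part of p.haul or CRIU), and it assumes the file
contains plain "i=N" lines as produced by the loop above. The sample data
is the excerpt from the target node quoted earlier in this thread.

```python
import re


def find_jumps(lines):
    """Return (previous, current) pairs where the counter did not
    increase by exactly one between consecutive "i=N" lines."""
    jumps = []
    prev = None
    for line in lines:
        m = re.fullmatch(r"i=(\d+)", line.strip())
        if not m:
            continue  # skip the date and epoch lines
        cur = int(m.group(1))
        if prev is not None and cur != prev + 1:
            jumps.append((prev, cur))
        prev = cur
    return jumps


# Excerpt of /foo as seen on the target node (from the report above).
sample = """Sun Feb 19 03:23:13 UTC 2017
1487474593
i=4247
Sun Feb 19 03:23:13 UTC 2017
1487474593
i=4248
i=7545
Sun Feb 19 03:23:20 UTC 2017
1487474600
i=7546
""".splitlines()

print(find_jumps(sample))  # [(4248, 7545)]
```

Running the same function on the source node's copy should return an
empty list if the missing numbers are indeed present there.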

I found that the last call to the __run_rsync() method happens right
after the container is dumped (with the checkpoint command). Could this
be where the race is? Here is some logging output from the console (on
the source node):

17:17:05.374: 101000: Final dump and restore
17:17:05.409: 101000:     Making directory /var/local/p.haul-fs/dmp-qY7DBC-17.03.05-17.17/img/1
17:17:05.410: 101000: Dump docker container 64541be4a5de0b655e764088d39cc227e4153aba197c9852de3fea5cd3e2ed0e
17:17:05.411: 101000:     /usr/bin/docker checkpoint --image-dir=/var/local/p.haul-fs/dmp-qY7DBC-17.03.05-17.17/img/1 64541be4a5de; log file: /tmp/docker_checkpoint.log
17:17:05.985: 101000:     /usr/bin/docker checkpoint --image-dir=/var/local/p.haul-fs/dmp-qY7DBC-17.03.05-17.17/img/1 64541be4a5de; log file: /tmp/docker_checkpoint.log
17:17:06.024: 101000: Final FS and images sync
17:17:06.024: 101000: Doing final FS sync
17:17:06.025: 101000:      calling __run_rsync(), logging file: /var/local/p.haul-fs/dmp-qY7DBC-17.03.05-17.17/rsync.log
17:17:08.567: 101000: Sending images to target
17:17:08.603: 101000:     Pack
17:17:08.619: 101000:     Add htype images
17:17:09.109: 101000: Asking target host to restore
17:17:11.887: 101000: Restored on target host
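
To see how long each step takes and in which order the steps run, the
timestamps can be turned into per-step durations. This is a minimal
sketch that assumes the "HH:MM:SS.mmm: pid: message" layout seen in the
log above; phase_durations() is a hypothetical helper written for this
message, not something p.haul provides. The sample is a subset of the
log lines above, with the long paths left out.

```python
import re
from datetime import datetime

# Matches lines such as "17:17:06.024: 101000: Doing final FS sync".
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}): \d+:\s+(.*)$")


def phase_durations(log_text):
    """Return (message, seconds until the next logged message) pairs.
    Lines without a leading timestamp (wrapped continuations) are skipped."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            events.append((ts, m.group(2)))
    return [
        (msg, round((t1 - t0).total_seconds(), 3))
        for (t0, msg), (t1, _) in zip(events, events[1:])
    ]


sample_log = """\
17:17:05.985: 101000:     /usr/bin/docker checkpoint
17:17:06.024: 101000: Final FS and images sync
17:17:06.024: 101000: Doing final FS sync
17:17:06.025: 101000:      calling __run_rsync(), logging file:
17:17:08.567: 101000: Sending images to target
17:17:09.109: 101000: Asking target host to restore
17:17:11.887: 101000: Restored on target host
"""

for message, seconds in phase_durations(sample_log):
    print(f"{seconds:7.3f}s  {message}")
```

On this log the __run_rsync() step accounts for roughly 2.5 seconds, and
it starts only after the second "docker checkpoint" line is logged.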


Lele