[CRIU] Re: Restore error

Gabriel Southern southerngs at gmail.com
Wed Mar 2 15:04:42 PST 2016


Thanks again for the explanation about the failed restore.  I added some
salt to the name generated by transport_name_gen() by mixing in a random
number, and that seems to have solved the problem I was having with the
socket restore.
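
For reference, here is a minimal sketch of the kind of change I made,
assuming the snprintf-based name generation in criu/files.c; the
transport_salt variable and how it gets seeded are my illustration, not
an exact patch:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Chosen once, before the restored tasks fork, so that the binding and
 * connecting sides agree on the name (I seeded it with a random number;
 * how best to seed it is an open question). */
static unsigned int transport_salt;

static int transport_name_gen(struct sockaddr_un *addr, int *len,
			      int pid, int fd)
{
	addr->sun_family = AF_UNIX;
	/* The name used to be just "crtools-fd-<pid>-<fd>"; mixing the
	 * salt in keeps concurrent restores from binding the same
	 * address. */
	snprintf(addr->sun_path, sizeof(addr->sun_path),
		 "x/crtools-fd-%u-%d-%d", transport_salt, pid, fd);
	*len = SUN_LEN(addr);
	*addr->sun_path = '\0';	/* leading NUL => abstract namespace */
	return 0;
}

Since abstract unix socket names are scoped per network namespace, I
assume this is why re-using one netns across concurrent restores made
the collisions visible in the first place.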

But now I am running into another sporadic problem with the restore.  An
example of the error is:

(00.071472)      1: mnt: Mountpoint 6152 (@./var/lib/docker/0.0/containers/cfa3e36de2c78f0ae2694f62fa173f0a80154827c6bb4c6d8a6b69aafadb9635/criu.work/.criu.cgyard.fdEG2f/systemd) w/o parent 6143
(00.071474)      1: Error (mount.c:366): mnt: No root found for mountpoint 6152 (@./var/lib/docker/0.0/containers/cfa3e36de2c78f0ae2694f62fa173f0a80154827c6bb4c6d8a6b69aafadb9635/criu.work/.criu.cgyard.fdEG2f/systemd)
(00.098587) Error (cr-restore.c:1304): 23529 killed by signal 9
(00.232412) Error (cr-restore.c:2130): Restoring FAILED.

As with the previous error I reported, this one only occurs when multiple
restore operations are running concurrently.  If a restore fails and I
repeat it later, when no other operations are in progress, it succeeds.
I've included the log file for a failed restore, along with the same
operation completing successfully, here:
https://gist.github.com/southerngs/a1bd786fcc4834678cfb

Any debugging suggestions are appreciated.

Thanks,

-Gabriel

On Mon, Feb 29, 2016 at 4:02 AM, Pavel Emelyanov <xemul at virtuozzo.com>
wrote:

> On 02/27/2016 10:39 AM, Gabriel Southern wrote:
> > Hi,
> >
> > I'm using CRIU with Docker via the Docker fork
> > https://github.com/boucher/docker.git.
> >
> > Sometimes my attempts to restore a container fail, and when I look in
> > the criu logs I see something like the following error (full restore
> > log available here: https://gist.github.com/southerngs/34d3ce928f35e24e3dbb)
> >
> > (00.401627)      1: Restoring resources
> > (00.401633)     22: Restoring fd 0 (state -> prepare)
> > (00.401644)     22: Create transport fd /crtools-fd-22-0
> > (00.401644)      1: Opening fdinfo-s
> > (00.401652)      1: Restoring fd 0 (state -> prepare)
> > (00.401655)      1: Restoring fd 1 (state -> prepare)
> > (00.401658)      1: Restoring fd 2 (state -> prepare)
> > (00.401660)      1: Restoring fd 0 (state -> create)
> > (00.401663)     22: Error (files.c:840): Can't bind unix socket /crtools-fd-22-0: Address already in use
> > (00.401684)      1: Create fd for 0
> > (00.401687)      1: Wait fdinfo pid=22 fd=0
> > (00.403238)      1: Error (cr-restore.c:1302): 22 exited, status=1
> > (00.459682) Error (cr-restore.c:1304): 6804 killed by signal 9
> > (00.526451) Error (cr-restore.c:2130): Restoring FAILED.
> >
> > This error is not completely deterministic.  Usually, if the restore
> > attempt fails and I wait and retry the command, it will succeed the
> > second time.  The problem only occurs when there is a lot of
> > checkpoint/restore activity going on.
>
> Ah! :) This info is really helpful.  You're re-using the netns for all
> the restores (the "NS mask to use 2c020000" line in the log).  Can you
> help us understand what netns trick you're playing?
>
> And the bug is that criu creates the unix sockets that help restore
> files with names that are not unique enough.  The fix would be to tune
> the open_transport_fd() routine and add some "salt" to the call to
> transport_name_gen() that generates the unix socket name, but without a
> proper understanding of what's going on with the netns we could arrive
> at a suboptimal solution.
>
> > But my use case involves restoring a lot of containers simultaneously
> > and letting them run for a short period of time.  I might be able to
> > work around this problem by catching the error from a failed restore
> > and retrying.  But if I could reduce the number of failed restore
> > attempts, that would be helpful for me.
> >
> > I'm working with criu from the github master branch.  I produced the
> > error with version:
> > Version: 2.0
> > GitID: v1.8-413-g2fd16c3
> >
> > Unfortunately I don't know how criu works well enough to have good
> > troubleshooting ideas just from looking at this log.  So I thought I'd
> > ask here to see if there are any suggestions, so that I can understand
> > the root cause and what I might be able to change to prevent it.  Any
> > advice is appreciated.
> >
> > Thanks,
> >
> > -Gabriel