[CRIU] Restore error

Mon Feb 29 04:02:45 PST 2016

On 02/27/2016 10:39 AM, Gabriel Southern wrote:
> Hi,
> 
> I'm using CRIU with Docker using the Docker fork https://github.com/boucher/docker.git.
> 
> Sometimes my attempts to restore a container fail and when I look in the criu logs I see something like the following error (full restore log available here: https://gist.github.com/southerngs/34d3ce928f35e24e3dbb)
> 
> (00.401627)      1: Restoring resources
> (00.401633)     22: Restoring fd 0 (state -> prepare)
> (00.401644)     22: Create transport fd /crtools-fd-22-0
> (00.401644)      1: Opening fdinfo-s
> (00.401652)      1: Restoring fd 0 (state -> prepare)
> (00.401655)      1: Restoring fd 1 (state -> prepare)
> (00.401658)      1: Restoring fd 2 (state -> prepare)
> (00.401660)      1: Restoring fd 0 (state -> create)
> (00.401663)     22: Error (files.c:840): Can't bind unix socket /crtools-fd-22-0: Address already in use
> (00.401684)      1: Create fd for 0
> (00.401687)      1: Wait fdinfo pid=22 fd=0
> (00.403238)      1: Error (cr-restore.c:1302): 22 exited, status=1
> (00.459682) Error (cr-restore.c:1304): 6804 killed by signal 9
> (00.526451) Error (cr-restore.c:2130): Restoring FAILED.
> 
> This error is not completely deterministic.  Usually, if a restore attempt fails and I wait
> and retry the command, it succeeds the second time.  The problem only occurs when there
> is a lot of checkpoint/restore activity going on.

Ah! :) This info is really helpful. You're re-using the netns for all the restores (the
"NS mask to use 2c020000" line in the log). Can you help us understand what netns
trick you're playing?

And the bug is that criu creates the unix sockets (that help to restore files) with names that
are not unique enough. The fix would be to tune the open_transport_fd() routine and add some "salt"
to the call to transport_name_gen() that generates the unix socket name, but without a proper
understanding of what's going on with the netns we may end up with a suboptimal solution.
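
To illustrate the collision, here is a rough sketch in Python (not criu code). It assumes
the socket name is built only from the task pid and the fd number, which is what the
"/crtools-fd-22-0" name in the log suggests, and that the sockets are abstract unix sockets,
whose names on Linux are scoped to the network namespace. The transport_name() and make_salt()
helpers and the salt format are hypothetical. Since the task that sends the fd presumably has
to compute the same name as the task that binds it, the salt in this sketch is picked once
per restore run and shared by all tasks of that run.

```python
import os


def transport_name(pid: int, fd: int, salt: str = "") -> str:
    """Build a socket name shaped like the one in the log.

    The "/crtools-fd-<pid>-<fd>" shape is copied from the log above;
    the optional salt suffix is hypothetical, not criu's real format.
    """
    base = f"/crtools-fd-{pid}-{fd}"
    return f"{base}-{salt}" if salt else base


def make_salt() -> str:
    """Hypothetical per-restore salt: 8 random bytes, hex-encoded."""
    return os.urandom(8).hex()


# Two concurrent restores that share one netns each have a task with
# pid 22 restoring fd 0, so without salt both compute the same name
# and the second bind() would fail with "Address already in use".
name_a = transport_name(22, 0)
name_b = transport_name(22, 0)
print(name_a, name_b, name_a == name_b)

# With one salt per restore run, the names stay stable inside a run
# (sender and receiver agree) but differ between runs.
salt_a, salt_b = make_salt(), make_salt()
print(transport_name(22, 0, salt_a))
print(transport_name(22, 0, salt_b))
```

Whether a random value, the pid of the restoring criu process, or something else makes the
best salt depends on how the netns is being shared, which is why the question above matters.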

> But my use case involves restoring a lot of containers simultaneously and letting them run for
> a short period of time.  I might be able to work around this problem by catching an error during
> a failed restore and then retrying.  But if I could reduce the number of failed restore attempts
> that would be helpful for me.
> 
> I'm working with criu from the github master branch.  I produced the error with version: 
> Version: 2.0
> GitID: v1.8-413-g2fd16c3
> 
> Unfortunately I don't know how criu works well enough to have good troubleshooting ideas just from
> looking at this log.  So I thought I'd ask here to see if there are any suggestions so I can
> understand the root cause and what I might be able to change to prevent it.  Any advice is 
> appreciated.
> 
> Thanks,
> 
> -Gabriel