<div dir="ltr">Thanks again for the explanation about the failed restore. I added some salt to the name in<span style="font-size:12.8px"> transport_name_gen() by adding a random number to the name and that seems to have solved the problem I was having with the socket restore. </span><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">But now I am running into another sporadic problem with the restore. An example of the error is:</span></div><div><span style="font-size:12.8px"><br></span></div><div><div style=""><span style="font-size:12.8px">(00.071472) 1: mnt: Mountpoint 6152 (@./var/lib/docker/0.0/containers/cfa3e36de2c78f0ae2694f62fa173f0a80154827c6bb4c6d8a6b69aafadb9635/criu.work/.criu.cgyard.fdEG2f/systemd) w/o parent 6143</span></div><div style=""><span style="font-size:12.8px">(00.071474) 1: Error (mount.c:366): mnt: No root found for mountpoint 6152 (@./var/lib/docker/0.0/containers/cfa3e36de2c78f0ae2694f62fa173f0a80154827c6bb4c6d8a6b69aafadb9635/criu.work/.criu.cgyard.fdEG2f/systemd)</span></div><div style=""><span style="font-size:12.8px">(00.098587) Error (cr-restore.c:1304): 23529 killed by signal 9</span></div><div style=""><span style="font-size:12.8px">(00.232412) Error (cr-restore.c:2130): Restoring FAILED.</span></div><div style=""><span style="font-size:12.8px"><br></span></div><div style=""><span style="font-size:12.8px">As with the previous error I reported this one only occurs if there are multiple restore operations occurring concurrently. If a restore operation fails, if I wait and repeat it when no other operations are occurring then it succeeds. I've included the log file for a failed restore, along with the same operation completing successfully here: <a href="https://gist.github.com/southerngs/a1bd786fcc4834678cfb">https://gist.github.com/southerngs/a1bd786fcc4834678cfb</a></span></div><div style=""><span style="font-size:12.8px"><br></span></div><div style=""><span style="font-size:12.8px">Any debugging suggestions are appreciated.</span></div><div style=""><span style="font-size:12.8px"><br></span></div><div style=""><span style="font-size:12.8px">Thanks,</span></div><div style=""><span style="font-size:12.8px"><br></span></div><div style="">-Gabriel</div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 29, 2016 at 4:02 AM, Pavel Emelyanov <span dir="ltr"><<a href="mailto:xemul@virtuozzo.com" target="_blank">xemul@virtuozzo.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">On 02/27/2016 10:39 AM, Gabriel Southern wrote:<br>
>> Hi,
>>
>> I'm using CRIU with Docker via the Docker fork https://github.com/boucher/docker.git.
>>
>> Sometimes my attempts to restore a container fail, and when I look in the criu logs I see
>> something like the following error (full restore log available here:
>> https://gist.github.com/southerngs/34d3ce928f35e24e3dbb):
>>
>> (00.401627) 1: Restoring resources
>> (00.401633) 22: Restoring fd 0 (state -> prepare)
>> (00.401644) 22: Create transport fd /crtools-fd-22-0
>> (00.401644) 1: Opening fdinfo-s
>> (00.401652) 1: Restoring fd 0 (state -> prepare)
>> (00.401655) 1: Restoring fd 1 (state -> prepare)
>> (00.401658) 1: Restoring fd 2 (state -> prepare)
>> (00.401660) 1: Restoring fd 0 (state -> create)
>> (00.401663) 22: Error (files.c:840): Can't bind unix socket /crtools-fd-22-0: Address already in use
>> (00.401684) 1: Create fd for 0
>> (00.401687) 1: Wait fdinfo pid=22 fd=0
>> (00.403238) 1: Error (cr-restore.c:1302): 22 exited, status=1
>> (00.459682) Error (cr-restore.c:1304): 6804 killed by signal 9
>> (00.526451) Error (cr-restore.c:2130): Restoring FAILED.
>>
>> This error is not completely deterministic. Usually if the restore attempt fails and I wait
>> and retry the command, it will succeed the second time. The problem only occurs when there
>> is a lot of checkpoint/restore activity going on.
>
> Ah! :) This info is really helpful. You're re-using the netns for all the restores (the
> "NS mask to use 2c020000" line in the log). Can you help us understand the netns trick
> you're playing?
>
> And the bug is that criu creates the unix sockets (that help to restore files) with names
> that are not unique enough. The fix would be to tune the open_transport_fd() routine and
> add some "salt" to the call to transport_name_gen() that generates the unix socket name,
> but w/o a proper understanding of what's going on with the netns we may end up with a
> suboptimal solution.
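
For reference, the change I ended up making was roughly along the lines of the sketch
below. It's a minimal standalone illustration, not the actual criu code: the real
transport_name_gen() has a different interface, and the "/crtools-fd-%d-%d" pattern is
just what I inferred from the "/crtools-fd-22-0" name in the failing log line above.

/* Minimal sketch: make the transport socket name unique across concurrent
 * restores sharing a netns by appending a random "salt" in addition to the
 * pid and fd that are already part of the name. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for criu's transport_name_gen(); the real routine takes
 * different arguments, this buffer-based version is only illustrative. */
static void transport_name_gen(char *buf, size_t len, int pid, int fd)
{
	static int seeded;

	if (!seeded) {
		/* Seed once per process; good enough for illustration. */
		srand(time(NULL) ^ getpid());
		seeded = 1;
	}

	/* Without the trailing random component, two restores in the same
	 * netns can pick the same pid/fd pair and collide on bind(). */
	snprintf(buf, len, "/crtools-fd-%d-%d-%x", pid, fd, (unsigned)rand());
}

int main(void)
{
	char name[64];

	/* pid 22 / fd 0 are the values from the failing log line above. */
	transport_name_gen(name, sizeof(name), 22, 0);
	printf("%s\n", name);	/* e.g. /crtools-fd-22-0-3a2f91bc */
	return 0;
}

Something derived from /dev/urandom or from the restorer's own identity would probably be
better than rand(), but even this crude salt seems to have been enough to make the bind
collisions go away for me.
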
>> But my use case involves restoring a lot of containers simultaneously and letting them run for
>> a short period of time. I might be able to work around this problem by catching an error during
>> a failed restore and then retrying. But if I could reduce the number of failed restore attempts,
>> that would be helpful for me.
>>
>> I'm working with criu from the github master branch. I produced the error with version:
>> Version: 2.0
>> GitID: v1.8-413-g2fd16c3
>>
>> Unfortunately I don't know how criu works well enough to have good troubleshooting ideas just
>> from looking at this log. So I thought I'd ask here to see if there are any suggestions, so I
>> can understand the root cause and what I might be able to change to prevent it. Any advice is
>> appreciated.
>>
>> Thanks,
>>
>> -Gabriel