<div dir="ltr">Having looked through our dumps, we believe we have a better idea of what’s going on. Upon restore, we close the previous socket (which is now invalid, since the other end has hung up) and open a new one. Something like this:<br><div class="gmail_quote"><div dir="ltr"><div><br></div><div>if (server_fd &gt; 0)</div><div>{</div><div>    ret = close(server_fd);</div><div>}</div><div><br></div><div>…</div><div><br></div><div>server_fd = socket(..., 0);</div><div><br></div><div>This happens over and over as we restore, checkpoint, restore, checkpoint, etc. However, when I inspect the CRIU dumps, I see that each checkpoint has one more socket than the previous one. In other words, the old socket is not being cleaned up (in time?) and so makes it into the next checkpoint. Eventually, a restore ends up with 4 or so sockets, one of which finally becomes invalid(?), and the next checkpoint then fails.</div><div><br></div><div>Is there any way for me to *really* clean up the sockets from the last checkpoint, something like close_and_really_destroy(server_fd), so that they absolutely don’t make it into the next checkpoint?</div><div><br></div><div>I’ve put 4 dumps here: <a href="https://gist.github.com/tolmasky/dc6f288bfacfdd94fbb2" target="_blank">https://gist.github.com/tolmasky/dc6f288bfacfdd94fbb2</a></div><div><br></div><div>They are 4 successive checkpoints, each taken after restoring from the previous one. The last one is the failed case. 
You’ll see that the first one has one socket (the one we opened), the second has 2, the third 3, etc.</div></div><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">On Mon, Aug 10, 2015 at 11:54 AM, Ross Boucher <span dir="ltr">&lt;<a href="mailto:rboucher@gmail.com" target="_blank">rboucher@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">This is the thread I was discussing the issue in.<div><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Pavel Emelyanov</b> <span dir="ltr">&lt;<a href="mailto:xemul@parallels.com" target="_blank">xemul@parallels.com</a>&gt;</span><br>Date: Fri, Jul 17, 2015 at 5:09 AM<br>Subject: Re: [CRIU] Dump failure<br>To: Ross Boucher &lt;<a href="mailto:rboucher@gmail.com" target="_blank">rboucher@gmail.com</a>&gt;<br>Cc: CRIU &lt;<a href="mailto:criu@openvz.org" target="_blank">criu@openvz.org</a>&gt;<br><br><br><span>On 07/17/2015 02:33 AM, Ross Boucher wrote:<br>

&gt; I can&#39;t reproduce it reliably, but it seems to happen about 10% of the time in my setup<br>

&gt; (though, there isn&#39;t a ton of data at this point). I can try to gather some more information<br>

&gt; for you if that would be helpful.<br>

<br>

</span>Yes, please. And I&#39;ll try to think about what kind of debug output would be useful here.<br>

<div><div><br>

&gt; On Thu, Jul 16, 2015 at 2:30 AM, Pavel Emelyanov &lt;<a href="mailto:xemul@parallels.com" target="_blank">xemul@parallels.com</a>&gt; wrote:<br>

&gt;<br>

&gt;     On 07/16/2015 02:56 AM, Ross Boucher wrote:<br>

&gt;     &gt; I got this failure today when checkpointing a container in my system:<br>

&gt;     &gt;<br>

&gt;     &gt; <a href="https://gist.github.com/boucher/ac5ac25c358e5a24665b" rel="noreferrer" target="_blank">https://gist.github.com/boucher/ac5ac25c358e5a24665b</a><br>

&gt;     &gt;<br>

&gt;     &gt; Any idea what the cause might be?<br>

&gt;<br>

&gt;     Yup<br>

&gt;<br>

&gt;     Error (sk-inet.c:188): Name resolved on unconnected socket<br>

&gt;<br>

&gt;     We see reports about this from time to time. The error means that there&#39;s<br>

&gt;     some socket in the system that is owned by a process (via fd), but while<br>

&gt;     getting all the sockets via sock-diag API (in collect_sockets) this particular<br>

&gt;     one was _not_ there. This happens only if the socket is freshly created with<br>

&gt;     socket() call and is not yet bound or connected. The get_unconn_sk() function<br>

&gt;     is called for such sockets -- found via some task&#39;s fd, but not found in diag<br>

&gt;     output. We check that the socket in question is truly unbound and unconnected,<br>

&gt;     but in your case the check fails.<br>

&gt;<br>

&gt;     That&#39;s the best guess we have, but we cannot verify it, since this situation<br>

&gt;     occurs rarely. Do you know how to reproduce it more or less reliably?<br>

&gt;<br>

&gt;     -- Pavel<br>

&gt;<br>

&gt;<br>

<br>

</div></div></div><br></div></div>

</blockquote></div><br><br clear="all"><div><br></div></div></div><span class="HOEnZb"><font color="#888888">-- <br><div>Francisco Tolmasky<br><a href="http://www.tolmasky.com" target="_blank">www.tolmasky.com</a><br><a href="mailto:tolmasky@gmail.com" target="_blank">tolmasky@gmail.com</a></div>

</font></span></div>

</div>

</div>