[CRIU] Fwd: Dump failure

Francisco Tolmasky tolmasky at gmail.com
Tue Aug 11 12:57:47 PDT 2015


So, having looked through our dumps, we believe we have a better idea of
what’s going on now. Upon restore, we close the previous socket (which is
now invalid since the other end has hung up) and open a new socket.
Something like this:

if (server_fd > 0)
{
    ret = close(server_fd);    /* drop the stale descriptor */
}

…

server_fd = socket(..., 0);    /* open a fresh socket to reconnect */

This happens over and over as we restore, checkpoint, restore, checkpoint,
etc. However, when I inspect the CRIU dumps, I see that each checkpoint has
one more socket than the one before. In other words, the old socket is not
being properly cleaned up (in time?) and is thus making it into the next
checkpoint. Eventually, a restore ends up with 4 or so sockets, one of which
finally becomes invalid(?), and the next checkpoint then fails.

Is there any way for me to *really* clean up the sockets from the last
checkpoint (something like a close_and_really_destroy(server_fd)) so that
they absolutely don’t make it into the next checkpoint?
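
For concreteness, a rough sketch of the kind of teardown I have in mind.
The shutdown() call and the zero-timeout SO_LINGER are just my guesses at
what might force the kernel to drop the old socket before the next dump,
not something I know CRIU actually honors:

#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical close_and_really_destroy(): try to force an immediate
 * teardown so the stale socket (hopefully) does not survive into the
 * next checkpoint. Whether this helps at all is exactly my question. */
static void close_and_really_destroy(int fd)
{
    struct linger lin = { .l_onoff = 1, .l_linger = 0 };  /* RST instead of FIN */

    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
    shutdown(fd, SHUT_RDWR);   /* stop both directions */
    close(fd);                 /* release the descriptor */
}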

I’ve put 4 dumps here: https://gist.github.com/tolmasky/dc6f288bfacfdd94fbb2

They are the 4 checkpoints that then restore and checkpoint, etc.; the last
one is the failing case. You’ll see that the first has one socket (the one
we opened), the second has 2, the third 3, and so on.

On Mon, Aug 10, 2015 at 11:54 AM, Ross Boucher <rboucher at gmail.com> wrote:

> This is the thread I was discussing the issue in.
>
> ---------- Forwarded message ----------
> From: Pavel Emelyanov <xemul at parallels.com>
> Date: Fri, Jul 17, 2015 at 5:09 AM
> Subject: Re: [CRIU] Dump failure
> To: Ross Boucher <rboucher at gmail.com>
> Cc: CRIU <criu at openvz.org>
>
>
> On 07/17/2015 02:33 AM, Ross Boucher wrote:
> > I can't reproduce it reliably, but it seems to happen about 10% of the
> > time in my setup (though, there isn't a ton of data at this point). I
> > can try to gather some more information for you if that would be helpful.
>
> Yes, please. And I'll try to think what kind of debug can be useful for it.
>
> > On Thu, Jul 16, 2015 at 2:30 AM, Pavel Emelyanov <xemul at parallels.com>
> > wrote:
> >
> >     On 07/16/2015 02:56 AM, Ross Boucher wrote:
> >     > I got this failure today when checkpointing a container in my
> >     > system:
> >     >
> >     > https://gist.github.com/boucher/ac5ac25c358e5a24665b
> >     >
> >     > Any idea what the cause might be?
> >
> >     Yup
> >
> >     Error (sk-inet.c:188): Name resolved on unconnected socket
> >
> >     We see reports about this from time to time. The error means that
> >     there's some socket in the system that is owned by a process (via an
> >     fd), but while getting all the sockets via the sock-diag API (in
> >     collect_sockets) this particular one was _not_ there. This happens
> >     only if the socket is freshly created with a socket() call and is
> >     not yet bound or connected. The get_unconn_sk() function is called
> >     for such sockets -- found via some task's fd, but not found in the
> >     diag output. We check that the socket in question is truly unbound
> >     and unconnected, but in your case the check fails.
> >
> >     That's the best guess we have, but we cannot check it, since this
> >     situation occurs rarely. Do you know how to reproduce it more or
> >     less reliably?
> >
> >     -- Pavel
> >
> >
>
>
>
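
To make sure I’m reading Pavel’s explanation right, here is a tiny
illustration of the window he describes. The address, port, and the
comments are my own, not anything taken from CRIU itself:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    /* Between socket() and bind()/connect(), the fd already sits in the
     * task's fd table but may not yet appear in the sock-diag listing
     * that CRIU collects. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* A dump racing in right here should go through get_unconn_sk() and
     * pass its "truly unbound and unconnected" check; in the failing
     * dumps that check apparently does not hold. */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);    /* made-up port */

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return 0;
}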


-- 
Francisco Tolmasky
www.tolmasky.com
tolmasky at gmail.com


