[CRIU] Dump failure
Pavel Emelyanov
xemul at parallels.com
Wed Aug 12 07:27:16 PDT 2015
On 08/11/2015 10:02 PM, Francisco Tolmasky wrote:
> So, having looked through our dumps, we believe we have a better idea of what’s going on now. Upon restore, we close the previous socket (which is now invalid since the other end has hung up) and open a new socket. Something like this:
>
> if (server_fd > 0)
> {
>     ret = close(server_fd);
> }
>
> …
>
> server_fd = socket(..., 0);
Some more details here, please :) Is this a PF_INET socket? And what do you do
with it afterwards? Do you call connect()?
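
For reference, a minimal sketch of the close-and-reopen pattern being asked about. The quoted snippet elides the socket() arguments and what happens next, so everything here is an assumption for illustration: a PF_INET/SOCK_STREAM socket, a connect() call afterwards, and the helper name reopen_server_socket(), which is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: close the stale descriptor (if any) and return a
 * freshly connected TCP socket, or -1 on failure.
 * Assumes PF_INET + SOCK_STREAM + connect(); the original post does not say.
 */
int reopen_server_socket(int old_fd, const struct sockaddr_in *addr)
{
    assert(addr != NULL);

    /* 0 is a valid descriptor, so test for >= 0 rather than > 0. */
    if (old_fd >= 0)
        close(old_fd);

    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        close(fd);  /* do not leak the new descriptor on failure */
        return -1;
    }
    return fd;
}
```

Called as server_fd = reopen_server_socket(server_fd, &addr), this leaves at most one descriptor open per cycle, provided nothing else holds a duplicate of the old one.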
> This happens over and over as we restore, checkpoint, restore, checkpoint, etc. However, if
> I observe the CRIU dumps, I see that each checkpoint has one more socket than before. In
> other words, the old socket is not being properly cleaned up (in time?) and is thus making
> it into the next checkpoint.
That's ... strange. CRIU collects sockets via FDs only. If you have closed an fd,
the socket may stay alive forever, but CRIU won't find it.
> Eventually, a restore will have 4 or so sockets, of which one finally actually becomes
> invalid(?), and as such the next checkpoint will fail.
>
> Is there any way for me to *really* clean up the sockets from the last checkpoint
> (something like close_and_really_destroy(server_fd)), such that they absolutely don’t
> make it into the next checkpoint?
Are you dumping a single process? If so, can you check what's in the
fdinfo-$pid.img files?
> I’ve put 4 dumps here: https://gist.github.com/tolmasky/dc6f288bfacfdd94fbb2
>
> They are the 4 checkpoints that then restore and checkpoint, etc. The last one is the failed case.
> You’ll see that the first one has one socket (the one we opened), the second has 2, the third 3, etc.
-- Pavel