[CRIU] Dump failure

Wed Aug 12 07:27:16 PDT 2015

On 08/11/2015 10:02 PM, Francisco Tolmasky wrote:
> So, having looked through our dumps, we believe we have a better idea of what’s going on now. Upon restore, we close the previous socket (which is now invalid since the other end has hung up) and open a new socket. Something like this:
> 
> if (server_fd > 0)
> {
>     ret = close(server_fd);
> }
> 
> …
> 
> server_fd = socket(..., 0);

Some more details here, please :) Is this a PF_INET socket? And what do you do
with it afterwards? Do you call connect()?

> This happens over and over as we restore, checkpoint, restore, checkpoint, etc. However, if 
> I observe the CRIU dumps, I see that each checkpoint has one more socket than before. In 
> other words, the old socket is not being properly cleaned up (in time?) and is thus making
> it into the next checkpoint.

That's ... strange. CRIU collects sockets via FDs only. If the fd is closed,
the socket may stay alive forever, but CRIU won't find it.
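
One way to check this from inside the task is to count the socket fds that are
actually open right before the checkpoint. The sketch below is only an
illustration, not CRIU code: count_socket_fds() and
demo_duplicate_keeps_socket() are hypothetical helper names, it assumes Linux
with /proc mounted, and it uses an AF_UNIX socket purely for the demo. It also
shows one possible (unconfirmed) reason an "old" socket could still be
visible: close() on one fd does not release the socket while a duplicate fd
still refers to it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count the open fds of this process that refer to
 * sockets, by walking /proc/self/fd. Returns -1 if /proc is unavailable. */
int count_socket_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int count = 0;

    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(ent->d_name, &end, 10);
        struct stat st;

        /* Skip "." and ".." (anything that is not a plain number). */
        if (end == ent->d_name || *end != '\0')
            continue;

        /* The fd used by opendir() itself is a directory, so it never
         * matches S_ISSOCK and needs no special handling. */
        if (fstat((int)fd, &st) == 0 && S_ISSOCK(st.st_mode))
            count++;
    }

    closedir(dir);
    return count;
}

/* Illustration only: a duplicated fd keeps the socket visible in the fd
 * table even after the original fd has been closed. */
void demo_duplicate_keeps_socket(void)
{
    int before = count_socket_fds();
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    int d = dup(s);

    assert(before >= 0 && s >= 0 && d >= 0);
    assert(count_socket_fds() == before + 2);  /* two fds, one socket */

    close(s);
    assert(count_socket_fds() == before + 1);  /* still reachable via d */

    close(d);
    assert(count_socket_fds() == before);      /* now really gone */
}
```

Calling count_socket_fds() just before each checkpoint and comparing the
result with the number of sockets in the dump should show whether the extra
sockets are really present in the fd table at that moment.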

> Eventually, a restore will have 4 or so sockets, of which one finally actually becomes 
> invalid(?), and as such the next checkpoint will fail.
> 
> Is there any way for me to *really* clean up the sockets from the last checkpoint,
> e.g. some close_and_really_destroy(server_fd), such that they absolutely don’t make it into the
> next checkpoint?

Are you dumping a single process? If so, can you check what's in the
fdinfo-$pid.img files?

> I’ve put 4 dumps here: https://gist.github.com/tolmasky/dc6f288bfacfdd94fbb2
> 
> They are the 4 checkpoints that then restore and checkpoint, etc. The last one is the failed case.
> You’ll see that the first one has one socket (the one we opened), the second has 2, the third 3, etc.

-- Pavel