[CRIU] Alternative to hacky resume detection

Ross Boucher rboucher at gmail.com
Tue May 12 11:59:24 PDT 2015


To get CRIU support working in my application, I've resorted to a hack
that works but is almost certainly not the best way to do things. I'd be
interested in any suggestions for a better approach. First, let me
explain how it works.

The process I'm checkpointing is a node.js process that opens a socket and
waits for a connection on it. Once the connection is established, the
connecting process repeatedly sends code for the node.js process to
evaluate. The node.js process is checkpointed between each message
containing new code and the next.

Now, when we restore, it is always a completely new process sending code to
the node.js process, so the built-in TCP socket restoration won't work. We
had a lot of difficulty figuring out how to detect that the socket
connection had been broken. Ultimately, the hack we ended up using was to
loop forever on a separate thread, checking the wall-clock time and
noticing when an unexplained, large gap in time has occurred. The looping
thread looks like this:


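#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

void *canceler(void *threadPointer)
{
    // threadPointer must point at the pthread_t of the read thread
    pthread_t thread = *(pthread_t *)threadPointer;

    time_t start, end;
    time(&start);

    while (true)
    {
        usleep(1000);
        time(&end);
        double diff = difftime(end, start);

        if (diff > 1.0) {
            // THIS IS ALMOST CERTAINLY A RESTORE
            break;
        }

        // compare consecutive samples, not the time since thread start
        start = end;
    }

    // cancel the read thread
    pthread_cancel(thread);

    return NULL;
}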
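
Since time() only has one-second resolution, a finer-grained variant of the
same idea could look something like the following. This is only a sketch
under assumptions: it assumes clock_gettime(CLOCK_REALTIME) is available and
that the wall clock jumps forward across a checkpoint/restore. The names
elapsed_seconds, is_probable_restore, wait_for_time_gap and
GAP_THRESHOLD_SEC are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold: gaps longer than this are treated as a restore. */
#define GAP_THRESHOLD_SEC 1.0

/* Seconds elapsed between two timespec samples (b - a). */
static double elapsed_seconds(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* True if the gap between two consecutive samples exceeds the threshold. */
static bool is_probable_restore(struct timespec prev, struct timespec now)
{
    return elapsed_seconds(prev, now) > GAP_THRESHOLD_SEC;
}

/* Poll the wall clock every millisecond; return once a large gap is seen. */
static void wait_for_time_gap(void)
{
    struct timespec prev, now;
    const struct timespec pause = { 0, 1000000L }; /* 1 ms */

    clock_gettime(CLOCK_REALTIME, &prev);
    for (;;) {
        nanosleep(&pause, NULL);
        clock_gettime(CLOCK_REALTIME, &now);
        if (is_probable_restore(prev, now))
            return;
        prev = now; /* only ever compare consecutive samples */
    }
}
```

The watchdog thread would call wait_for_time_gap() and then cancel the read
thread as before. Like the original, this can misfire if the wall clock is
changed for some other reason, or if the process is stalled for longer than
the threshold.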



Elsewhere, in the code that actually does the reading, we spawn this thread
with a handle to the read thread:

// pass the address of the handle, since canceler() dereferences its argument
pthread_create(&cancelThread, NULL, canceler, &readThread);
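
To illustrate why cancelling works here at all: pthread_cancel() with the
default (deferred) cancellation type only takes effect when the target
thread reaches a cancellation point, and a blocking read() is one. The
following is a minimal standalone sketch, not our actual code; it uses a
pipe instead of a TCP socket, and the names reader and
cancel_blocked_reader are hypothetical.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for the read thread: blocks in read() on a descriptor. */
static void *reader(void *arg)
{
    int fd = *(int *)arg;
    char buf[256];

    /* read() is a cancellation point, so pthread_cancel() can interrupt it. */
    while (read(fd, buf, sizeof buf) > 0)
        ;
    return NULL;
}

/*
 * Start a reader on an idle pipe, cancel it, and check that it really was
 * cancelled. Returns 0 on success, -1 otherwise.
 */
static int cancel_blocked_reader(void)
{
    int fds[2];
    pthread_t readThread;
    void *status = NULL;
    int ok;

    if (pipe(fds) != 0)
        return -1;

    if (pthread_create(&readThread, NULL, reader, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Nothing is ever written to the pipe, so the reader stays blocked. */
    ok = pthread_cancel(readThread) == 0
      && pthread_join(readThread, &status) == 0
      && status == PTHREAD_CANCELED;

    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

A cancelled thread does not run the code after its blocking call, so any
cleanup it needs has to be done by whoever joins it or via
pthread_cleanup_push() handlers.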



The rest of our code understands how to deal with a broken connection and
is able to reconnect seamlessly. This is all working well, but it seems
likely that there is a better way, so I wanted to ask for suggestions. I
also tried getting things to work with a file-based socket rather than a
TCP socket, but that proved even more difficult (and was far more
complicated in our architecture anyway, so I'd prefer not to go back down
that path).

- Ross

[1] From my other email thread, this video might help illustrate the actual
process going on, if my description isn't that clear:

https://www.youtube.com/watch?v=F2L6JLFuFWs&feature=youtu.be