[CRIU] Alternative to hacky resume detection

Ross Boucher rboucher at gmail.com
Tue May 12 13:25:48 PDT 2015


I'm not sure I follow. Are you saying that the process that actually calls
restore would get notified? Or that the restored process itself can somehow
access something set by CRIU?

Assuming the former, I don't think that's necessary -- I already know that
I've just restored the process. I could try to send a signal from the
coordinating process and then use that signal to cancel the read thread,
which would be mostly the same thing. But because that would have to travel
through quite a few layers, it seems like it would be better and more
performant to do it from within the restored process itself.

Perhaps I am just misunderstanding your suggestion though.


On Tue, May 12, 2015 at 12:37 PM, Ruslan Kuprieiev <kupruser at gmail.com>
wrote:

>  Hi, Ross
>
> When restoring using RPC or libcriu, the response message contains a
> "restored" field set to true, which helps a process detect that it was
> restored. You say that every time you restore, the connection is broken,
> right? So maybe you could utilize the "restored" flag?
>
> Thanks,
> Ruslan
>
> On 05/12/2015 09:59 PM, Ross Boucher wrote:
>
> In order to get support working in my application, I've resorted to a hack
> that works but is almost certainly not the best way to do things. I'm
> interested if anyone has suggestions for a better way. First, let me
> explain how it works.
>
>  The process I'm checkpointing is a node.js process that opens a socket,
> and waits for a connection on that socket. Once established, the connecting
> process sends code for the node.js process to evaluate, in a loop. The node
> process is checkpointed between every message containing new code to
> evaluate.
>
>  Now, when we restore, it is always a completely new process sending code
> to the node.js process, so the built-in TCP socket restoration won't work.
> We had lots of difficulty figuring out how to detect that the socket
> connection had been broken. Ultimately, the hack we ended up using was to
> simply loop forever on a separate thread checking the time, and noticing if
> an unexplained huge gap in time had occurred. The looping thread looks like
> this:
>
>
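>  #include <pthread.h>
>  #include <stdbool.h>
>  #include <time.h>
>  #include <unistd.h>
>
>  void *canceler(void *threadPointer)
>  {
>      pthread_t thread = *(pthread_t *)threadPointer;
>
>      time_t start, end;
>      time(&start);
>
>      while (true)
>      {
>          usleep(1000);
>          time(&end);
>          double diff = difftime(end, start);
>
>          if (diff > 1.0) {
>              // THIS IS ALMOST CERTAINLY A RESTORE
>              break;
>          }
>
>          // compare against the previous wakeup, not the thread start,
>          // so that only a gap between two wakeups counts
>          start = end;
>      }
>
>      // cancel the read thread
>      pthread_cancel(thread);
>
>      return NULL;
>  }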
>
>
>
>  Elsewhere, in the code that actually does the reading, we spawn this
> thread with a handle to the read thread:
>
>   pthread_create(&cancelThread, NULL, canceler, (void *)&readThread);
>
>
>
>  The rest of our code understands how to deal with a broken connection and
> is able to seamlessly reconnect. This is all working well, but it seems
> like there is probably a better way so I wanted to ask for suggestions. I
> also tried getting things to work with a file based socket rather than a
> TCP socket, but that proved even more difficult (and was far more
> complicated in our architecture anyway, so I'd prefer not to return down
> that path).
>
>  - Ross
>
>  [1] From my other email thread, this video might help illustrate the
> actual process going on, if my description isn't that clear:
>
>  https://www.youtube.com/watch?v=F2L6JLFuFWs&feature=youtu.be
>
>
>
>
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
>
>
>