[CRIU] Alternative to hacky resume detection

Ross Boucher rboucher at gmail.com
Tue May 12 14:48:55 PDT 2015


The container is only running the one process, but I have pools of
identical containers and checkpoint/restore across them unpredictably -- so
underlying things like mount points and file descriptors would change,
which is what I'm using Docker to manage.

On Tue, May 12, 2015 at 2:46 PM, Ruslan Kuprieiev <kupruser at gmail.com>
wrote:

>  Oh, so the whole container is being dumped, and not only that one process?
> Hm, you might be able to just call criu_dump() on the whole container
> from within that process, just as I showed in the code below (but specify
> the container's pid), and get the same result. The way the return value of
> 1 from criu_dump() works is that criu puts a proper response packet into
> the service socket when restoring a process tree, so everything should
> work.
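>
> A minimal sketch of that self dump of the whole container via libcriu
> (the service socket address and images path here are hypothetical, and
> error handling is elided):
>
> #include <fcntl.h>
> #include <stdbool.h>
> #include "criu.h"
>
> int dump_whole_container(int container_pid)
> {
>     int fd, ret;
>
>     criu_init_opts();
>     criu_set_service_address("/var/run/criu_service.socket");
>
>     fd = open("/tmp/dump-images", O_DIRECTORY);
>     criu_set_images_dir_fd(fd);
>
>     /* dump the container's whole tree, not just the calling process */
>     criu_set_pid(container_pid);
>     criu_set_tcp_established(true);
>
>     ret = criu_dump();
>     if (ret == 1) {
>         /* we are running inside the restored tree */
>     }
>     return ret;
> }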
>
>
> On 05/13/2015 12:36 AM, Ross Boucher wrote:
>
> That's an interesting idea. Though, my process is inside a Docker
> container, and I think it would get upset by being restored into a
> different container. I think I need the coordination Docker is doing in
> order for my system to work.
>
> On Tue, May 12, 2015 at 2:27 PM, Ruslan Kuprieiev <kupruser at gmail.com>
> wrote:
>
>>  I'm saying that you might want to consider calling criu_dump() from the
>> process that you are trying to dump. We call it a self dump [1]. For
>> example, using criu_dump() from libcriu, it might look like:
>>
>> ...
>> while (1) {
>>     ret = criu_dump();
>>     if (ret < 0) {
>>         /* dump request failed */
>>     } else if (ret == 0) {
>>         /* dump is ok; we are still the original process */
>>     } else if (ret == 1) {
>>         /* this process was just restored:
>>          * reestablish the connection or do whatever needs to be done
>>          * in case of a broken connection */
>>     }
>>     /* accept a connection and evaluate code */
>> }
>> ...
>>
>> [1] http://criu.org/Self_dump
>>
>>
>>
>> On 05/12/2015 11:25 PM, Ross Boucher wrote:
>>
>> I'm not sure I follow. You're saying the process that actually calls
>> restore would get notified? Or are you saying that somehow, in the
>> restored process, I can access something set by criu?
>>
>>  Assuming the former, I don't think that's necessary -- I already know
>> that I've just restored the process. I could try to send a signal from the
>> coordinating process and then use that signal to cancel the read thread,
>> which would be mostly the same thing. But because that would have to travel
>> through quite a few layers, it seems like it would be better and more
>> performant to do it from within the restored process itself.
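>>
>> For concreteness, the signal route would look roughly like this (just a
>> sketch, assuming we dedicate SIGUSR1 to it; since pthread_cancel() is not
>> async-signal-safe, the handler only sets a flag that a watcher thread
>> polls):
>>
>> #include <pthread.h>
>> #include <signal.h>
>> #include <unistd.h>
>>
>> static volatile sig_atomic_t restored;
>>
>> static void on_restore(int sig)
>> {
>>     (void)sig;
>>     restored = 1;  /* only set a flag in the handler */
>> }
>>
>> /* installed before checkpointing: signal(SIGUSR1, on_restore);
>>  * the coordinator would send SIGUSR1 right after criu restore */
>> static void *watcher(void *readThreadPtr)
>> {
>>     pthread_t read_thread = *(pthread_t *)readThreadPtr;
>>
>>     while (!restored)
>>         usleep(1000);
>>     pthread_cancel(read_thread);
>>     return NULL;
>> }
>>
>> But again, the signal would still have to travel through all those layers
>> first.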
>>
>>  Perhaps I am just misunderstanding your suggestion though.
>>
>>
>> On Tue, May 12, 2015 at 12:37 PM, Ruslan Kuprieiev <kupruser at gmail.com>
>> wrote:
>>
>>>  Hi, Ross
>>>
>>> When restoring using RPC or libcriu, the response message contains a
>>> "restored" field set to true, which lets a process detect that it was
>>> restored. You say that every time you restore, the connection is broken,
>>> right? So maybe you could utilize that "restored" flag?
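>>>
>>> If you ever want to drive the RPC yourself instead of going through
>>> libcriu, here is a rough sketch of what it does under the hood (type and
>>> field names come from criu's rpc.proto compiled with protobuf-c; the
>>> socket path is whatever "criu service" was started with, and error
>>> handling is elided):
>>>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <sys/socket.h>
>>> #include <sys/un.h>
>>> #include <unistd.h>
>>> #include "rpc.pb-c.h"
>>>
>>> int dump_over_rpc(const char *sk_path, int images_dir_fd)
>>> {
>>>     struct sockaddr_un addr = { .sun_family = AF_UNIX };
>>>     CriuReq req = CRIU_REQ__INIT;
>>>     CriuOpts opts = CRIU_OPTS__INIT;
>>>     CriuResp *resp;
>>>     unsigned char rbuf[4096];
>>>     void *sbuf;
>>>     size_t len;
>>>     int fd, n, restored = 0;
>>>
>>>     req.type = CRIU_REQ_TYPE__DUMP;
>>>     req.opts = &opts;
>>>     opts.images_dir_fd = images_dir_fd;
>>>
>>>     fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
>>>     strncpy(addr.sun_path, sk_path, sizeof(addr.sun_path) - 1);
>>>     connect(fd, (struct sockaddr *)&addr, sizeof(addr));
>>>
>>>     len = criu_req__get_packed_size(&req);
>>>     sbuf = malloc(len);
>>>     criu_req__pack(&req, sbuf);
>>>     send(fd, sbuf, len, 0);
>>>
>>>     /* In the original process this recv() returns the normal dump
>>>      * reply; in the restored process criu has queued a reply with
>>>      * dump->restored == true on the re-created service socket. */
>>>     n = recv(fd, rbuf, sizeof(rbuf), 0);
>>>     resp = criu_resp__unpack(NULL, n, rbuf);
>>>     if (resp && resp->success && resp->dump && resp->dump->has_restored)
>>>         restored = resp->dump->restored;
>>>
>>>     if (resp)
>>>         criu_resp__free_unpacked(resp, NULL);
>>>     free(sbuf);
>>>     close(fd);
>>>     return restored;
>>> }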
>>>
>>> Thanks,
>>> Ruslan
>>>
>>> On 05/12/2015 09:59 PM, Ross Boucher wrote:
>>>
>>>  In order to get checkpoint/restore working in my application, I've
>>> resorted to a hack that works but is almost certainly not the best way to
>>> do things. I'm interested to hear if anyone has suggestions for a better
>>> way. First, let me explain how it works.
>>>
>>>  The process I'm checkpointing is a node.js process that opens a
>>> socket and waits for a connection on that socket. Once the connection is
>>> established, the connecting process sends code for the node.js process to
>>> evaluate, in a loop. The node process is checkpointed after every message
>>> containing new code to evaluate.
>>>
>>>  Now, when we restore, it is always a completely new process sending
>>> code to the node.js process, so the built-in TCP socket restoration won't
>>> work. We had a lot of difficulty figuring out how to detect that the
>>> socket connection had been broken. Ultimately, the hack we ended up using
>>> was to simply loop forever on a separate thread checking the time and
>>> watching for an unexplained huge gap. The looping thread looks like this:
>>>
>>>
>>> #include <pthread.h>
>>> #include <stdbool.h>
>>> #include <time.h>
>>> #include <unistd.h>
>>>
>>> void *canceler(void *threadPointer)
>>> {
>>>     pthread_t thread = *(pthread_t *)threadPointer;
>>>     time_t start, end;
>>>
>>>     while (true)
>>>     {
>>>         time(&start);
>>>         usleep(1000);
>>>         time(&end);
>>>
>>>         /* we only slept ~1ms, so a jump of more than a second
>>>          * means the clock moved while we were frozen */
>>>         if (difftime(end, start) > 1.0) {
>>>             // THIS IS ALMOST CERTAINLY A RESTORE
>>>             break;
>>>         }
>>>     }
>>>
>>>     // cancel the read thread
>>>     pthread_cancel(thread);
>>>
>>>     return NULL;
>>> }
>>>
>>>
>>>
>>>  Elsewhere, in the code that actually does the reading, we spawn this
>>> thread with a pointer to the read thread's handle:
>>>
>>>     pthread_create(&cancelThread, NULL, canceler, (void *)&readThread);
>>>
>>>
>>>
>>>  The rest of our code understands how to deal with a broken connection
>>> and is able to seamlessly reconnect. This is all working well, but it seems
>>> like there is probably a better way, so I wanted to ask for suggestions. I
>>> also tried getting things to work with a file-based socket rather than a
>>> TCP socket, but that proved even more difficult (and was far more
>>> complicated in our architecture anyway, so I'd prefer not to go back down
>>> that path).
>>>
>>>  - Ross
>>>
>>>  [1] From my other email thread, this video might help illustrate the
>>> actual process going on, if my description isn't that clear:
>>>
>>>  https://www.youtube.com/watch?v=F2L6JLFuFWs&feature=youtu.be