[CRIU] Alternative to hacky resume detection

Ruslan Kuprieiev kupruser at gmail.com
Tue May 12 14:46:39 PDT 2015


Oh, so the whole container is being dumped, not only that one process?
Hm, you might be able to just call criu_dump() on the whole container
from within that process, just as I showed you in the code below (but
specify the container's pid), and get the same result. The way the
return value of 1 from criu_dump() works is that criu puts a proper
response packet into the service socket when restoring a process tree,
so everything should still work.
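
For reference, a rough libcriu sketch of that (untested; the service
socket path, images directory, and container init pid below are
placeholders to adjust for your setup):

    #include <fcntl.h>
    #include <stdbool.h>
    #include <criu/criu.h>

    int dump_whole_container(int container_init_pid)
    {
        int fd, ret;

        criu_init_opts();
        /* placeholder paths -- adjust to your setup */
        criu_set_service_address("/var/run/criu-service.socket");
        fd = open("/imgs", O_DIRECTORY);
        criu_set_images_dir_fd(fd);

        /* dump the container's init instead of ourselves */
        criu_set_pid(container_init_pid);
        criu_set_tcp_established(true);

        ret = criu_dump();
        if (ret < 0)
            return ret; /* the dump request failed */
        if (ret == 1) {
            /* we are running after a restore: criu answers over
             * the same service socket, so the return value still
             * tells the restored copy apart from the original */
        }
        return 0;
    }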

On 05/13/2015 12:36 AM, Ross Boucher wrote:
> That's an interesting idea. Though, my process is inside of a docker 
> container, and I think it would get upset by being restored into a 
> different container. I think I need the coordination docker is doing 
> in order for my system to work.
>
> On Tue, May 12, 2015 at 2:27 PM, Ruslan Kuprieiev <kupruser at gmail.com 
> <mailto:kupruser at gmail.com>> wrote:
>
>     I'm saying that you might want to consider calling criu_dump()
>     from the process that you are
>     trying to dump. We call it a self dump [1]. For example, using
>     criu_dump() from libcriu, it might look like:
>
>     ...
>     while (1) {
>         ret = criu_dump();
>         if (ret < 0) {
>             /* the dump request failed */
>         } else if (ret == 0) {
>             /* dump succeeded; we are the original process */
>         } else if (ret == 1) {
>             /* we are the restored copy: re-establish the
>              * connection, or do whatever else is needed to
>              * recover from the broken connection */
>         }
>         /* accept a connection and evaluate code */
>     }
>     ...
>
>     [1] http://criu.org/Self_dump
>
>
>
>     On 05/12/2015 11:25 PM, Ross Boucher wrote:
>>     I'm not sure I follow. You're saying, the process that actually
>>     calls restore would get notified? Or, are you saying that somehow
>>     in the restored process I can access something set by criu?
>>
>>     Assuming the former, I don't think that's necessary -- I already
>>     know that I've just restored the process. I could try to send a
>>     signal from the coordinating process and then use that signal to
>>     cancel the read thread, which would be mostly the same thing. But
>>     because that would have to travel through quite a few layers, it
>>     seems like it would be better and more performant to do it from
>>     within the restored process itself.
>>
>>     Perhaps I am just misunderstanding your suggestion though.
>>
>>
>>     On Tue, May 12, 2015 at 12:37 PM, Ruslan Kuprieiev
>>     <kupruser at gmail.com <mailto:kupruser at gmail.com>> wrote:
>>
>>         Hi, Ross
>>
>>         When restoring via RPC or libcriu, the response message
>>         contains a "restored" field set to true, which lets the
>>         process detect that it was restored. You say that every time
>>         you restore, the connection is broken, right? So maybe you
>>         could use that "restored" flag?
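>>
>>         For the raw RPC path, a rough sketch of the same check
>>         (untested; it uses the protobuf-c bindings generated from
>>         criu's rpc.proto, the service socket path is a placeholder,
>>         and error handling is omitted):
>>
>>             #include <string.h>
>>             #include <unistd.h>
>>             #include <sys/socket.h>
>>             #include <sys/un.h>
>>             #include "rpc.pb-c.h" /* generated from criu's rpc.proto */
>>
>>             /* Ask criu to dump us; returns 1 if this process tree
>>              * was just restored rather than freshly dumped. */
>>             int dump_and_check_restored(int img_dir_fd)
>>             {
>>                 struct sockaddr_un addr = { .sun_family = AF_UNIX };
>>                 unsigned char buf[4096];
>>                 ssize_t rlen;
>>                 size_t len;
>>                 int fd, restored;
>>
>>                 CriuReq req = CRIU_REQ__INIT;
>>                 CriuOpts opts = CRIU_OPTS__INIT;
>>
>>                 req.type = CRIU_REQ_TYPE__DUMP;
>>                 req.opts = &opts;
>>                 opts.images_dir_fd = img_dir_fd;
>>
>>                 /* placeholder path -- adjust to your setup */
>>                 strncpy(addr.sun_path, "/var/run/criu-service.socket",
>>                         sizeof(addr.sun_path) - 1);
>>                 fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
>>                 connect(fd, (struct sockaddr *)&addr, sizeof(addr));
>>
>>                 len = criu_req__get_packed_size(&req);
>>                 criu_req__pack(&req, buf);
>>                 send(fd, buf, len, 0);
>>
>>                 rlen = recv(fd, buf, sizeof(buf), 0);
>>                 CriuResp *resp = criu_resp__unpack(NULL, rlen, buf);
>>
>>                 /* the dump response carries the "restored" flag */
>>                 restored = resp->dump && resp->dump->has_restored &&
>>                            resp->dump->restored;
>>
>>                 criu_resp__free_unpacked(resp, NULL);
>>                 close(fd);
>>                 return restored;
>>             }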
>>
>>         Thanks,
>>         Ruslan
>>
>>         On 05/12/2015 09:59 PM, Ross Boucher wrote:
>>>         In order to get checkpoint/restore support working in my
>>>         application, I've resorted to a hack that works but is
>>>         almost certainly not the best way to do things. I'm
>>>         interested if anyone has suggestions for a better way.
>>>         First, let me explain how it works.
>>>
>>>         The process I'm checkpointing is a node.js process that
>>>         opens a socket, and waits for a connection on that socket.
>>>         Once established, the connecting process sends code for the
>>>         node.js process to evaluate, in a loop. The node process is
>>>         checkpointed between every message containing new code to
>>>         evaluate.
>>>
>>>         Now, when we restore, it is always a completely new process
>>>         sending code to the node.js process, so the built-in TCP
>>>         socket restoration won't work. We had a lot of difficulty
>>>         figuring out how to detect that the socket connection had
>>>         been broken. Ultimately, the hack we ended up using was to
>>>         simply loop forever on a separate thread checking the time,
>>>         and to notice when an unexplained, huge gap in time had
>>>         occurred. The looping thread looks like this:
>>>
>>>
>>>             /* Watch for a sudden jump in wall-clock time between
>>>              * two 1 ms sleeps; across a checkpoint/restore that
>>>              * gap is far larger than the sleep interval. */
>>>             void * canceler(void * threadPointer)
>>>             {
>>>                 pthread_t thread = *(pthread_t *)threadPointer;
>>>
>>>                 time_t start, end;
>>>                 time(&start);
>>>
>>>                 while (true)
>>>                 {
>>>                     usleep(1000);
>>>                     time(&end);
>>>                     double diff = difftime(end, start);
>>>
>>>                     if (diff > 1.0) {
>>>                         /* a 1 ms sleep appeared to take over a
>>>                          * second: THIS IS ALMOST CERTAINLY A
>>>                          * RESTORE */
>>>                         break;
>>>                     }
>>>
>>>                     /* compare successive wakeups, not the thread's
>>>                      * total lifetime, or this fires after one
>>>                      * second of normal operation */
>>>                     start = end;
>>>                 }
>>>
>>>                 /* cancel the read thread blocked on the dead socket */
>>>                 pthread_cancel(thread);
>>>
>>>                 return NULL;
>>>             }
>>>
>>>
>>>
>>>         Elsewhere, in the code that actually does the reading, we
>>>         spawn this thread with a pointer to the read thread's handle
>>>         (matching the cast inside canceler):
>>>
>>>             pthread_create(&cancelThread, NULL, canceler,
>>>                            (void *)&readThread);
>>>
>>>
>>>
>>>         The rest of our code understands how to deal with a broken
>>>         connection and is able to seamlessly reconnect. This is all
>>>         working well, but it seems like there is probably a better
>>>         way, so I wanted to ask for suggestions. I also tried getting
>>>         things to work with a file-based socket rather than a TCP
>>>         socket, but that proved even more difficult (and was far
>>>         more complicated in our architecture anyway, so I'd prefer
>>>         not to go back down that path).
>>>
>>>         - Ross
>>>
>>>         [1] From my other email thread, this video might help
>>>         illustrate the actual process going on, if my description
>>>         isn't that clear:
>>>
>>>         https://www.youtube.com/watch?v=F2L6JLFuFWs&feature=youtu.be
>>>
>>
>>
>
>
