[CRIU] Alternative to hacky resume detection
Ruslan Kuprieiev
kupruser at gmail.com
Tue May 12 14:46:39 PDT 2015
Oh, so the whole container is being dumped, and not only that one
process? Hmm, you might be able to just call criu_dump() on the whole
container from within that process, just as I showed you in the code
below (but specify the container's pid), and get the same results.
The return value of 1 from criu_dump() works because criu puts a
proper response packet into the service socket when it restores a
process tree, so everything should work.
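
To make that concrete, here is a minimal sketch of such a dump via
libcriu. The images-directory path and the way you obtain the
container's init pid are placeholders, and error handling is trimmed:

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include "criu.h"                       /* libcriu */

/* Called from inside the container; container_pid is the pid of
 * the container's init, i.e. the root of the tree to dump. */
int dump_whole_container(int container_pid)
{
        int fd, ret;

        if (criu_init_opts() < 0)
                return -1;

        fd = open("/var/lib/criu-images", O_DIRECTORY);
        if (fd < 0)
                return -1;

        criu_set_images_dir_fd(fd);
        criu_set_pid(container_pid);    /* dump the whole tree */
        criu_set_tcp_established(true); /* established TCP sockets */
        criu_set_leave_running(true);   /* keep the container alive */

        ret = criu_dump();
        if (ret == 1) {
                /* we are the restored copy: reconnect here */
        }

        close(fd);
        return ret;
}
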
On 05/13/2015 12:36 AM, Ross Boucher wrote:
> That's an interesting idea. Though, my process is inside a Docker
> container, and I think it would get upset by being restored into a
> different container. I think I need the coordination Docker is doing
> in order for my system to work.
>
> On Tue, May 12, 2015 at 2:27 PM, Ruslan Kuprieiev
> <kupruser at gmail.com> wrote:
>
> I'm saying that you might want to consider calling criu_dump()
> from the process that you are trying to dump. We call it a self
> dump [1]. For example, using criu_dump() from libcriu it might
> look like this:
>
> ...
> while (1) {
>         ret = criu_dump();
>         if (ret < 0) {
>                 /* error */
>         } else if (ret == 0) {
>                 /* dump is ok */
>         } else if (ret == 1) {
>                 /* this process is restored:
>                  * reestablish connection or do whatever needs
>                  * to be done in case of broken connection */
>         }
>         /* accept connection and evaluate code */
> }
> ...
>
> [1] http://criu.org/Self_dump
>
>
>
> On 05/12/2015 11:25 PM, Ross Boucher wrote:
>> I'm not sure I follow. You're saying, the process that actually
>> calls restore would get notified? Or, are you saying that somehow
>> in the restored process I can access something set by criu?
>>
>> Assuming the former, I don't think that's necessary -- I already
>> know that I've just restored the process. I could try to send a
>> signal from the coordinating process and then use that signal to
>> cancel the read thread, which would be mostly the same thing. But
>> because that would have to travel through quite a few layers, it
>> seems like it would be better and more performant to do it from
>> within the restored process itself.
>>
>> Perhaps I am just misunderstanding your suggestion though.
>>
>>
>> On Tue, May 12, 2015 at 12:37 PM, Ruslan Kuprieiev
>> <kupruser at gmail.com> wrote:
>>
>> Hi, Ross
>>
>> When restoring using RPC or libcriu, the response message contains
>> a "restored" field set to true, which helps the process detect
>> that it was restored. You say that every time you restore, the
>> connection is broken, right? So maybe you could utilize the
>> "restored" flag?
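>>
>> For reference, here is a rough sketch of checking that flag over
>> the raw RPC socket with protobuf-c. The message and field names
>> come from criu's rpc.proto; the socket path is whatever you
>> started "criu service" with, and most error handling is left out:
>>
>> #include <stdint.h>
>> #include <string.h>
>> #include <sys/socket.h>
>> #include <sys/un.h>
>> #include <unistd.h>
>> #include "rpc.pb-c.h"        /* generated from criu's rpc.proto */
>>
>> /* Returns 1 if this process is the restored copy, 0 if the dump
>>  * simply succeeded, -1 on error. */
>> int dump_and_check_restored(const char *service_path, int images_dir_fd)
>> {
>>         struct sockaddr_un addr = { .sun_family = AF_UNIX };
>>         uint8_t buf[4096];
>>         int fd, len, ret = -1;
>>
>>         fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
>>         strncpy(addr.sun_path, service_path, sizeof(addr.sun_path) - 1);
>>         if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
>>                 return -1;
>>
>>         CriuReq req = CRIU_REQ__INIT;
>>         CriuOpts opts = CRIU_OPTS__INIT;
>>         opts.images_dir_fd = images_dir_fd;
>>         req.type = CRIU_REQ_TYPE__DUMP;
>>         req.opts = &opts;
>>
>>         len = criu_req__get_packed_size(&req);
>>         criu_req__pack(&req, buf);
>>         send(fd, buf, len, 0);
>>
>>         len = recv(fd, buf, sizeof(buf), 0);
>>         CriuResp *resp = criu_resp__unpack(NULL, len, buf);
>>         if (resp && resp->success)
>>                 /* "restored" is the flag mentioned above */
>>                 ret = (resp->dump && resp->dump->restored) ? 1 : 0;
>>
>>         if (resp)
>>                 criu_resp__free_unpacked(resp, NULL);
>>         close(fd);
>>         return ret;
>> }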
>>
>> Thanks,
>> Ruslan
>>
>> On 05/12/2015 09:59 PM, Ross Boucher wrote:
>>> In order to get support working in my application, I've
>>> resorted to a hack that works but is almost certainly not
>>> the best way to do things. I'm interested if anyone has
>>> suggestions for a better way. First, let me explain how it
>>> works.
>>>
>>> The process I'm checkpointing is a node.js process that
>>> opens a socket and waits for a connection on that socket.
>>> Once the connection is established, the connecting process
>>> sends code for the node.js process to evaluate, in a loop.
>>> The node process is checkpointed between every message
>>> containing new code to evaluate.
>>>
>>> Now, when we restore, it is always a completely new process
>>> sending code to the node.js process, so the built-in TCP
>>> socket restoration won't work. We had lots of difficulty
>>> figuring out how to detect that the socket connection had
>>> been broken. Ultimately, the hack we ended up using was to
>>> simply loop forever on a separate thread checking the time,
>>> and noticing if an unexplained huge gap in time had
>>> occurred. The looping thread looks like this:
>>>
>>>
>>> #include <pthread.h>
>>> #include <stdbool.h>
>>> #include <time.h>
>>> #include <unistd.h>
>>>
>>> void *canceler(void *threadPointer)
>>> {
>>>     pthread_t readThread = *(pthread_t *)threadPointer;
>>>     time_t start, end;
>>>
>>>     while (true)
>>>     {
>>>         time(&start);
>>>         usleep(1000);
>>>         time(&end);
>>>
>>>         /* We only slept ~1ms, so if far more wall-clock time
>>>          * passed between the two samples, the process must have
>>>          * been frozen and restored in between. (start has to be
>>>          * resampled every iteration, or this fires after one
>>>          * second of normal operation.) */
>>>         double diff = difftime(end, start);
>>>
>>>         if (diff > 1.0) {
>>>             // THIS IS ALMOST CERTAINLY A RESTORE
>>>             break;
>>>         }
>>>     }
>>>
>>>     // cancel the read thread stuck on the dead socket
>>>     pthread_cancel(readThread);
>>>
>>>     return NULL;
>>> }
>>>
>>>
>>>
>>> Elsewhere, in the code that actually does the reading, we
>>> spawn this thread with a pointer to the read thread's handle
>>> (canceler dereferences it):
>>>
>>> pthread_create(&cancelThread, NULL, canceler, (void *)&readThread);
>>>
>>>
>>>
>>> The rest of our code understands how to deal with a broken
>>> connection and is able to seamlessly reconnect. This is all
>>> working well, but it seems like there is probably a better
>>> way, so I wanted to ask for suggestions. I also tried getting
>>> things to work with a file-based socket rather than a TCP
>>> socket, but that proved even more difficult (and was far
>>> more complicated in our architecture anyway, so I'd prefer
>>> not to go back down that path).
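>>>
>>> For completeness, the recovery step itself is small once the
>>> stuck read thread has been cancelled. A sketch, assuming the
>>> restored process still holds its original listening socket (the
>>> names here are illustrative, not our real code):
>>>
>>> #include <sys/socket.h>
>>> #include <unistd.h>
>>>
>>> /* Drop the connection that died across the restore and block
>>>  * until the new coordinating process connects. */
>>> int reconnect(int listen_fd, int dead_fd)
>>> {
>>>         close(dead_fd);
>>>         return accept(listen_fd, NULL, NULL); /* fd for the new read loop */
>>> }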
>>>
>>> - Ross
>>>
>>> [1] From my other email thread, this video might help
>>> illustrate the actual process going on, if my description
>>> isn't that clear:
>>>
>>> https://www.youtube.com/watch?v=F2L6JLFuFWs&feature=youtu.be
>>>
>>>
>>>
>>>
>>
>>
>
>