[CRIU] Signalling processes before CRIU/after unCRIU

Alex/AT alex at alex-at.ru
Wed Oct 10 10:01:50 EDT 2012


Pavel Emelyanov <xemul at parallels.com> wrote in his message of Wed, 10 Oct
2012 17:46:12 +0400:

> Consider you have not one task, but a process tree. Will you wait for a  
> timeout on each, or give one timeout for them all to respond?
Yes, you're right.

I assume the parent process must be signalled in such a case (and the
tree suspended as a whole, starting from the parent). If the software
knows about CRIU and honors the graceful suspend mechanism, it must have
ways to communicate the suspension downstream.

Suspending a child process without suspending the whole tree at once makes
no sense to me, though I must honestly say I did not give this possibility
much thought. It will almost surely break parent-child communication, and
probably cause the child process to be evicted from the parent's pool. It
may even wreak havoc in the worst cases.

This also means the whole process tree must be execution-suspended at
once, at the moment the parent is ready to be suspended, or else it will
cause parent-child communication issues in some corner cases.
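
A minimal sketch of the "whole tree, starting from the parent" idea. It
assumes a hypothetical pid -> parent pid map has already been gathered
somehow; collect_tree() is an illustrative name and not anything crtools
provides. It only works out which tasks belong to the tree and in what
order (parents before children), so they can be suspended as one unit:

```python
from collections import defaultdict


def collect_tree(root, ppid_of):
    """Return root and all its descendants, each parent before its children.

    ppid_of is a hypothetical {pid: parent_pid} snapshot of the system.
    """
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)

    order, seen, stack = [], set(), [root]
    while stack:
        pid = stack.pop()
        if pid in seen:          # guard against a malformed (cyclic) map
            continue
        seen.add(pid)
        order.append(pid)
        stack.extend(sorted(children[pid], reverse=True))
    return order
```

Tasks outside the tree (other children of init, for example) are never
visited, so they would not be touched by the suspend.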

> Now the crtools are launched again in a short period of time to try to dump
> the task again, send the "START SUSPEND" and suddenly receive "COMPLETE"  
> that was sent previously, but stuck somewhere in the signaling engine.  
> What do we do?

That's true. Such occurrences can be avoided by using unique IDs in the
request/response. It must also be stated that processes must honor
_consecutive_ suspend requests with no "WELCOME BACK" message in between.
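
To illustrate the unique ID idea, here is a rough sketch of the
requesting side. Everything in it (the SuspendRequester class, the tuple
message format, using a random UUID as the ID) is an assumption made for
illustration only. A random ID is used rather than a counter so that a
freshly launched dumper cannot reuse an ID from the previous run:

```python
import uuid


class SuspendRequester:
    """Hypothetical dumper-side bookkeeping; not part of crtools."""

    def __init__(self):
        self.pending = None      # ID of the request we are waiting on

    def start_suspend(self):
        # A new request simply supersedes any earlier, unanswered one.
        self.pending = uuid.uuid4().hex
        return ("START SUSPEND", self.pending)

    def on_message(self, kind, request_id):
        """Return True only for the COMPLETE that answers the pending request."""
        if self.pending is None:
            return False
        if kind != "COMPLETE" or request_id != self.pending:
            return False         # stale or unrelated message: ignore it
        self.pending = None
        return True
```

A "COMPLETE" left over from an earlier attempt carries the old ID and is
dropped instead of being mistaken for the answer to the new request.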

This raises one more problem: what if the process gets "START SUSPEND"
but never gets "WELCOME BACK" (the suspend was aborted)? That can be solved
by library linking, as you mentioned before: when we make the "COMPLETE"
callback, we get execution-suspended, and return only after we are
unsuspended or the suspend is aborted. This also removes the need for a
"WELCOME BACK" message; it is simply implied by the return of the
"COMPLETE" callback. This process becomes completely synchronous.
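
A small sketch of what the blocking "COMPLETE" call could look like from
the application's point of view. SuspendPoint, complete() and release()
are hypothetical names, and a threading.Event merely stands in for the
real freeze/thaw; in practice the task would be stopped from outside
rather than waiting on an in-process event:

```python
import threading


class SuspendPoint:
    """Hypothetical library-side helper for the synchronous COMPLETE call."""

    def __init__(self):
        self._released = threading.Event()
        self.outcome = None

    def complete(self):
        """Called by the application after housekeeping; blocks until released.

        Returning from this call plays the role of "WELCOME BACK".
        """
        self._released.wait()
        return self.outcome

    def release(self, outcome):
        """Called by the checkpoint side: outcome is "resumed" or "aborted"."""
        self.outcome = outcome
        self._released.set()
```

The caller gets the same single return path whether the suspend went
through or was aborted, and can check the outcome if it cares.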

I think, though, that the start of housekeeping ("START
SUSPEND"-housekeeping-"COMPLETE") is better done asynchronously. FSM worker
processes (e.g. nginx) may benefit from doing things that way, because
they won't need to spawn separate threads or do tricky things to complete
housekeeping inside the "START SUSPEND" callback.
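
As a sketch of the asynchronous variant for an event-loop (FSM) worker:
the "START SUSPEND" handler only flips a state and returns, and the
housekeeping is carried out one step per loop iteration until "COMPLETE"
can be sent. The Worker class, its state names and the "sent" list are
all made up for illustration; this is not how nginx or crtools work:

```python
class Worker:
    """Hypothetical FSM worker doing suspend housekeeping in its event loop."""

    def __init__(self, housekeeping_steps):
        self.state = "RUNNING"
        self._todo = list(housekeeping_steps)   # callables, run one per tick
        self.sent = []                          # stands in for outgoing messages

    def on_start_suspend(self):
        # Returns immediately: no threads, no blocking work in the handler.
        if self.state == "RUNNING":
            self.state = "HOUSEKEEPING"

    def tick(self):
        """One iteration of the worker's event loop."""
        if self.state != "HOUSEKEEPING":
            return
        if self._todo:
            self._todo.pop(0)()                 # run one housekeeping step
        if not self._todo:
            self.sent.append("COMPLETE")
            self.state = "READY"
```

The worker keeps servicing its loop between steps, so a long cleanup
does not stall it inside the signal or message handler.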

-- 
Regards,
Alexey Asemov


