[CRIU] Signalling processes before CRIU/after unCRIU

Pavel Emelyanov xemul at parallels.com
Wed Oct 10 10:08:17 EDT 2012


On 10/10/2012 06:01 PM, Alex/AT wrote:
> Pavel Emelyanov <xemul at parallels.com> wrote on Wed, 10 Oct
> 2012 17:46:12 +0400:
> 
>> Consider you have not one task, but a process tree. Will you wait for a  
>> timeout on each, or give one timeout for them all to respond?
> Yes, you're right.
> 
> I assume the parent process must be signalled in such a case (and
> suspended as a tree starting from the parent as well). If the software
> knows about CRIU and honors a graceful suspend mechanism, it must have
> ways to communicate the suspension downstream.

When we dump a container, should we make sure systemd knows how to talk
to the rest of the zoo? This sounds ... strange.

> Suspending a child process without suspending the whole tree at once
> makes no sense to me, though I must honestly say I did not give this
> possibility much thought. It will almost surely break parent-child
> communication, and probably cause eviction of the child process from
> the parent's pool. It may even wreak some havoc in the worst cases.
> 
> Also, this means the whole process tree must be execution-suspended at once at the

It's impossible. Tasks are suspended one-by-one.

> time the parent is ready to be suspended, or else it will cause  
> parent-child communication issues in some corner cases.
> 
>> Now crtools is launched again shortly afterwards to try to dump the
>> task again, sends "START SUSPEND" and suddenly receives a "COMPLETE"
>> that was sent previously but got stuck somewhere in the signaling
>> engine. What do we do?
> 
> That's true. Such occurrences can be avoided by using unique IDs in the
> request/response. Also, it must be stated that processes must honor
> _consecutive_ suspend requests without intervening "WELCOME BACK" messages.
> 
> This raises one more problem: what if the process gets "START SUSPEND"
> but never gets "WELCOME BACK" (suspend aborted)? That can be solved by
> library-linking, as you mentioned before: as soon as we invoke the
> "COMPLETE" callback, we get execution-suspended, and return only after we
> are unsuspended or the suspend is aborted. This also removes the need for
> a "WELCOME BACK" message; it is simply implied by the "COMPLETE" callback
> returning. This part of the process becomes completely synchronous.
> 
> I think, though, that the housekeeping start ("START
> SUSPEND"-housekeeping-"COMPLETE") is better done asynchronously. FSM
> worker processes (e.g. nginx) may benefit from that way of doing things
> because they won't need to spawn separate threads or do tricky things to
> complete housekeeping in the "START SUSPEND" callback.

Frankly, I don't want to re-invent TCP for such a simple case. Let's rather
look at what the dbus engine can give us; it should already provide a
reasonable service of that type.

Thanks,
Pavel
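
As an illustration of the unique-ID idea quoted above, here is a minimal
sketch of how a dumper could ignore a stale "COMPLETE" left over from an
earlier, abandoned attempt. Everything in it is an assumption made for the
example: the names new_request_id and wait_for_complete are hypothetical, an
in-process queue stands in for whatever signalling channel would actually be
used, and nothing here reflects an existing crtools interface.

```python
import itertools
import queue
import time

# Hypothetical: each "START SUSPEND" carries a fresh ID from this counter.
_ids = itertools.count(1)


def new_request_id():
    """Return an ID that no earlier suspend request has used."""
    return next(_ids)


def wait_for_complete(inbox, request_id, timeout=1.0):
    """Wait for ("COMPLETE", request_id) on inbox.

    Replies carrying any other ID are treated as stale and dropped.
    Returns True on a matching reply, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            kind, rid = inbox.get(timeout=remaining)
        except queue.Empty:
            return False
        if kind == "COMPLETE" and rid == request_id:
            return True
        # Otherwise: a reply to an older request (or unrelated); ignore it.
```

With this, a "COMPLETE" that was stuck in the channel from the first attempt
does not satisfy the second attempt, because its ID does not match. It does
not address the other concern raised in the thread, namely how many such
exchanges are needed for a whole process tree.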


