[CRIU] Signalling processes before CRIU/after unCRIU
Alex/AT
alex at alex-at.ru
Wed Oct 10 10:01:50 EDT 2012
Pavel Emelyanov <xemul at parallels.com> wrote in his message of Wed, 10 Oct
2012 17:46:12 +0400:
> Consider you have not one task, but a process tree. Will you wait for a
> timeout on each, or give one timeout for them all to respond?
Yes, you're right.
I assume the parent process must be signalled in such a case (and the
tree suspended as a whole, starting from the parent). If the software
knows about CRIU and honors the graceful suspend mechanism, it must have
a way to communicate the suspension downstream.
Suspending a child process without suspending the whole tree at once
makes no sense to me, though I must honestly say I did not give this
possibility much thought. It will almost surely break parent-child
communication, and probably cause eviction of the child process from the
parent's pool. It may even wreak havoc in the worst cases.
This also means the whole process tree must be execution-suspended at
once, at the moment the parent is ready to be suspended, or else
parent-child communication will break in some corner cases.
> Now the crtools are launched again in a short period of time to try to dump
> the task again, send the "START SUSPEND" and suddenly receive "COMPLETE"
> that was sent previously, but stuck somewhere in the signaling engine.
> What do we do?
That's true. Such occurrences can be avoided by using unique IDs in the
request/response. Also, it must be stated that processes must honor
_consecutive_ suspend requests without "WELCOME BACK" messages in between.
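
To illustrate the unique-ID idea, here is a minimal sketch of the
bookkeeping on the dumping side. The class and method names are
hypothetical and not part of crtools; the only assumption is that every
"START SUSPEND" carries a fresh ID and that "COMPLETE" echoes it back.

```python
import uuid


class SuspendRequester:
    """Hypothetical dumper-side bookkeeping; not an existing crtools API."""

    def __init__(self):
        self.pending_id = None

    def start_suspend(self):
        # A random ID stays unique even if the dumper is relaunched,
        # which a simple in-process counter would not guarantee.
        self.pending_id = uuid.uuid4().hex
        return ("START SUSPEND", self.pending_id)

    def on_message(self, kind, req_id):
        # Accept "COMPLETE" only if it answers the request in flight.
        if kind == "COMPLETE" and req_id == self.pending_id:
            self.pending_id = None
            return True
        # Stale reply from an earlier attempt, or unexpected message.
        return False
```

A "COMPLETE" that was stuck in the signalling engine carries the old ID,
so it is simply dropped instead of being mistaken for the new reply.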
This raises one more problem: what if the process gets "START SUSPEND"
but never gets "WELCOME BACK" (the suspend was aborted)? That can be solved
by library linking, as you mentioned before: when we make the "COMPLETE"
callback, we get execution-suspended, and return only after we are
unsuspended or the suspend is aborted. This also removes the need for the
"WELCOME BACK" message; it is simply implied by the "COMPLETE" callback
returning. This part of the process becomes completely synchronous.
I think, though, that the housekeeping phase ("START
SUSPEND"-housekeeping-"COMPLETE") is better done asynchronously. FSM worker
processes (e.g. nginx) may benefit from this approach because they won't
need to spawn separate threads or do tricky things to complete
housekeeping inside the "START SUSPEND" callback.
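
To show what I mean by asynchronous housekeeping, here is a small sketch
of an event-loop worker. The Worker class, its state names and the tick()
method are hypothetical; the assumption is only that the "START SUSPEND"
handler records the request and returns at once, and the normal loop
iterations finish the housekeeping before "COMPLETE" is sent.

```python
from collections import deque


class Worker:
    """Hypothetical FSM-style worker; not modelled on real nginx code."""

    def __init__(self, housekeeping_steps):
        self.state = "RUNNING"
        self._steps = list(housekeeping_steps)
        self._todo = deque()
        self._req_id = None
        self.sent = []  # messages the worker would send back

    def on_start_suspend(self, req_id):
        # Returns immediately; a repeated request just restarts the phase.
        self.state = "HOUSEKEEPING"
        self._req_id = req_id
        self._todo = deque(self._steps)

    def tick(self):
        # One iteration of the worker's ordinary event loop.
        if self.state != "HOUSEKEEPING":
            return
        if self._todo:
            self._todo.popleft()()  # run one housekeeping step
        if not self._todo:
            self.sent.append(("COMPLETE", self._req_id))
            self.state = "READY"
```

The handler never blocks, so the worker needs no extra thread: the
housekeeping is just another state in its state machine.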
--
Regards,
Alexey Asemov