[CRIU] About re-parenting

Pavel Emelyanov xemul at parallels.com
Tue Mar 25 06:25:48 PDT 2014


On 03/25/2014 05:08 PM, Adrian Reber wrote:
> On Tue, Mar 25, 2014 at 01:54:53AM +0400, Pavel Emelyanov wrote:
>> On 03/25/2014 01:21 AM, Adrian Reber wrote:
>>> Thanks so far for all your help. I am still unsure how to handle
>>> restarting in Open MPI with regard to stdin/stdout redirection and
>>> re-parenting. This is partly due to the fact that I still do not
>>> understand it completely: there are too many overly complex components
>>> (Open MPI and criu). I think the stdin/stdout problem can be solved but
>>> I am not sure how the re-parenting can/must/should work.
>>>
>>> Open MPI has orte-restart which analyzes the metadata from the previous
>>> checkpoint and then starts mpirun with the correct parameters. mpirun
>>> starts the corresponding number of child processes by starting multiple
>>> copies of opal-restart. opal-restart then tries to restart the
>>> checkpointed process using CRIU. What it expects is that after some
>>> initialization it calls criu_restore() and is then replaced by the
>>> checkpointed process like exec() would do.
>>
>> Hmm... If I get it right, the restore process might look like this.
>>
>> 1. someone exec()-utes orte-restart with options
>> 2. orte-restart exec()-utes criu with the suid bit on it and with action
>>    "restore" and exec-cmd (recently committed this patch from Deyan)
>>    telling it that after restore it should proceed to step 4
>> 3. criu does the regular restore process. With the suid bit set this
>>    should be possible
>> 4. criu calls exec() on orte-restart again with options telling it that
>>    there's a new subtree alive under it
>>
>> The service would not be suitable for that, as it can only create a
>> detached subtree; we don't have any API in the kernel to re-parent tasks :(
> 
> Okay. Good to know.
> 
>> IOW the syscalls and process tree would look like this
>>
>> 1. exec("orte-restart", "restore", ...)
>>
>>    12 open-mpi-engine
>>    13  `- orte-restart
>>
>> 2. orte calls criu with exec("criu", "restore", ...)
>>
>>    12 open-mpi-engine
>>    13  `- criu restore
>>
>> 3. criu does the restoring -- forking tasks and restoring them
>>
>>    12 open-mpi-engine
>>    13  `- criu restore
>>    125     `- my-openmpi-process
>>
>> 4. criu calls exec("orte-restart", "continue-watching-your-kids", ...)
>>
>>    12 open-mpi-engine
>>    13  `- orte-restart
>>    125     `- my-openmpi-process
>>
>> Does it look like what we want? The question of how to preserve the pipes
>> is still open, but let's sort out how to restore the process linkage first.
> 
> This is almost what we want and I agree resolving process linkage is the
> more important part.
> 
> At the end it should look like this:
> 
>     12 open-mpi-engine
>     13  `- my-openmpi-process

Hm, but the pid of my-openmpi-process should not be newly generated, but
restored to the exact value it used to have, i.e. the tree should
look like

     12 open-mpi-engine
     125  `- my-openmpi-process

> The restart process should be replaced by the process which has been
> restored and the restored process should be the child of the Open MPI
> runtime. 

I see. In other words, orte-restart should transform itself into the
process we want to restore, right?
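
To make the exec() chain from the earlier steps concrete, here is a
minimal Python sketch. It is not CRIU or Open MPI code, and the names
run_exec_chain and AFTER_EXEC are made up for illustration. It only shows
the two properties that plan relies on: exec() keeps the caller's pid, and
a task forked before exec() is still a child of the new image afterwards.
It does not show how the restored task itself could become the direct
child of the runtime, which is the open question here.

```python
import os
import sys

# Program text for the image that replaces the launcher after exec().
# It reaps one child (possible only if that child is still its own) and
# reports "<own pid> <reaped child pid>" through the inherited pipe.
AFTER_EXEC = """
import os, sys
out_fd = int(sys.argv[1])
child_pid, _status = os.wait()
os.write(out_fd, ("%d %d" % (os.getpid(), child_pid)).encode())
"""


def run_exec_chain():
    """Fork a launcher that forks a child and then exec()s a new image.

    Returns (launcher_pid, pid_after_exec, reaped_child_pid).
    """
    read_fd, write_fd = os.pipe()
    launcher_pid = os.fork()
    if launcher_pid == 0:
        # Launcher: stands in for "criu restore" (pid 13 in the trees above).
        try:
            os.close(read_fd)
            os.set_inheritable(write_fd, True)  # keep the pipe across exec()
            if os.fork() == 0:
                os._exit(0)  # stands in for the restored task (pid 125)
            # Stands in for exec("orte-restart", "continue-watching-your-kids").
            os.execv(sys.executable,
                     [sys.executable, "-c", AFTER_EXEC, str(write_fd)])
        finally:
            os._exit(1)  # reached only if something above failed
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = pipe.read()  # returns once every writer has exited
    os.waitpid(launcher_pid, 0)
    pid_after_exec, reaped_child_pid = (int(x) for x in report.split())
    return launcher_pid, pid_after_exec, reaped_child_pid


if __name__ == "__main__":
    launcher, after_exec, child = run_exec_chain()
    print("pid kept across exec():", launcher == after_exec)
    print("child reaped by the new image:", child)
```

The new image can only wait() for the child because the parent/child link
survived the exec(), which is what step 4 depends on.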

Thanks,
Pavel

