[CRIU] About re-parenting
Adrian Reber
adrian at lisas.de
Tue Mar 25 06:08:35 PDT 2014
On Tue, Mar 25, 2014 at 01:54:53AM +0400, Pavel Emelyanov wrote:
> On 03/25/2014 01:21 AM, Adrian Reber wrote:
> > Thanks so far for all your help. I am still unsure how to handle
> > restarting in Open MPI with regard to stdin/stdout redirection and
> > re-parenting. This is partly due to the fact that I still do not
> > understand it completely. Too many overly complex components (Open MPI and
> > criu). I think the stdin/stdout problem can be solved but I am not sure
> > how the re-parenting can/must/should work.
> >
> > Open MPI has orte-restart which analyzes the metadata from the previous
> > checkpoint and then starts mpirun with the correct parameters. mpirun
> > starts the corresponding number of child processes by starting multiple
> > copies of opal-restart. opal-restart then tries to restart the
> > checkpointed process using CRIU. What it expects is that after some
> > initialization it calls criu_restore() and is then replaced by the
> > checkpointed process like exec() would do.
>
> Hmm... If I get it right, the restore process might look like this.
>
> 1. someone exec()-utes orte-restart with options
> 2. orte-restart exec()-utes criu with suid bit on it and with action
> "restore" and exec-cmd (recently committed this patch from Deyan)
> telling that after restore it should proceed to step 4
> 3. criu does regular restore process. With suid bit it should be possible
> 4. criu calls exec() on orte-restart again with options telling it that
> there's a new subtree alive under it
>
> The service would not be suitable for that, as it can only create a detached
> subtree; we don't have any API in the kernel to re-parent tasks :(
Okay. Good to know.
> IOW the syscalls and process tree would look like this
>
> 1. exec("orte-restart", "restore", ...)
>
> 12 open-mpi-engine
> 13 `- orte-restart
>
> 2. orte calls criu with exec("criu", "restore", ...)
>
> 12 open-mpi-engine
> 13 `- criu restore
>
> 3. criu does restoring -- forking tasks and restoring them
>
> 12 open-mpi-engine
> 13 `- criu restore
> 125 `- my-openmpi-process
>
> 4. criu calls exec("orte-restart", "continue-watching-your-kids", ...)
>
> 12 open-mpi-engine
> 13 `- orte-restart
> 125 `- my-openmpi-process
>
> Does it look like what we want? The question of how to preserve the pipes is
> still open, but let's sort out how to restore the process linkage first.
This is almost what we want and I agree resolving process linkage is the
more important part.
At the end it should look like this:
12 open-mpi-engine
13 `- my-openmpi-process
The restart process should be replaced by the process which has been
restored and the restored process should be the child of the Open MPI
runtime.
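
To illustrate what "replaced like exec() would do" means for process
linkage, here is a minimal C sketch. It is not from this thread and the
function name is hypothetical. It forks a child that exec()s /bin/sh and
checks that the shell reports the same PID and the same parent as before
the exec(), which is the property the restored process would need relative
to the Open MPI runtime. It says nothing about what criu does internally.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a child that calls exec() keeps both
 * its PID and its parent, 0 if not, -1 on error. */
int exec_keeps_pid_and_parent(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: route stdout into the pipe, then replace ourselves
         * with a shell that prints its own PID and its parent's PID. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", "echo $$ $PPID", (char *)NULL);
        _exit(127); /* only reached if exec() failed */
    }

    /* Parent: read what the exec()ed program reported about itself. */
    close(fds[1]);
    char buf[64] = {0};
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    close(fds[0]);
    waitpid(child, NULL, 0);
    if (n <= 0)
        return -1;

    long pid_seen = 0, ppid_seen = 0;
    if (sscanf(buf, "%ld %ld", &pid_seen, &ppid_seen) != 2)
        return -1;

    /* Same PID as the forked child, and still our child. */
    return pid_seen == (long)child && ppid_seen == (long)getpid();
}
```

In the tree above, PID 13 plays the role of the forked child: whatever
ends up running as PID 13 after the restart still has PID 12 as its parent.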
> > From what I understand of CRIU, criu_restore() calls the criu
> > service via RPC, which restores the specified process. The problem, as I
> > understand it, is that the new process is not a child of the calling
> > process (opal-restart) but of the criu service, and therefore it needs
> > to be re-parented from the criu service to mpirun, which is running
> > opal-restart, which is expected to be replaced by the new process.
> >
> > I understand that LXC has similar problems which it tries to address by
> > the recent patch (on which I am also on CC, thanks).
>
> Yup. I've just applied it and pushed it to GitHub. Until we do the 1.3 release
> we can alter the API the way we want. And it's perfectly fine with me to
> delay the release as long as we need to make the --exec-cmd serve our
> needs.
I will test it in the next few days. Although I am not 100% sure I need it.
Maybe I get what I need by just calling exec("criu", "restore", ...)
instead of using the library.
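
A rough sketch of that alternative in C, under stated assumptions: the
helper names are hypothetical, the image directory passed with -D is
whatever the Open MPI metadata points to, and the optional follow-up
command is handed to criu through --exec-cmd as discussed above. The
argument list that opal-restart would really need is not settled in this
thread, so treat this only as an illustration of the shape of the call.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: fill argv (capacity cap) with a "criu restore"
 * command line.
 *   images_dir: directory holding the checkpoint images (assumption)
 *   follow_up:  optional NULL-terminated command for --exec-cmd, or NULL
 * Returns the number of arguments written (excluding the terminating
 * NULL), or -1 if images_dir is NULL or argv is too small. */
int build_criu_restore_argv(const char *images_dir,
                            char *const follow_up[],
                            const char *argv[], size_t cap)
{
    size_t extra = 0;
    if (follow_up != NULL)
        while (follow_up[extra] != NULL)
            extra++;

    /* 4 fixed words, optionally "--exec-cmd" "--" plus the command,
     * and the terminating NULL. */
    size_t need = 4 + (extra ? 2 + extra : 0) + 1;
    if (images_dir == NULL || cap < need)
        return -1;

    size_t n = 0;
    argv[n++] = "criu";
    argv[n++] = "restore";
    argv[n++] = "-D";
    argv[n++] = images_dir;
    if (extra) {
        argv[n++] = "--exec-cmd";
        argv[n++] = "--";
        for (size_t i = 0; i < extra; i++)
            argv[n++] = follow_up[i];
    }
    argv[n] = NULL;
    return (int)n;
}

/* Replace the calling process with criu; returns -1 only if exec fails. */
int exec_criu_restore(const char *argv[])
{
    execvp(argv[0], (char *const *)argv);
    return -1;
}
```

Because exec_criu_restore() does not fork, criu takes over the PID of the
caller, so it stays a child of whatever started opal-restart.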
> > So right now I am still not sure what the correct way to solve this
> > problem is. I am writing to you mainly to get some feedback and I hope
> > that you understand how restarting is expected to work in Open MPI.
> > Unfortunately Open MPI is very complex, so I still have not
> > completely understood how checkpoint/restart should work, but I will
> > keep working to understand it and get CRIU working with it.
> >
> > Adrian