[CRIU] criu_restore() in Open MPI problems
Andrew Vagin
avagin at parallels.com
Mon Mar 24 04:07:11 PDT 2014
On Wed, Mar 19, 2014 at 03:05:45PM +0100, Adrian Reber wrote:
> On Wed, Mar 19, 2014 at 05:41:40PM +0400, Andrew Vagin wrote:
> > On Wed, Mar 19, 2014 at 11:29:51AM +0100, Adrian Reber wrote:
> > > On Wed, Mar 19, 2014 at 12:33:30PM +0400, Andrew Vagin wrote:
> > > > Could you try out the attached patch?
> > >
> > > With this patch it actually tries to restore the process but fails with:
> > >
> > > (00.026193) 15852: tty: open type pts id 0x2 index 11 (master 0 sid 0 pgrp 0 inherit 1)
> > > (00.026198) 15852: Error (tty.c:541): tty: Can't dup SELF_STDIN_OFF: Bad file descriptor
> > > (00.026782) Error (cr-restore.c:1035): 15852 exited, status=255
> > > (00.026810) Error (cr-restore.c:1577): Restoring FAILED.
> > >
> > > Full log at http://lisas.de/~adrian/criu.log
> > >
> > > Which is probably related to the way Open MPI handles stdout/stderr of
> > > its child processes. I need to find out how this exactly works.
> >
> > As far as I understand you are executing criu as a service, aren't you?
>
> Yes. Criu as a service and libcriu linked into Open MPI. The code is
> something like:
>
> criu_set_images_dir_fd(fd);
>
> criu_set_log_file(mca_crs_criu_component.log_file);
> criu_set_log_level(mca_crs_criu_component.log_level);
> criu_set_tcp_established(mca_crs_criu_component.tcp_established);
> criu_set_shell_job(mca_crs_criu_component.shell_job);
> criu_set_ext_unix_sk(mca_crs_criu_component.ext_unix_sk);
> criu_set_leave_running(mca_crs_criu_component.leave_running);
>
> ret = criu_restore();
>
>
> > We have understood that the shell_job option on restore can't work
> > correctly in this case, because the link to the parent and the session
> > can't be restored correctly. Both of these parameters can only be
> > inherited and cannot be set.
> >
> > Looks like the only way is to execute "criu restore" directly.
> > Maybe we will need to set the suid bit on criu, because it requires
> > CAP_SYS_ADMIN and CAP_SYS_RESOURCE.
> >
> > Adrian, I want to know a bit more about structure of a process tree,
> > could you provide a bit more info:
> >
> > * ps axf -o sid,gid,pid,cmd,uid,gid
>
> 9042 500 19148 500 500 | \_ /home/adrian/devel/openmpi-trunk/bin/orterun --am ft-enable-cr -np 1 orte-test2
> 9042 500 19151 500 500 | \_ orte-test2
>
>
> > * lsof for a process and its parent
>
> [adrian@dcbz ~]$ lsof -p 19148
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> orterun 19148 adrian cwd DIR 8,6 4096 4276256 /home/adrian/devel/mpitest
> orterun 19148 adrian rtd DIR 8,6 4096 2 /
> orterun 19148 adrian txt REG 8,6 130809 4853952 /home/adrian/devel/openmpi-trunk/bin/orterun
> orterun 19148 adrian mem REG 8,6 57976 6436012 /usr/lib64/libnss_files-2.18.so
> orterun 19148 adrian mem REG 8,6 69712 6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> orterun 19148 adrian mem REG 8,6 2097264 6426159 /usr/lib64/libc-2.18.so
> orterun 19148 adrian mem REG 8,6 147544 6439986 /usr/lib64/libpthread-2.18.so
> orterun 19148 adrian mem REG 8,6 1159944 6435339 /usr/lib64/libm-2.18.so
> orterun 19148 adrian mem REG 8,6 14608 6440555 /usr/lib64/libutil-2.18.so
> orterun 19148 adrian mem REG 8,6 113320 6435410 /usr/lib64/libnsl-2.18.so
> orterun 19148 adrian mem REG 8,6 44048 6440309 /usr/lib64/librt-2.18.so
> orterun 19148 adrian mem REG 8,6 19512 6433536 /usr/lib64/libdl-2.18.so
> orterun 19148 adrian mem REG 8,6 31832 6422555 /usr/lib64/libcriu.so.1.0
> orterun 19148 adrian mem REG 8,6 2615952 4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> orterun 19148 adrian mem REG 8,6 5260036 4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> orterun 19148 adrian mem REG 8,6 154992 6422554 /usr/lib64/ld-2.18.so
> orterun 19148 adrian 0u CHR 136,6 0t0 9 /dev/pts/6
> orterun 19148 adrian 1u CHR 136,6 0t0 9 /dev/pts/6
> orterun 19148 adrian 2u CHR 136,6 0t0 9 /dev/pts/6
> orterun 19148 adrian 3u unix 0xffff8802106faa00 0t0 10591539 socket
> orterun 19148 adrian 4u unix 0xffff8802106fa680 0t0 10591540 socket
> orterun 19148 adrian 5u a_inode 0,9 0 7173 [eventfd]
> orterun 19148 adrian 6u REG 0,17 0 10591541 /dev/shm/open_mpi.0000 (deleted)
> orterun 19148 adrian 7r FIFO 0,8 0t0 10591543 pipe
> orterun 19148 adrian 8w FIFO 0,8 0t0 10591543 pipe
> orterun 19148 adrian 9r DIR 8,6 4096 2 /
> orterun 19148 adrian 10u IPv4 10590043 0t0 TCP *:53823 (LISTEN)
> orterun 19148 adrian 11r FIFO 0,32 0t0 10590050 /tmp/openmpi-sessions-adrian@dcbz_0/42683/0/debugger_attach_fifo
> orterun 19148 adrian 12u CHR 5,2 0t0 8661 /dev/ptmx
> orterun 19148 adrian 13u IPv4 10592260 0t0 TCP edur0000.hs-esslingen.de:53823->edur0000.hs-esslingen.de:47855 (ESTABLISHED)
> orterun 19148 adrian 15w FIFO 0,8 0t0 10590053 pipe
> orterun 19148 adrian 16r FIFO 0,8 0t0 10590054 pipe
> orterun 19148 adrian 18r FIFO 0,8 0t0 10590055 pipe
> [adrian@dcbz ~]$ lsof -p 19151
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> orte-test 19151 adrian cwd DIR 8,6 4096 4276256 /home/adrian/devel/mpitest
> orte-test 19151 adrian rtd DIR 8,6 4096 2 /
> orte-test 19151 adrian txt REG 8,6 8550 4241596 /home/adrian/devel/mpitest/orte-test2
> orte-test 19151 adrian mem REG 8,6 57976 6436012 /usr/lib64/libnss_files-2.18.so
> orte-test 19151 adrian mem REG 8,6 69712 6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> orte-test 19151 adrian mem REG 8,6 1159944 6435339 /usr/lib64/libm-2.18.so
> orte-test 19151 adrian mem REG 8,6 14608 6440555 /usr/lib64/libutil-2.18.so
> orte-test 19151 adrian mem REG 8,6 113320 6435410 /usr/lib64/libnsl-2.18.so
> orte-test 19151 adrian mem REG 8,6 44048 6440309 /usr/lib64/librt-2.18.so
> orte-test 19151 adrian mem REG 8,6 19512 6433536 /usr/lib64/libdl-2.18.so
> orte-test 19151 adrian mem REG 8,6 31832 6422555 /usr/lib64/libcriu.so.1.0
> orte-test 19151 adrian mem REG 8,6 2615952 4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> orte-test 19151 adrian mem REG 8,6 5260036 4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> orte-test 19151 adrian mem REG 8,6 2097264 6426159 /usr/lib64/libc-2.18.so
> orte-test 19151 adrian mem REG 8,6 147544 6439986 /usr/lib64/libpthread-2.18.so
> orte-test 19151 adrian mem REG 8,6 19630134 4725415 /home/adrian/devel/openmpi-trunk/lib/libmpi.so.0.0.0
> orte-test 19151 adrian mem REG 8,6 154992 6422554 /usr/lib64/ld-2.18.so
> orte-test 19151 adrian 0r FIFO 0,8 0t0 10590053 pipe
> orte-test 19151 adrian 1u CHR 136,11 0t0 14 /dev/pts/11
> orte-test 19151 adrian 2w FIFO 0,8 0t0 10590054 pipe
> orte-test 19151 adrian 3u unix 0xffff8803ea4f7800 0t0 10590635 socket
> orte-test 19151 adrian 4u unix 0xffff8803ea4f7480 0t0 10590636 socket
> orte-test 19151 adrian 5u a_inode 0,9 0 7173 [eventfd]
> orte-test 19151 adrian 6u REG 0,17 0 10590637 /dev/shm/open_mpi.0000 (deleted)
> orte-test 19151 adrian 7u unix 0xffff8801a2044000 0t0 10590639 socket
> orte-test 19151 adrian 8u unix 0xffff8801a2046d80 0t0 10590640 socket
> orte-test 19151 adrian 9u a_inode 0,9 0 7173 [eventfd]
> orte-test 19151 adrian 10u IPv4 10590642 0t0 TCP *:38026 (LISTEN)
> orte-test 19151 adrian 11u IPv4 10591547 0t0 TCP edur0000.hs-esslingen.de:47855->edur0000.hs-esslingen.de:53823 (ESTABLISHED)
> orte-test 19151 adrian 12u IPv4 10590649 0t0 TCP *:1024 (LISTEN)
> orte-test 19151 adrian 19w FIFO 0,8 0t0 10590055 pipe
> [adrian@dcbz ~]$
>
Hello Adrian,

You can see that orterun and orte-test2 have three pipes in common
(inodes 10590053, 10590054 and 10590055 in the lsof output). They are
created by orterun. As I understand it, orterun is not dumped, so
these pipes are external resources for CRIU and we will need to write a
plugin to restore them.

I think the restore scheme should look like this:
orterun prepares the pipes and executes "criu restore".
The Open MPI plugin takes the prepared pipes and restores them into the
proper file descriptors.
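
The descriptor-placement step of this scheme can be sketched in plain
POSIX C. This is only an illustration under stated assumptions: it is
not CRIU plugin code and uses no CRIU API. The helper names move_fd()
and make_child_pipe() are hypothetical, and the target number 19 in the
usage note is taken from the lsof output above, where orte-test2 holds
the write end of pipe 10590055 on descriptor 19.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: move an open descriptor to a fixed number.
 * Returns the target number, or -1 on error. */
int move_fd(int fd, int target)
{
	if (fd == target)
		return target;
	if (dup2(fd, target) < 0)
		return -1;
	close(fd);
	return target;
}

/* Hypothetical helper: create a pipe, put the end the restored process
 * should own on child_target, and return the other end through
 * launcher_end. child_reads selects which end goes to the restored
 * process (nonzero: read end, zero: write end).
 * Returns 0 on success, -1 on error. */
int make_child_pipe(int child_target, int child_reads, int *launcher_end)
{
	int p[2];
	int child_fd, other_fd;

	assert(launcher_end != NULL);

	if (pipe(p) < 0)
		return -1;

	child_fd = child_reads ? p[0] : p[1];
	other_fd = child_reads ? p[1] : p[0];

	/* If the launcher's end already sits on the target number, move it
	 * out of the way so that dup2() does not silently close it. */
	if (other_fd == child_target) {
		int moved = fcntl(other_fd, F_DUPFD, child_target + 1);

		if (moved < 0) {
			close(p[0]);
			close(p[1]);
			return -1;
		}
		close(other_fd);
		other_fd = moved;
	}

	if (move_fd(child_fd, child_target) < 0) {
		close(child_fd);
		close(other_fd);
		return -1;
	}

	*launcher_end = other_fd;
	return 0;
}
```

For example, make_child_pipe(19, 0, &fd) leaves the write end of a new
pipe on descriptor 19 and returns the read end in fd. A real launcher
would do the placement in the child after fork(), before the restore
is started, so that its own descriptors are not disturbed.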
>
> > Thanks.
> >
> > >
> > > > On Wed, Mar 19, 2014 at 12:19:43PM +0400, Andrew Vagin wrote:
> > > > > On Tue, Mar 18, 2014 at 10:42:41PM +0400, Cyrill Gorcunov wrote:
> > > > > > On Tue, Mar 18, 2014 at 07:22:55PM +0100, Adrian Reber wrote:
> > > > > > > On Tue, Mar 18, 2014 at 09:15:04PM +0400, Cyrill Gorcunov wrote:
> > > > > > > > On Tue, Mar 18, 2014 at 06:03:18PM +0100, Adrian Reber wrote:
> > > > > > > > > > Now that dumping works from Open MPI I am trying to restore.
> > > > > > > > > Right now it fails with:
> > > > > > > > >
> > > > > > > > > (00.000119) TCP queue memory limits are 2097152:3145728
> > > > > > > > > (00.000303) cpu: fpu:1 fxsr:1 xsave:1
> > > > > > > > > (00.000399) vdso: Parsing at 7fff84c27000 7fff84c29000
> > > > > > > > > (00.000407) vdso: Base address ffffffffff700000
> > > > > > > > > (00.000440) Reading image tree
> > > > > > > > > (00.000468) Migrating process tree (GID 25983->29676 SID 9042->29676)
> > > > > > > > > (00.000475) Will restore in 0 namespaces
> > > > > > > > > (00.000479) NS mask to use 0
> > > > > > > > > (00.000487) Collecting 41/21 (flags 0)
> > > > > > > > > (00.000514) `- ... done
> > > > > > > > > (00.000520) Error (tty.c:1213): tty: Standard stream is not a terminal, aborting
> > > > > > > > >
> > > > > > > > > > I am not sure what this really means, but I suspect it has something
> > > > > > > > > > to do with dumping with criu_set_shell_job(true) and restoring from
> > > > > > > > > inside a program instead of the command line. Running the command line
> > > > > > > > > tool instead of the criu_restore() works much better but fails in the
> > > > > > > > > end with:
> > > > > > > >
> > > > > > > > > Have you been dumping with the --shell-job option? If yes, would it do the
> > > > > > > > trick without this option?
> > > > > > >
> > > > > > > Yes, I dumped with the shell_job option. Without shell_job it does not dump:
> > > > > > >
> > > > > > > Error (pstree.c:196): The root process 26660 is not a session leader. Consider using --shell-job option
> > > > > >
> > > > > > Heh ;-) Could you please show ls -l /proc/<pid>/fd where <pid> is the pid of a process
> > > > > > you're dumping (and also try dump with -v4 --shell-job and show complete dump log).
> > > > >
> > > > > Steps to reproduce:
> > > > > sleep 1000 &> /dev/null < /dev/null &
> > > > > ./criu dump -t $! -D tmp --shell-job
> > > > > ./criu restore -D tmp -o r.log --shell-job < /dev/null &> /dev/null
> > > > >
> > > > > shell-job tries to find a current terminal even if it is not required
> > > > > for restore.
> > > > >
> > > > > > _______________________________________________
> > > > > > CRIU mailing list
> > > > > > CRIU at openvz.org
> > > > > > https://lists.openvz.org/mailman/listinfo/criu
> > >
> > > > diff --git a/tty.c b/tty.c
> > > > index 5fca74c..660b847 100644
> > > > --- a/tty.c
> > > > +++ b/tty.c
> > > > @@ -1215,8 +1215,8 @@ int tty_prep_fds(void)
> > > > return 0;
> > > >
> > > > if (!isatty(STDIN_FILENO)) {
> > > > - pr_err("Standard stream is not a terminal, aborting\n");
> > > > - return -1;
> > > > + pr_warn("Standard stream is not a terminal\n");
> > > > + return 0;
> > > > }
> > > >
> > > > if (install_service_fd(SELF_STDIN_OFF, STDIN_FILENO) < 0) {
>
> Adrian
>
> --
> Adrian Reber <adrian at lisas.de> http://lisas.de/~adrian/
> Finagle's Second Law:
> No matter what the anticipated result, there will always be
> someone eager to (a) misinterpret it, (b) fake it, or (c) believe it
> happened according to his own pet theory.