[CRIU] criu_restore() in Open MPI problems

Andrew Vagin avagin at parallels.com
Mon Mar 24 04:07:11 PDT 2014


On Wed, Mar 19, 2014 at 03:05:45PM +0100, Adrian Reber wrote:
> On Wed, Mar 19, 2014 at 05:41:40PM +0400, Andrew Vagin wrote:
> > On Wed, Mar 19, 2014 at 11:29:51AM +0100, Adrian Reber wrote:
> > > On Wed, Mar 19, 2014 at 12:33:30PM +0400, Andrew Vagin wrote:
> > > > Could you try out the attached patch?
> > > 
> > > With this patch it actually tries to restore the process but fails with:
> > > 
> > > (00.026193)  15852: tty: open type pts id 0x2 index 11 (master 0 sid 0 pgrp 0 inherit 1)
> > > (00.026198)  15852: Error (tty.c:541): tty: Can't dup SELF_STDIN_OFF: Bad file descriptor
> > > (00.026782) Error (cr-restore.c:1035): 15852 exited, status=255
> > > (00.026810) Error (cr-restore.c:1577): Restoring FAILED.
> > > 
> > > Full log at http://lisas.de/~adrian/criu.log
> > > 
> > > Which is probably related to the way Open MPI handles stdout/stderr of
> > > its child processes. I need to find out how this exactly works.
> > 
> > As far as I understand you are executing criu as a service, aren't you?
> 
> Yes. Criu as a service and libcriu linked into Open MPI. The code is
> something like:
> 
>     criu_set_images_dir_fd(fd);
> 
>     criu_set_log_file(mca_crs_criu_component.log_file);
>     criu_set_log_level(mca_crs_criu_component.log_level);
>     criu_set_tcp_established(mca_crs_criu_component.tcp_established);
>     criu_set_shell_job(mca_crs_criu_component.shell_job);
>     criu_set_ext_unix_sk(mca_crs_criu_component.ext_unix_sk);
>     criu_set_leave_running(mca_crs_criu_component.leave_running);
> 
>     ret = criu_restore();
> 
> 
> > We have understood that the shell_job option on restore can't work
> > correctly in this case, because the link to the parent and the session
> > can't be restored correctly. Both of these parameters can only be
> > inherited; they cannot be set.
> > 
> > It looks like the only way is to execute "criu restore" directly.
> > Maybe we will need to set the suid bit on criu, because it requires
> > CAP_SYS_ADMIN and CAP_SYS_RESOURCE.
> > 
> > Adrian, I want to know a bit more about the structure of the process
> > tree. Could you provide the following:
> > 
> > * ps axf -o sid,gid,pid,cmd,uid,gid
> 
>  9042   500 19148   500   500  |   \_ /home/adrian/devel/openmpi-trunk/bin/orterun --am ft-enable-cr -np 1 orte-test2
>  9042   500 19151   500   500  |       \_ orte-test2
> 
> 
> > * lsof for a process and its parent
> 
> [adrian at dcbz ~]$ lsof -p 19148
> COMMAND   PID   USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
> orterun 19148 adrian  cwd       DIR                8,6     4096  4276256 /home/adrian/devel/mpitest
> orterun 19148 adrian  rtd       DIR                8,6     4096        2 /
> orterun 19148 adrian  txt       REG                8,6   130809  4853952 /home/adrian/devel/openmpi-trunk/bin/orterun
> orterun 19148 adrian  mem       REG                8,6    57976  6436012 /usr/lib64/libnss_files-2.18.so
> orterun 19148 adrian  mem       REG                8,6    69712  6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> orterun 19148 adrian  mem       REG                8,6  2097264  6426159 /usr/lib64/libc-2.18.so
> orterun 19148 adrian  mem       REG                8,6   147544  6439986 /usr/lib64/libpthread-2.18.so
> orterun 19148 adrian  mem       REG                8,6  1159944  6435339 /usr/lib64/libm-2.18.so
> orterun 19148 adrian  mem       REG                8,6    14608  6440555 /usr/lib64/libutil-2.18.so
> orterun 19148 adrian  mem       REG                8,6   113320  6435410 /usr/lib64/libnsl-2.18.so
> orterun 19148 adrian  mem       REG                8,6    44048  6440309 /usr/lib64/librt-2.18.so
> orterun 19148 adrian  mem       REG                8,6    19512  6433536 /usr/lib64/libdl-2.18.so
> orterun 19148 adrian  mem       REG                8,6    31832  6422555 /usr/lib64/libcriu.so.1.0
> orterun 19148 adrian  mem       REG                8,6  2615952  4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> orterun 19148 adrian  mem       REG                8,6  5260036  4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> orterun 19148 adrian  mem       REG                8,6   154992  6422554 /usr/lib64/ld-2.18.so
> orterun 19148 adrian    0u      CHR              136,6      0t0        9 /dev/pts/6
> orterun 19148 adrian    1u      CHR              136,6      0t0        9 /dev/pts/6
> orterun 19148 adrian    2u      CHR              136,6      0t0        9 /dev/pts/6
> orterun 19148 adrian    3u     unix 0xffff8802106faa00      0t0 10591539 socket
> orterun 19148 adrian    4u     unix 0xffff8802106fa680      0t0 10591540 socket
> orterun 19148 adrian    5u  a_inode                0,9        0     7173 [eventfd]
> orterun 19148 adrian    6u      REG               0,17        0 10591541 /dev/shm/open_mpi.0000 (deleted)
> orterun 19148 adrian    7r     FIFO                0,8      0t0 10591543 pipe
> orterun 19148 adrian    8w     FIFO                0,8      0t0 10591543 pipe
> orterun 19148 adrian    9r      DIR                8,6     4096        2 /
> orterun 19148 adrian   10u     IPv4           10590043      0t0      TCP *:53823 (LISTEN)
> orterun 19148 adrian   11r     FIFO               0,32      0t0 10590050 /tmp/openmpi-sessions-adrian at dcbz_0/42683/0/debugger_attach_fifo
> orterun 19148 adrian   12u      CHR                5,2      0t0     8661 /dev/ptmx
> orterun 19148 adrian   13u     IPv4           10592260      0t0      TCP edur0000.hs-esslingen.de:53823->edur0000.hs-esslingen.de:47855 (ESTABLISHED)
> orterun 19148 adrian   15w     FIFO                0,8      0t0 10590053 pipe
> orterun 19148 adrian   16r     FIFO                0,8      0t0 10590054 pipe
> orterun 19148 adrian   18r     FIFO                0,8      0t0 10590055 pipe
> [adrian at dcbz ~]$ lsof -p 19151
> COMMAND     PID   USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
> orte-test 19151 adrian  cwd       DIR                8,6     4096  4276256 /home/adrian/devel/mpitest
> orte-test 19151 adrian  rtd       DIR                8,6     4096        2 /
> orte-test 19151 adrian  txt       REG                8,6     8550  4241596 /home/adrian/devel/mpitest/orte-test2
> orte-test 19151 adrian  mem       REG                8,6    57976  6436012 /usr/lib64/libnss_files-2.18.so
> orte-test 19151 adrian  mem       REG                8,6    69712  6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> orte-test 19151 adrian  mem       REG                8,6  1159944  6435339 /usr/lib64/libm-2.18.so
> orte-test 19151 adrian  mem       REG                8,6    14608  6440555 /usr/lib64/libutil-2.18.so
> orte-test 19151 adrian  mem       REG                8,6   113320  6435410 /usr/lib64/libnsl-2.18.so
> orte-test 19151 adrian  mem       REG                8,6    44048  6440309 /usr/lib64/librt-2.18.so
> orte-test 19151 adrian  mem       REG                8,6    19512  6433536 /usr/lib64/libdl-2.18.so
> orte-test 19151 adrian  mem       REG                8,6    31832  6422555 /usr/lib64/libcriu.so.1.0
> orte-test 19151 adrian  mem       REG                8,6  2615952  4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> orte-test 19151 adrian  mem       REG                8,6  5260036  4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> orte-test 19151 adrian  mem       REG                8,6  2097264  6426159 /usr/lib64/libc-2.18.so
> orte-test 19151 adrian  mem       REG                8,6   147544  6439986 /usr/lib64/libpthread-2.18.so
> orte-test 19151 adrian  mem       REG                8,6 19630134  4725415 /home/adrian/devel/openmpi-trunk/lib/libmpi.so.0.0.0
> orte-test 19151 adrian  mem       REG                8,6   154992  6422554 /usr/lib64/ld-2.18.so
> orte-test 19151 adrian    0r     FIFO                0,8      0t0 10590053 pipe
> orte-test 19151 adrian    1u      CHR             136,11      0t0       14 /dev/pts/11
> orte-test 19151 adrian    2w     FIFO                0,8      0t0 10590054 pipe
> orte-test 19151 adrian    3u     unix 0xffff8803ea4f7800      0t0 10590635 socket
> orte-test 19151 adrian    4u     unix 0xffff8803ea4f7480      0t0 10590636 socket
> orte-test 19151 adrian    5u  a_inode                0,9        0     7173 [eventfd]
> orte-test 19151 adrian    6u      REG               0,17        0 10590637 /dev/shm/open_mpi.0000 (deleted)
> orte-test 19151 adrian    7u     unix 0xffff8801a2044000      0t0 10590639 socket
> orte-test 19151 adrian    8u     unix 0xffff8801a2046d80      0t0 10590640 socket
> orte-test 19151 adrian    9u  a_inode                0,9        0     7173 [eventfd]
> orte-test 19151 adrian   10u     IPv4           10590642      0t0      TCP *:38026 (LISTEN)
> orte-test 19151 adrian   11u     IPv4           10591547      0t0      TCP edur0000.hs-esslingen.de:47855->edur0000.hs-esslingen.de:53823 (ESTABLISHED)
> orte-test 19151 adrian   12u     IPv4           10590649      0t0      TCP *:1024 (LISTEN)
> orte-test 19151 adrian   19w     FIFO                0,8      0t0 10590055 pipe
> [adrian at dcbz ~]$ 
>

Hello Adrian,

You can see that orterun and orte-test2 have three pipes in common. They
are created by orterun. As I understand it, orterun is not dumped, so
these pipes are external resources for CRIU and we will need to write a
plugin for restoring them.

I think the restore scheme should look like this:
We run orterun, which prepares the pipes and executes "criu restore".
The Open MPI plugin takes the prepared pipes and restores them into the
proper file descriptors.
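
To illustrate the last step, here is a minimal sketch of moving a prepared
pipe end onto the descriptor number the restored process expects (for
example fd 19 of orte-test2 in the lsof output). It is not the CRIU plugin
API; place_fd() and demo_pipe_round_trip() are hypothetical names used only
for this example, and it assumes the target number is not otherwise in use.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: move the prepared descriptor `src` to the
 * number `target` that the restored process had at dump time. */
static int place_fd(int src, int target)
{
	assert(src >= 0 && target >= 0);

	if (src == target)
		return 0;		/* already in place */
	if (dup2(src, target) < 0)
		return -1;
	close(src);			/* keep only the copy at `target` */
	return 0;
}

/* Hypothetical demo: create a pipe the way the launcher would, install
 * its write end at `target`, and check that data still flows through. */
static int demo_pipe_round_trip(int target)
{
	int p[2];
	char buf[2];
	int ret = -1;

	if (pipe(p) < 0)
		return -1;

	/* If the read end landed on `target`, move it out of the way first. */
	if (p[0] == target) {
		int moved = dup(p[0]);

		if (moved < 0) {
			close(p[0]);
			close(p[1]);
			return -1;
		}
		close(p[0]);
		p[0] = moved;
	}

	if (place_fd(p[1], target) < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	if (write(target, "ok", 2) == 2 &&
	    read(p[0], buf, 2) == 2 &&
	    memcmp(buf, "ok", 2) == 0)
		ret = 0;

	close(target);
	close(p[0]);
	return ret;
}
```

In the real scheme the read end would stay in orterun and the write end
would be handed to the restored task, so both ends would not live in one
process as they do in this demo.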

> 
> > Thanks.
> > 
> > > 
> > > > On Wed, Mar 19, 2014 at 12:19:43PM +0400, Andrew Vagin wrote:
> > > > > On Tue, Mar 18, 2014 at 10:42:41PM +0400, Cyrill Gorcunov wrote:
> > > > > > On Tue, Mar 18, 2014 at 07:22:55PM +0100, Adrian Reber wrote:
> > > > > > > On Tue, Mar 18, 2014 at 09:15:04PM +0400, Cyrill Gorcunov wrote:
> > > > > > > > On Tue, Mar 18, 2014 at 06:03:18PM +0100, Adrian Reber wrote:
> > > > > > > > > > Now that dumping works from Open MPI, I am trying to restore.
> > > > > > > > > Right now it fails with:
> > > > > > > > > 
> > > > > > > > > (00.000119) TCP queue memory limits are 2097152:3145728
> > > > > > > > > (00.000303) cpu: fpu:1 fxsr:1 xsave:1
> > > > > > > > > (00.000399) vdso: Parsing at 7fff84c27000 7fff84c29000
> > > > > > > > > (00.000407) vdso: Base address ffffffffff700000
> > > > > > > > > (00.000440) Reading image tree
> > > > > > > > > (00.000468) Migrating process tree (GID 25983->29676 SID 9042->29676)
> > > > > > > > > (00.000475) Will restore in 0 namespaces
> > > > > > > > > (00.000479) NS mask to use 0
> > > > > > > > > (00.000487) Collecting 41/21 (flags 0)
> > > > > > > > > (00.000514)  `- ... done
> > > > > > > > > (00.000520) Error (tty.c:1213): tty: Standard stream is not a terminal, aborting
> > > > > > > > > 
> > > > > > > > > > I am not sure what this really means, but I suspect it has
> > > > > > > > > > something to do with dumping with criu_set_shell_job(true) and restoring from
> > > > > > > > > > inside a program instead of the command line. Running the command line
> > > > > > > > > > tool instead of criu_restore() works much better but fails in the
> > > > > > > > > > end with:
> > > > > > > > 
> > > > > > > > > Have you been dumping with the --shell-job option? If yes, would it do the
> > > > > > > > > trick without this option?
> > > > > > > 
> > > > > > > Yes, I dumped with the shell_job option. Without shell_job it does not dump:
> > > > > > > 
> > > > > > > Error (pstree.c:196): The root process 26660 is not a session leader.  Consider using --shell-job option
> > > > > > 
> > > > > > Heh ;-) Could you please show ls -l /proc/<pid>/fd where <pid> is the pid of a process
> > > > > > you're dumping (and also try dump with -v4 --shell-job and show complete dump log).
> > > > > 
> > > > > Steps to reproduce:
> > > > > sleep 1000 &> /dev/null < /dev/null &
> > > > > ./criu dump -t $! -D tmp --shell-job
> > > > > ./criu restore -D tmp -o r.log --shell-job < /dev/null &> /dev/null
> > > > > 
> > > > > shell-job tries to find a current terminal even if it is not required
> > > > > for restore.
> > > > > 
> > > 
> > > > diff --git a/tty.c b/tty.c
> > > > index 5fca74c..660b847 100644
> > > > --- a/tty.c
> > > > +++ b/tty.c
> > > > @@ -1215,8 +1215,8 @@ int tty_prep_fds(void)
> > > >  		return 0;
> > > >  
> > > >  	if (!isatty(STDIN_FILENO)) {
> > > > -		pr_err("Standard stream is not a terminal, aborting\n");
> > > > -		return -1;
> > > > +		pr_warn("Standard stream is not a terminal\n");
> > > > +		return 0;
> > > >  	}
> > > >  
> > > >  	if (install_service_fd(SELF_STDIN_OFF, STDIN_FILENO) < 0) {
> 
> 		Adrian
> 
> -- 
> Adrian Reber <adrian at lisas.de>            http://lisas.de/~adrian/
> Finagle's Second Law:
> 	No matter what the anticipated result, there will always be
> 	someone eager to (a) misinterpret it, (b) fake it, or (c) believe it
> 	happened according to his own pet theory.
