[CRIU] criu_restore() in Open MPI problems

Adrian Reber adrian at lisas.de
Wed Apr 9 06:51:01 PDT 2014


On Mon, Mar 24, 2014 at 03:07:11PM +0400, Andrew Vagin wrote:
> > [adrian at dcbz ~]$ lsof -p 19148
> > COMMAND   PID   USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
> > orterun 19148 adrian  cwd       DIR                8,6     4096  4276256 /home/adrian/devel/mpitest
> > orterun 19148 adrian  rtd       DIR                8,6     4096        2 /
> > orterun 19148 adrian  txt       REG                8,6   130809  4853952 /home/adrian/devel/openmpi-trunk/bin/orterun
> > orterun 19148 adrian  mem       REG                8,6    57976  6436012 /usr/lib64/libnss_files-2.18.so
> > orterun 19148 adrian  mem       REG                8,6    69712  6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> > orterun 19148 adrian  mem       REG                8,6  2097264  6426159 /usr/lib64/libc-2.18.so
> > orterun 19148 adrian  mem       REG                8,6   147544  6439986 /usr/lib64/libpthread-2.18.so
> > orterun 19148 adrian  mem       REG                8,6  1159944  6435339 /usr/lib64/libm-2.18.so
> > orterun 19148 adrian  mem       REG                8,6    14608  6440555 /usr/lib64/libutil-2.18.so
> > orterun 19148 adrian  mem       REG                8,6   113320  6435410 /usr/lib64/libnsl-2.18.so
> > orterun 19148 adrian  mem       REG                8,6    44048  6440309 /usr/lib64/librt-2.18.so
> > orterun 19148 adrian  mem       REG                8,6    19512  6433536 /usr/lib64/libdl-2.18.so
> > orterun 19148 adrian  mem       REG                8,6    31832  6422555 /usr/lib64/libcriu.so.1.0
> > orterun 19148 adrian  mem       REG                8,6  2615952  4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> > orterun 19148 adrian  mem       REG                8,6  5260036  4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> > orterun 19148 adrian  mem       REG                8,6   154992  6422554 /usr/lib64/ld-2.18.so
> > orterun 19148 adrian    0u      CHR              136,6      0t0        9 /dev/pts/6
> > orterun 19148 adrian    1u      CHR              136,6      0t0        9 /dev/pts/6
> > orterun 19148 adrian    2u      CHR              136,6      0t0        9 /dev/pts/6
> > orterun 19148 adrian    3u     unix 0xffff8802106faa00      0t0 10591539 socket
> > orterun 19148 adrian    4u     unix 0xffff8802106fa680      0t0 10591540 socket
> > orterun 19148 adrian    5u  a_inode                0,9        0     7173 [eventfd]
> > orterun 19148 adrian    6u      REG               0,17        0 10591541 /dev/shm/open_mpi.0000 (deleted)
> > orterun 19148 adrian    7r     FIFO                0,8      0t0 10591543 pipe
> > orterun 19148 adrian    8w     FIFO                0,8      0t0 10591543 pipe
> > orterun 19148 adrian    9r      DIR                8,6     4096        2 /
> > orterun 19148 adrian   10u     IPv4           10590043      0t0      TCP *:53823 (LISTEN)
> > orterun 19148 adrian   11r     FIFO               0,32      0t0 10590050 /tmp/openmpi-sessions-adrian at dcbz_0/42683/0/debugger_attach_fifo
> > orterun 19148 adrian   12u      CHR                5,2      0t0     8661 /dev/ptmx
> > orterun 19148 adrian   13u     IPv4           10592260      0t0      TCP edur0000.hs-esslingen.de:53823->edur0000.hs-esslingen.de:47855 (ESTABLISHED)
> > orterun 19148 adrian   15w     FIFO                0,8      0t0 10590053 pipe
> > orterun 19148 adrian   16r     FIFO                0,8      0t0 10590054 pipe
> > orterun 19148 adrian   18r     FIFO                0,8      0t0 10590055 pipe
> > [adrian at dcbz ~]$ lsof -p 19151
> > COMMAND     PID   USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
> > orte-test 19151 adrian  cwd       DIR                8,6     4096  4276256 /home/adrian/devel/mpitest
> > orte-test 19151 adrian  rtd       DIR                8,6     4096        2 /
> > orte-test 19151 adrian  txt       REG                8,6     8550  4241596 /home/adrian/devel/mpitest/orte-test2
> > orte-test 19151 adrian  mem       REG                8,6    57976  6436012 /usr/lib64/libnss_files-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6    69712  6457664 /usr/lib64/libprotobuf-c.so.0.0.0
> > orte-test 19151 adrian  mem       REG                8,6  1159944  6435339 /usr/lib64/libm-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6    14608  6440555 /usr/lib64/libutil-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6   113320  6435410 /usr/lib64/libnsl-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6    44048  6440309 /usr/lib64/librt-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6    19512  6433536 /usr/lib64/libdl-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6    31832  6422555 /usr/lib64/libcriu.so.1.0
> > orte-test 19151 adrian  mem       REG                8,6  2615952  4725410 /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0.0.0
> > orte-test 19151 adrian  mem       REG                8,6  5260036  4726410 /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0.0.0
> > orte-test 19151 adrian  mem       REG                8,6  2097264  6426159 /usr/lib64/libc-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6   147544  6439986 /usr/lib64/libpthread-2.18.so
> > orte-test 19151 adrian  mem       REG                8,6 19630134  4725415 /home/adrian/devel/openmpi-trunk/lib/libmpi.so.0.0.0
> > orte-test 19151 adrian  mem       REG                8,6   154992  6422554 /usr/lib64/ld-2.18.so
> > orte-test 19151 adrian    0r     FIFO                0,8      0t0 10590053 pipe
> > orte-test 19151 adrian    1u      CHR             136,11      0t0       14 /dev/pts/11
> > orte-test 19151 adrian    2w     FIFO                0,8      0t0 10590054 pipe
> > orte-test 19151 adrian    3u     unix 0xffff8803ea4f7800      0t0 10590635 socket
> > orte-test 19151 adrian    4u     unix 0xffff8803ea4f7480      0t0 10590636 socket
> > orte-test 19151 adrian    5u  a_inode                0,9        0     7173 [eventfd]
> > orte-test 19151 adrian    6u      REG               0,17        0 10590637 /dev/shm/open_mpi.0000 (deleted)
> > orte-test 19151 adrian    7u     unix 0xffff8801a2044000      0t0 10590639 socket
> > orte-test 19151 adrian    8u     unix 0xffff8801a2046d80      0t0 10590640 socket
> > orte-test 19151 adrian    9u  a_inode                0,9        0     7173 [eventfd]
> > orte-test 19151 adrian   10u     IPv4           10590642      0t0      TCP *:38026 (LISTEN)
> > orte-test 19151 adrian   11u     IPv4           10591547      0t0      TCP edur0000.hs-esslingen.de:47855->edur0000.hs-esslingen.de:53823 (ESTABLISHED)
> > orte-test 19151 adrian   12u     IPv4           10590649      0t0      TCP *:1024 (LISTEN)
> > orte-test 19151 adrian   19w     FIFO                0,8      0t0 10590055 pipe
> 
> You can see that orterun and orte-test2 have three common pipes. They
> are created by orterun. As I understand it, orterun is not dumped, so
> these pipes are external resources for CRIU and we will need to write a
> plugin for restoring them.
> 
> I think the restore scheme should look like this:
> We run orterun, which prepares the pipes and executes "criu restore".
> The Open MPI plugin takes the prepared pipes and restores them to the
> proper file descriptors.
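
To check my understanding of that scheme, here is a minimal sketch in C
of the generic part only: the parent creates a pipe, forks, moves the
read end to the descriptor number the restored task expects, and only
then execs the next program. The names place_fd() and spawn_with_pipe()
are made up for illustration. Nothing here is CRIU or Open MPI API, and
the plugin side that would adopt the inherited descriptor during restore
is not shown.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: move a prepared descriptor to the number the
 * restored task expects, then close the original. */
static int place_fd(int prepared, int target)
{
	if (prepared == target)
		return 0;
	if (dup2(prepared, target) < 0)
		return -1;
	return close(prepared);
}

/*
 * Hypothetical helper: fork a child that sees the read end of a fresh
 * pipe on target_fd and then execs argv. The parent keeps the write
 * end, returned via *parent_wr. Returns the child's pid, or -1 on error.
 */
static pid_t spawn_with_pipe(char *const argv[], int target_fd, int *parent_wr)
{
	int p[2];
	pid_t pid;

	if (pipe(p) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: drop the write end, place the read end, exec. */
		close(p[1]);
		if (place_fd(p[0], target_fd) < 0)
			_exit(127);
		execvp(argv[0], argv);
		_exit(127); /* only reached if exec failed */
	}

	/* Parent: keep only the write end. */
	close(p[0]);
	*parent_wr = p[1];
	return pid;
}
```

In the lsof output above, orte-test2 reads fd 0 from the pipe whose
write end is orterun's fd 15, so orterun would call something like
spawn_with_pipe(argv, 0, &wr) with argv describing the 'criu restore'
invocation. The exact arguments are left open here.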

It took me a while, but I have now tried to restore a process from
orterun/mpirun by exec()ing 'criu restore'. I still get this error:

(00.030652)   4277: tty: open type pts id 0x2 index 14 (master 0 sid 0 pgrp 0 inherit 1)
(00.030654)   4277: Error (tty.c:541): tty: Can't dup SELF_STDIN_OFF: Bad file descriptor
(00.031052) Error (cr-restore.c:1035): 4277 exited, status=255
(00.031072) Error (cr-restore.c:1577): Restoring FAILED.

You mentioned a plugin that restores the missing pipes. What would
such a plugin have to look like? Do you have any examples of
re-creating pipes in a plugin?

		Adrian