[CRIU] CRIU and Open MPI

Pavel Emelyanov xemul at parallels.com
Tue Feb 18 07:32:42 PST 2014


On 02/18/2014 05:51 PM, Adrian Reber wrote:
> I have integrated CRIU into Open MPI far enough that I can now checkpoint
> Open MPI processes.

Cool! Thanks, Adrian!

> There are a few problems I would like to mention here, hoping they can be fixed.
> 
> In the current setup I have a test case running under Open MPI and I am
> trying to checkpoint it. The following works:
> 
> /path/to/orterun --np 1 --mca oob tcp orte-test
> 
> and then I am signaling Open MPI to checkpoint its processes with:
> 
> /path/to/orte-checkpoint `pidof orterun`
> 
> 
> orterun receives the information and checkpoints 'test-case' using CRIU
> successfully. '--np 1' means to start one instance of 'orte-test' and
> '--mca oob tcp' specifies to use TCP for communication.
> 
> Using '--np 2' does not work. With '--np 2', 'orte-test' is started two
> times and I see the following in the protocol file:
> 
> (00.033983) Dumping mappings (pid: 14831)
> (00.033985) ----------------------------------------
> (00.033988) 0x400000-0x401000 (4K) prot 0x5 flags 0x2 off 0 reg fp  shmid: 0
> (00.034004) Dumping path for -3 fd via self 32 [/home/adrian/devel/mpitest/orte-test]
> (00.034024) 0x600000-0x601000 (4K) prot 0x1 flags 0x2 off 0 reg fp  shmid: 0
> (00.034032) Dumping path for -3 fd via self 33 [/home/adrian/devel/mpitest/orte-test]
> (00.034044) 0x601000-0x602000 (4K) prot 0x3 flags 0x2 off 0x1000 reg fp  shmid: 0
> (00.034053) Dumping path for -3 fd via self 34 [/home/adrian/devel/mpitest/orte-test]
> (00.034065) 0x602000-0x623000 (132K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034072) 0x623000-0x635000 (72K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034079) 0x635000-0x747000 (1096K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034085) 0x747000-0x787000 (256K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034092) 0x787000-0x888000 (1028K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034098) 0x888000-0x8a9000 (132K) prot 0x3 flags 0x22 off 0 reg heap ap  shmid: 0
> (00.034105) 0x7fffe3fff000-0x7fffe8000000 (65540K) prot 0x3 flags 0x1 off 0 reg fs  shmid: 0
> (00.034116) Dumping path for -3 fd via self 35 [/tmp/openmpi-sessions-adrian@dcbz_0/54685/1/shared_mem_pool.dcbz (deleted)]
> (00.034120) Dumping ghost file for fd 35 id 0x17
> (00.034123) Error (files-reg.c:305): Can't dump ghost file /tmp/openmpi-sessions-adrian@dcbz_0/54685/1/shared_mem_pool.dcbz (deleted) of 67108872 size

This error means that there is a file that is opened and unlinked, and it weighs
about 67 MB. Such files (opened and unlinked) should be copied to the images dir,
since they will be removed from disk once the tasks we dump die.
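
For readers unfamiliar with the term, here is a minimal Python sketch of what
an "opened and unlinked" file looks like from the process side. It is only an
illustration, not CRIU code; the helper names make_ghost_file and is_ghost are
made up for this example.

```python
import os
import tempfile


def make_ghost_file(payload: bytes) -> int:
    """Create a temp file, write payload, unlink it, return the open fd."""
    fd, path = tempfile.mkstemp()
    os.write(fd, payload)
    # The name disappears here, but the data stays while the fd is open.
    os.unlink(path)
    return fd


def is_ghost(fd: int) -> bool:
    """An open file with a link count of zero has no name left on disk."""
    return os.fstat(fd).st_nlink == 0


fd = make_ghost_file(b"shared memory pool contents")
print(is_ghost(fd))            # the file is still usable through fd
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 6))
os.close(fd)                   # last reference gone, the data is freed
```

Once the last descriptor is closed the contents are gone for good, which is
why a checkpoint has to carry a copy of the data in the images.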

I've put a ... magic constant :) limiting the size of such files to a couple of
megs, just to avoid putting overly big files into the images, since such copying
takes time.
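
The size check described above could be sketched roughly as follows. This is
an illustration in Python, not the code from files-reg.c; the 1 MiB value of
ASSUMED_GHOST_LIMIT is an assumption picked for the example (the message only
says "a couple of megs"), and both helper names are hypothetical.

```python
import os
import tempfile

# Assumed cap for illustration only; the real constant lives in CRIU's sources.
ASSUMED_GHOST_LIMIT = 1 << 20  # 1 MiB


def can_dump_ghost(fd: int, limit: int = ASSUMED_GHOST_LIMIT) -> bool:
    """Return True if the file behind fd would pass the size check."""
    st = os.fstat(fd)
    if st.st_nlink != 0:
        return True  # still has a name on disk, nothing needs copying
    return st.st_size <= limit


def unlinked_file_of_size(size: int) -> int:
    """Create an unlinked (ghost) file of the given size, return its fd."""
    fd, path = tempfile.mkstemp()
    os.unlink(path)
    os.ftruncate(fd, size)  # sparse, so no real data is written
    return fd


small = unlinked_file_of_size(4096)
big = unlinked_file_of_size(67108872)  # size reported in the log above
print(can_dump_ghost(small), can_dump_ghost(big))
os.close(small)
os.close(big)
```

With the assumed limit, the 4 KiB file passes and the 67108872-byte file from
the log is rejected, matching the "Can't dump ghost file" error.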

> (00.034126) Error (cr-dump.c:1561): Dump mappings (pid: 14831) failed with -1
> (00.034220) Unlock network
> (00.034229) 	Running iptables [iptables -t filter -D INPUT --protocol tcp --source 134.108.261.222 --sport 48956 --destination 134.108.261.222 --dport 55654 -j DROP]
> (00.037627) Unlocked 134.108.261.222:55654 - 134.108.261.222:48956 connection
> (00.037643) 	Running iptables [iptables -t filter -D OUTPUT --protocol tcp --source 134.108.261.222 --sport 55654 --destination 134.108.261.222 --dport 48956 -j DROP]
> (00.039822) Unlocked 134.108.261.222:48956 - 134.108.261.222:55654 connection
> (00.039877) Unfreezing tasks into 1
> (00.039882) 	Unseizing 14831 into 1
> (00.039929) Error (cr-dump.c:1828): Dumping FAILED.
> 
> 
> 
> Not using '--mca oob tcp' makes Open MPI use unix sockets for internal
> communication. This fails with:
> 
> (00.020013) Dumping path for -3 fd via self 31 [/home/adrian/devel/mpitest]
> (00.020026) Dumping path for -3 fd via self 31 [/]
> (00.020039) Dumping task cwd id 0x4e root id 0x4f
> (00.020086) Dump shared signals of 15002
> (00.020104) Dump private signals of 15002
> (00.020118) Dump private signals of 15003
> (00.020131) Dump private signals of 15004
> (00.020177) 
> (00.020180) Dumping pstree (pid: 15002)
> (00.020181) ----------------------------------------
> (00.020195) Process: 15002(15002)
> (00.020200) ----------------------------------------
> (00.020203) Dumping external sockets
> (00.020205) 	Dumping extern: ino 0x609e69 peer_ino 0x609791 family    1 type    1 state  1 name /tmp/openmpi-sessions-adrian@dcbz_0/55010/0/usock
> (00.020213) 	Dumped extern: id 0x50 ino 0x609e69 peer 0 type 2 state 10 name 50 bytes
> (00.020215) 	Ext stream not supported: ino 0x609e69 peer_ino 0x609791 family    1 type    1 state  1 name /tmp/openmpi-sessions-adrian@dcbz_0/55010/0/usock

It's a stream unix socket connecting the program you dump to some other task
that is not part of the dump. What is this socket used for?
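
To make the kind of socket concrete, here is a small Python sketch that builds
a connected stream unix socket bound to a path named "usock". It is an
illustration only: the helper connected_unix_stream_pair is made up, and both
ends live in one process here for brevity, whereas in the failing dump the two
ends belong to different tasks and only one of them is being dumped.

```python
import os
import socket
import tempfile


def connected_unix_stream_pair():
    """Return (client, server_side) ends of one AF_UNIX stream connection."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "usock")
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(1)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    server_side, _ = listener.accept()
    # The connection survives without the listener and without the path.
    listener.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return client, server_side


client, server_side = connected_unix_stream_pair()
# On Linux both values are 1, as in "family 1 type 1" in the log above.
print(int(client.family), int(client.type))
client.sendall(b"ping")
print(server_side.recv(4))
client.close()
server_side.close()
```

A stream connection only makes sense with both ends present, so dumping one
end while the peer stays outside the dump leaves nothing to reconnect to on
restore.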

> (00.020217) Error (sk-unix.c:543): Can't dump half of stream unix connection.
> (00.020233) Unlock network
> (00.020237) Unfreezing tasks into 1
> (00.020239) 	Unseizing 15002 into 1
> (00.020267) Error (cr-dump.c:1828): Dumping FAILED.
> 
> 
> Are those two errors fixable in CRIU, or do I have to fix them in Open MPI?

The error with the big ghost file needs research. Where does this file come from?

I'm not sure that fixing this by increasing the ghost file size limit is a good
idea. Copying 67 megs during dump will slow things down :(

> The unix sockets error is probably fixable in Open MPI by shutting
> down this communication completely before doing the checkpoint.

The socket issue can be fixed by providing a plugin [1] that handles the Open MPI
socket properly. We will definitely help with any info required to implement it :)

> 		Adrian

Thanks,
Pavel


[1] http://criu.org/Plugins

