[CRIU] CRIU restart problem on slurm with multinode executions

Klodjan Hidri klodi_hidri at hotmail.com
Wed Jul 6 19:48:05 MSK 2022


Hello,
I need some help with CRIU: I cannot restore the BT application from the NAS
benchmarks when it runs under Slurm on several nodes.

My cluster has four nodes, with one Docker container on each node. Inside each
container I have installed Slurm, OpenMPI and CRIU on CentOS 7. The containers
are connected over TCP/IP Ethernet, not InfiniBand or RDMA.

When I run OpenMPI without Slurm, checkpoint/restore works perfectly on a
distributed run (4 nodes).
When I run OpenMPI with Slurm on a single node, checkpoint/restore also works
perfectly.
But when I run OpenMPI with Slurm on multiple nodes, the restore fails and the
application cannot be restored. The error is:
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: hidri-node01
  Local PID:  13316
  Peer host:  hidri-node02

I dump with this command:

sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats dump \
    --network-lock iptables --tcp-established --ghost-limit 10000M \
    -t $(pidof mpirun) -D checkpoint/ --shell-job -v4

I restore with this command:

sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats restore \
    --tcp-established --ghost-limit 10000M -D checkpoint/ --shell-job -v4
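
In case it helps to reproduce the problem, here is a minimal sketch of how the two commands above could be generated from one place, so that dump and restore always use the same binary, image directory and shared options. It only prints the command lines and does not run CRIU. The variable and function names (CRIU_BIN, IMG_DIR, COMMON_OPTS, criu_dump_cmd, criu_restore_cmd) are illustrative assumptions and not part of CRIU. The options themselves are exactly the ones from the commands above, only in a different order.

```shell
#!/bin/sh
# Sketch only: prints the dump and restore command lines from this message.
# CRIU_BIN, IMG_DIR, COMMON_OPTS and the function names are hypothetical
# helpers chosen for illustration; adjust the paths to your own setup.

CRIU_BIN=/opt/mounts/build/criu-3.17.1/criu/criu
IMG_DIR=checkpoint/

# Options that appear in both the dump and the restore command.
COMMON_OPTS="--tcp-established --ghost-limit 10000M -D $IMG_DIR --shell-job -v4"

# Print the dump command for the process tree rooted at PID $1.
criu_dump_cmd() {
    echo "sudo $CRIU_BIN --display-stats dump --network-lock iptables $COMMON_OPTS -t $1"
}

# Print the matching restore command.
criu_restore_cmd() {
    echo "sudo $CRIU_BIN --display-stats restore $COMMON_OPTS"
}
```

For example, criu_dump_cmd "$(pidof mpirun)" prints the dump command for the running mpirun, assuming pidof returns a single PID, and criu_restore_cmd prints the restore command.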

Can you help me find a solution for restoring with Slurm on multi-node runs?

Thank you
Klodjan Hidri - researcher @  Institute of Computer Science ITE-FORTH 
http://archvlsi.ics.forth.gr/