[CRIU] CRIU restart problem on slurm with multinode executions

Thu Jul 7 17:20:11 MSK 2022

Please submit your question as a GitHub issue and include the
checkpoint and restore logs. However, if the restore itself succeeds
and the failure then happens inside OpenMPI, we cannot really help you.
We do not know much about OpenMPI.
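
To make the logs easy to attach, something like the following might
help. It is only a sketch: it assumes the dump and restore were run with
"-o dump.log" and "-o restore.log" (options that are not in the commands
quoted below), so that CRIU writes its logs into the image directory.
The function name collect_criu_logs and the archive name are made up for
illustration.

```shell
#!/bin/sh
# Sketch: bundle CRIU log files from an image directory into one archive.
# Assumption: the logs are named dump.log and restore.log, i.e. criu was
# invoked with "-o dump.log" / "-o restore.log" and no separate work dir.

collect_criu_logs() {
    img_dir=$1   # image directory, e.g. checkpoint/
    out=$2       # archive to create, e.g. criu-logs.tar.gz

    # Reuse the positional parameters as the list of logs that exist.
    set --
    for f in dump.log restore.log; do
        [ -f "$img_dir/$f" ] && set -- "$@" "$f"
    done

    if [ $# -eq 0 ]; then
        echo "no CRIU logs found in $img_dir" >&2
        return 1
    fi

    # Store the logs with paths relative to the image directory.
    tar -czf "$out" -C "$img_dir" "$@"
}
```

With the logs in place, "collect_criu_logs checkpoint/ criu-logs.tar.gz"
would produce a single file to attach to the issue. For a multi-node run
it is probably worth collecting the logs from every node.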

		Adrian

On Wed, Jul 06, 2022 at 07:48:05PM +0300, Klodjan Hidri wrote:
> Hello,
> I need some help with CRIU restoring the BT application from the NAS
> benchmark.
> 
> My cluster has four nodes, and I have installed one Docker container on
> each node. Inside each container I have installed Slurm, OpenMPI, and
> CRIU on CentOS 7. The containers are connected over TCP/IP Ethernet, not
> InfiniBand or RDMA.
> 
> When I run OpenMPI without Slurm, checkpoint/restore works perfectly in
> a distributed run (with 4 nodes). When I run OpenMPI with Slurm on a
> single node, checkpoint/restore also works perfectly. But when I run
> OpenMPI with Slurm on multiple nodes, the restore fails and the
> application cannot be restored. The error is:
> An MPI communication peer process has unexpectedly disconnected.  This
> usually indicates a failure in the peer process (e.g., a crash or
> otherwise exiting without calling MPI_FINALIZE first).
> 
> Although this local MPI process will likely now behave unpredictably
> (it may even hang or crash), the root cause of this problem is the
> failure of the peer -- that is what you need to investigate.  For
> example, there may be a core file that you can examine.  More
> generally: such peer hangups are frequently caused by application bugs
> or other external events.
> 
>  Local host: hidri-node01
>  Local PID:  13316
>  Peer host:  hidri-node02
> 
> I dump with this command:
> 
> sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats dump \
>     --network-lock iptables --tcp-established --ghost-limit 10000M \
>     -t $(pidof mpirun) -D checkpoint/ --shell-job -v4
> 
> I restore with this command:
> 
> sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats restore \
>     --tcp-established --ghost-limit 10000M -D checkpoint/ --shell-job -v4
> 
> Can you help me find a solution for restoring multi-node runs under
> Slurm?
> 
> Thank you
> Klodjan Hidri - researcher @  Institute of Computer Science ITE-FORTH
> http://archvlsi.ics.forth.gr/

-- 
Adrian Reber <adrian at lisas.de>            http://lisas.de/~adrian/
routing problems on the neural net