[CRIU] CRIU restart problem on slurm with multinode executions
Klodjan Hidri
klodi_hidri at hotmail.com
Wed Jul 6 19:48:05 MSK 2022
Hello
I need some help using CRIU to restore the BT application from the NAS
Parallel Benchmarks.
I have four nodes in my cluster, with one Docker container on each node.
Each container runs CentOS 7 with Slurm, OpenMPI, and CRIU installed.
The containers are connected over TCP/IP Ethernet, not InfiniBand or RDMA.
When I run OpenMPI without Slurm, checkpoint/restore works perfectly in
distributed mode (4 nodes).
When I run OpenMPI with Slurm on a single node, checkpoint/restore also
works perfectly.
But when I run OpenMPI with Slurm on multiple nodes, the restore fails:
the application cannot be restored. The error is:
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: hidri-node01
Local PID: 13316
Peer host: hidri-node02
I dump with this command:
sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats dump \
    --network-lock iptables --tcp-established --ghost-limit 10000M \
    -t $(pidof mpirun) -D checkpoint/ --shell-job -v4
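Since `pidof mpirun` can print more than one PID if several mpirun processes are running on the node, the value passed to `-t` could be checked first. The following is only a minimal sketch under that assumption; `pick_single_pid` is a hypothetical helper name, not part of CRIU, Slurm, or OpenMPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRIU): takes the output of `pidof mpirun`
# as a single string and prints it back only if it is exactly one numeric PID.
pick_single_pid() {
    # Split the argument on whitespace into the function's positional parameters.
    set -- $1
    if [ "$#" -ne 1 ]; then
        echo "expected exactly one mpirun PID, got $#" >&2
        return 1
    fi
    # Reject anything that is empty or contains a non-digit character.
    case "$1" in
        ''|*[!0-9]*)
            echo "not a numeric PID: $1" >&2
            return 1
            ;;
    esac
    echo "$1"
}
```

It could be used as `pid=$(pick_single_pid "$(pidof mpirun)") || exit 1`, with `$pid` then passed to `-t` in the dump command above.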
I restore with this command:
sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats restore \
    --tcp-established --ghost-limit 10000M -D checkpoint/ --shell-job -v4
Can you help me find a solution for restoring multi-node runs under Slurm?
Thank you,
Klodjan Hidri - researcher @ Institute of Computer Science ITE-FORTH
http://archvlsi.ics.forth.gr/