<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hello,<br>
I need some help using CRIU to restore the BT application from the NAS
Parallel Benchmarks.<br>
</p>
<p>My cluster has four nodes, and I have installed one Docker
container on each node.<br>
Each container runs CentOS 7 with Slurm, OpenMPI, and CRIU
installed.<br>
The containers are connected over TCP/IP Ethernet,
not InfiniBand or RDMA.<br>
When I run OpenMPI without Slurm, checkpoint/restore works
perfectly in a distributed run (on all 4 nodes).<br>
When I run OpenMPI with Slurm on a single node,
checkpoint/restore also works perfectly,<br>
but when I run OpenMPI with Slurm across multiple nodes, the
restore fails: CRIU cannot restore the<br>
application. The error is:<br>
<span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">An MPI
communication peer process has unexpectedly disconnected.
This
</span><br>
usually indicates a failure in the peer process (e.g., a crash
or
<br>
otherwise exiting without calling MPI_FINALIZE first).
<br>
<br>
Although this local MPI process will likely now behave
unpredictably
<br>
(it may even hang or crash), the root cause of this problem is
the
<br>
failure of the peer -- that is what you need to investigate.
For
<br>
example, there may be a core file that you can examine. More
<br>
generally: such peer hangups are frequently caused by
application bugs
<br>
or other external events.
<br>
<br>
Local host: hidri-node01
<br>
Local PID: 13316
<br>
Peer host: hidri-node02<br>
<br>
</span></p>
<p>I dump with this command:<br>
</p>
<p><span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">sudo
/opt/mounts/build/criu-3.17.1/criu/criu --display-stats
dump --network-lock iptables --tcp-established --ghost-limit
10000M -t $(pidof mpirun) -D checkpoint/ --shell-job -v4</span><br>
</span></p>
<p>I restore with this command:<br>
</p>
<p><span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">sudo
/opt/mounts/build/criu-3.17.1/criu/criu --display-stats
restore --tcp-established --ghost-limit 10000M -D checkpoint/
--shell-job -v4</span><br>
</span><br>
Can you help me find a way to make the restore work
for multi-node runs under Slurm?<br>
</p>
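<p>For anyone who wants to reproduce this, here is a minimal sketch that
only prints the two command lines above for a given PID and image
directory. It does not run CRIU. The variable CRIU_BIN and the function
names criu_dump_cmd and criu_restore_cmd are made up for illustration;
the flags are exactly the ones shown above.</p>
<pre>
```shell
#!/bin/sh
# Sketch only: prints the CRIU command lines from this message, runs nothing.
# CRIU_BIN and both function names are illustrative, not part of CRIU.
CRIU_BIN="${CRIU_BIN:-/opt/mounts/build/criu-3.17.1/criu/criu}"

# Print the dump command for process tree root PID $1 and image directory $2.
criu_dump_cmd() {
    echo "sudo $CRIU_BIN --display-stats dump --network-lock iptables" \
         "--tcp-established --ghost-limit 10000M -t $1 -D $2 --shell-job -v4"
}

# Print the restore command for image directory $1.
criu_restore_cmd() {
    echo "sudo $CRIU_BIN --display-stats restore --tcp-established" \
         "--ghost-limit 10000M -D $1 --shell-job -v4"
}
```
</pre>
<p>For example, criu_dump_cmd "$(pidof mpirun)" checkpoint/ prints the
dump line I use, and criu_restore_cmd checkpoint/ prints the restore
line.</p>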
<p>Thank you,<br>
Klodjan Hidri - researcher @ Institute of Computer Science
ITE-FORTH <a class="moz-txt-link-freetext" href="http://archvlsi.ics.forth.gr/">http://archvlsi.ics.forth.gr/</a><br>
</p>
</body>
</html>