<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hello,<br>
      I need some help with CRIU when restoring the BT application from the NAS
      benchmark suite.<br>

    </p>

    <p>I have a four-node cluster with one Docker container on each node.<br>
      Inside each container I have installed Slurm, OpenMPI, and CRIU on CentOS 7.<br>
      The containers are connected over TCP/IP Ethernet, not InfiniBand or RDMA.<br>
      When I run OpenMPI without Slurm, checkpoint/restore works perfectly
      in a distributed run (across 4 nodes).<br>
      When I run OpenMPI with Slurm on a single node, checkpoint/restore
      also works perfectly,<br>
      but when I run OpenMPI with Slurm across multiple nodes, the restore
      fails and the application cannot be restored.<br>
      The error is:<br>

      <span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">An MPI

          communication peer process has unexpectedly disconnected.

          &nbsp;This

        </span><br>

        usually indicates a failure in the peer process (e.g., a crash

        or

        <br>

        otherwise exiting without calling MPI_FINALIZE first).

        <br>

        <br>

        Although this local MPI process will likely now behave

        unpredictably

        <br>

        (it may even hang or crash), the root cause of this problem is

        the

        <br>

        failure of the peer -- that is what you need to investigate.

        &nbsp;For

        <br>

        example, there may be a core file that you can examine. &nbsp;More

        <br>

        generally: such peer hangups are frequently caused by

        application bugs

        <br>

        or other external events.

        <br>

        <br>

        &nbsp;Local host: hidri-node01

        <br>

        &nbsp;Local PID: &nbsp;13316

        <br>

        &nbsp;Peer host: &nbsp;hidri-node02<br>

        <br>

      </span></p>

    <p>I dump with this command:</p>

    <p><span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">sudo

          &nbsp;/opt/mounts/build/criu-3.17.1/criu/criu &nbsp;&nbsp;--display-stats

          dump --network-lock iptables &nbsp;--tcp-established &nbsp;--ghost-limit

          10000M -t $(pidof mpirun) &nbsp;-D checkpoint/ --shell-job -v4</span><br>

      </span></p>
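    <p>One detail in the dump step that may be worth ruling out: pidof prints
      every matching PID, so if more than one mpirun is running on the node,
      -t receives several values. Below is a minimal sketch of a guard. The
      helper name single_pid is hypothetical (it is not part of CRIU or
      Slurm), and the criu flags in the usage comment are simply the ones
      from the dump command above. I do not know whether this is related to
      the multi-node restore failure.</p>

<pre>
```shell
#!/bin/sh
# single_pid: hypothetical helper, not part of CRIU or Slurm.
# Prints its argument and returns 0 only when exactly one PID was given.
single_pid() {
    # $1 is left unquoted on purpose so a list like "12 34" splits into words.
    set -- $1
    if [ "$#" -ne 1 ]; then
        return 1
    fi
    printf '%s\n' "$1"
}

# Usage sketch (assumption: same flags as the dump command above):
# pid=$(single_pid "$(pidof mpirun)") || exit 1
# sudo /opt/mounts/build/criu-3.17.1/criu/criu --display-stats dump \
#     --network-lock iptables --tcp-established --ghost-limit 10000M \
#     -t "$pid" -D checkpoint/ --shell-job -v4
```
</pre>

    <p>With this guard the dump is skipped when zero or several mpirun
      processes are found, instead of passing an ambiguous value to -t.</p>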

    <p>I restore with this command:<br>

    </p>

    <p><span style="font-family:monospace"><span style="color:#000000;background-color:#ffffff;">sudo

          &nbsp;/opt/mounts/build/criu-3.17.1/criu/criu --display-stats

          restore --tcp-established --ghost-limit 10000M -D checkpoint/

          --shell-job&nbsp; -v4</span><br>

      </span><br>

      Can you help me find a solution for restoring Slurm multi-node runs?<br>

    </p>

    <p>Thank you,<br>
      Klodjan Hidri - researcher @ Institute of Computer Science,
      ITE-FORTH <a class="moz-txt-link-freetext" href="http://archvlsi.ics.forth.gr/">http://archvlsi.ics.forth.gr/</a><br>

    </p>

  </body>


</html>