[CRIU] [OMPI devel] Open MPI and CRIU stdout/stderr

Wed Mar 19 06:32:00 PDT 2014

On 03/19/2014 05:25 PM, Jeff Squyres (jsquyres) wrote:
> On Mar 19, 2014, at 9:13 AM, Adrian Reber <adrian at lisas.de> wrote:
> 
>> What does Open MPI do with the file descriptors for stdout/stderr?
> 
> We admittedly do funny things with stdin, stdout, and stderr...  The short version is that OMPI intercepts all the stdin, stdout, and stderr from each MPI process and relays it back up to mpirun through our IOF subsystem (IOF = I/O forwarding).
> 
> Consider: users launch N processes (potentially on multiple different servers) via
> 
>    mpirun --hostfile hosts -np N my_mpi_executable
> 
> They also expect to be able to use standard shell redirection via the mpirun command.  For example:
> 
>    mpirun --hostfile hosts -np N my_mpi_executable |& tee out.txt
> 
> To explain what happens, we have to explain a little of how OMPI launches processes. Let's take the ssh case, for simplicity (there are other mechanisms it can use to launch on remote servers, but for the purposes of this discussion, they're basically variants of what happens with ssh).
> 
> 1. mpirun parses the hosts hostfile and extracts the list of servers on which to launch.
> 2. mpirun fork/execs an ssh command to each remote node, which launches the Open MPI helper daemon "orted" there.
> 3. The orted starts on the remote server, does some housekeeping, and eventually receives the launch command from mpirun.
> 4. The launch command contains the executable and argv to fork/exec, and how many copies to launch.
> 5. For example: mpirun --hostfile hosts -np 4 my_mpi_executable.  If the "hosts" file contains serverA and serverB, then mpirun would launch 2 ssh's -- one each to serverA and serverB.  After some startup negotiation, mpirun would send a launch command telling the orted on each of serverA and serverB to launch 2 copies of my_mpi_executable.
> 6. For each child that the orted will create, it:
>    - creates (up to) 3 pipes, for: stdin, stdout, stderr
>    - forks
>    - closes stdin, stdout, stderr
>    - dups the pipes into 0, 1, 2
>    - (by default, we actually close stdin on all processes except the first one)
>    - execs my_mpi_executable
> 7. In this way, the orted can intercept the stdout/stderr from the process and send it back to mpirun, which can then write it on its own stdout/stderr.  And therefore shell redirection from mpirun works as expected.
> 8. Similarly, the stdin from mpirun can be sent to any process where we kept stdin open (as mentioned above, by default, this is only the first process).
> 
> In short: the orted acts as a proxy for the stdout and stderr (and potentially stdin) for all launched processes.
> 
>> Would it make sense to close stdout/stderr of each checkpointed process
>> before checkpointing it?
> 
> Maybe...?
> 
> But my gut reaction is that you don't want to because of the "continue" case.  I.e., having the orted go through all the IOF setup again could be a bit tricky...  We didn't need to do this for other checkpointing systems.
> 

Adrian,

Can you show what the process tree looks like, which subtree you dump (and restore), and
where the pipes Jeff mentions sit? Then we can decide how to dump them and how to
recreate them on restore.
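
To make sure we are talking about the same thing, here is how I read step 6 of
Jeff's description. This is only a rough sketch in Python, not Open MPI code: the
names launch_with_forwarding() and relay() are made up for illustration, stdin
handling is left out, and the real orted forwards the data to mpirun instead of
collecting it in memory.

```python
import os
import selectors


def launch_with_forwarding(argv):
    """Fork/exec argv with stdout and stderr replaced by pipes.

    Returns (pid, stdout_read_fd, stderr_read_fd) to the parent.
    """
    out_r, out_w = os.pipe()
    err_r, err_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: put the write ends of the pipes on fds 1 and 2, then exec.
        try:
            os.dup2(out_w, 1)
            os.dup2(err_w, 2)
            for fd in (out_r, out_w, err_r, err_w):
                if fd > 2:
                    os.close(fd)
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed.
            os._exit(127)
    # Parent: keep only the read ends.
    os.close(out_w)
    os.close(err_w)
    return pid, out_r, err_r


def relay(pid, out_r, err_r):
    """Drain both pipes until EOF; return (stdout, stderr, exit_code)."""
    sel = selectors.DefaultSelector()
    bufs = {out_r: bytearray(), err_r: bytearray()}
    for fd in bufs:
        sel.register(fd, selectors.EVENT_READ)
    remaining = len(bufs)
    while remaining:
        for key, _ in sel.select():
            data = os.read(key.fd, 4096)
            if data:
                bufs[key.fd] += data
            else:
                # EOF: the child closed its end of this pipe.
                sel.unregister(key.fd)
                os.close(key.fd)
                remaining -= 1
    sel.close()
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return bytes(bufs[out_r]), bytes(bufs[err_r]), code


if __name__ == "__main__":
    pid, out_fd, err_fd = launch_with_forwarding(
        ["sh", "-c", "echo hello; echo oops >&2"])
    print(relay(pid, out_fd, err_fd))
```

If that matches what the orted does, then each MPI process has the write ends of
two pipes on fds 1 and 2, and the read ends live in the orted, outside the dumped
subtree.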

I had the impression that you dump the fork()-ed process, and that it should have pipes
as its stdio descriptors, right?
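
One quick way to check this on a running process is to look at /proc/<pid>/fd.
The helper below is just an illustrative sketch (describe_fds() is a made-up name,
and it is Linux-only since it relies on /proc). It reports whether each descriptor
is a pipe and what the link points at; both ends of the same pipe show the same
"pipe:[inode]" target, so comparing the output for the MPI process and for the
orted should tell us which descriptors are connected.

```python
import os
import stat


def describe_fds(pid="self", fds=(0, 1, 2)):
    """Return {fd: (kind, target)} for the given fds of a process (Linux /proc)."""
    result = {}
    for fd in fds:
        path = "/proc/{}/fd/{}".format(pid, fd)
        try:
            target = os.readlink(path)    # e.g. "pipe:[12345]" or "/dev/pts/3"
            mode = os.stat(path).st_mode  # stat follows the link to the open file
        except OSError:
            # fd not open, process gone, or no permission to look.
            result[fd] = ("unavailable", None)
            continue
        kind = "pipe" if stat.S_ISFIFO(mode) else "other"
        result[fd] = (kind, target)
    return result


if __name__ == "__main__":
    for fd, (kind, target) in describe_fds().items():
        print(fd, kind, target)
```

Running it with the pid of one of the MPI processes, and then with the pid of its
orted and the orted's descriptor numbers, would give the picture I am asking for.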

Thanks,
Pavel