[CRIU] CRIU <---> OpenMPI

Wed Nov 6 07:55:57 PST 2013

On Nov 6, 2013, at 1:16 AM, Pavel Emelyanov <xemul at parallels.com> wrote:

>> Can I ask who you talked to? 
> 
> Sure. These are guys from CompCenter. I've added Denis in Cc.

Greetings Denis.  Can you tell me the status of what you're working on w.r.t. CRIU and Open MPI?

>> I forget what version introduced general checkpoint/restart support for parallel Open MPI jobs, but 
>> it's been available for quite a while.  
> 
> Does this support include handling the hardware state you've mentioned above?

Yes.  What we actually do for those cases is shut down the network support before the checkpoint (since there's no way for any software to capture the hardware state).  For example, with InfiniBand networks, we drain all network links, shut down all IB QPs, release all registered memory, etc.  When we're done, there are no kernel or hardware IB resources being used by the process, and therefore the process is checkpointable.

When the process resumes or restarts after checkpoint, all the IB state is built up again from scratch.
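
The ordering described above (quiesce the network, checkpoint, rebuild) can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct transport`, `transport_quiesce`, `transport_rebuild`, `checkpoint_with_quiesce` and `demo_checkpoint` are hypothetical names invented for this sketch, not Open MPI functions, and the "drain" here is simulated by resetting a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one network transport (e.g. an IB or TCP module). */
struct transport {
    const char *name;
    int active;    /* 1 while kernel/hardware resources are held */
    int inflight;  /* messages sent but not yet delivered */
};

struct transport_set {
    struct transport *items;
    size_t count;
};

typedef int (*checkpoint_fn)(void *ctx);

static void transport_quiesce(struct transport *t)
{
    t->inflight = 0;  /* simulated drain: nothing left in flight */
    t->active = 0;    /* then release every network resource */
}

static void transport_rebuild(struct transport *t)
{
    t->active = 1;    /* re-create all state from scratch */
}

/* Quiesce every transport, run the checkpointer, then rebuild. */
int checkpoint_with_quiesce(struct transport_set *set,
                            checkpoint_fn do_checkpoint, void *ctx)
{
    size_t i;
    int rc;

    assert(set != NULL && do_checkpoint != NULL);
    for (i = 0; i < set->count; i++)
        transport_quiesce(&set->items[i]);
    rc = do_checkpoint(ctx);
    for (i = 0; i < set->count; i++)
        transport_rebuild(&set->items[i]);
    return rc;
}

/* Demo checkpointer: succeeds only if every transport is fully quiesced. */
int demo_checkpoint(void *ctx)
{
    const struct transport_set *set = ctx;
    size_t i;

    for (i = 0; i < set->count; i++)
        if (set->items[i].active || set->items[i].inflight)
            return -1;
    return 0;
}
```

The point of the sketch is the ordering: the checkpointer only ever sees a process that holds no network state, and the rebuild runs whether the process continues or is restarted from the image.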

> Is there any requirement for how the CRIU's API should look like to make this work smoothly?
> Right now CRIU supports two APIs -- CLI and RPC service. Would any of that be suitable?

Some type of C library interface would be preferable -- i.e., some library/header file OMPI's configure script can look for; if it is found, OMPI will build the CRIU plugin.  If not found, the CRIU plugin will be skipped (this is how the vast majority of our plugins are built).  

How the library works under the covers doesn't matter (too) much, but two things would be nice:

1. Ability to query whether CRIU support is available in the current process (e.g., have the ability to ask "if I want to checkpoint later, can I?" -- i.e., if CRIU support is present and enabled in the kernel, etc.).

2. Not needing to fork/exec to run a CLI command would be desirable.
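
To make the two wishes above concrete, here is a rough sketch of what such an interface could look like from the plugin's side. Everything in it is an assumption made for illustration: `struct ckpt_backend`, `crs_plugin_init`, `crs_plugin_checkpoint` and the stub functions are hypothetical names, not an existing CRIU or Open MPI API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical library interface: both calls run in-process, no fork/exec. */
struct ckpt_backend {
    int (*check)(void);                    /* 0 if checkpointing can work here */
    int (*dump)(int pid, const char *dir); /* 0 on success */
};

/* Hypothetical plugin state on the MPI side. */
struct crs_plugin {
    const struct ckpt_backend *backend;
    int enabled;
};

/* Ask up front: "if I want to checkpoint later, can I?" */
int crs_plugin_init(struct crs_plugin *p, const struct ckpt_backend *backend)
{
    assert(p != NULL);
    p->backend = backend;
    p->enabled = backend != NULL && backend->check != NULL &&
                 backend->dump != NULL && backend->check() == 0;
    return p->enabled ? 0 : -1;
}

/* Checkpoint through a direct library call; refuse if support is absent. */
int crs_plugin_checkpoint(const struct crs_plugin *p, int pid, const char *dir)
{
    assert(p != NULL);
    if (!p->enabled)
        return -1;
    return p->backend->dump(pid, dir);
}

/* Stub backends standing in for a real library, for demonstration only. */
static int stub_check_ok(void) { return 0; }
static int stub_check_missing(void) { return -1; }
static int stub_dump(int pid, const char *dir)
{
    return (pid > 0 && dir != NULL) ? 0 : -1;
}

const struct ckpt_backend stub_available = { stub_check_ok, stub_dump };
const struct ckpt_backend stub_unavailable = { stub_check_missing, stub_dump };
```

With an interface of this shape, the plugin can disable itself cleanly at startup when kernel support is missing, rather than discovering the problem at the first checkpoint request.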

>> To be clear: all the other infrastructure for saving and restoring the MPI state is already provided 
>> by Open MPI (even for hardware-offload networks).  That infrastructure basically calls the CRS plugin 
>> to actually do the checkpoint, come back from the restore, ...and a few other miscellaneous things.
> 
> What's the way applications talk to the MPI hardware? I mean -- when we'll try to checkpoint
> a process using CRIU, we can meet something, held by this task, that is unsupported by CRIU.
> E.g. -- socket of unknown family, file descriptor for unknown device, memory mapping of 
> unsupported file, etc.

We try to close or shut down all access to things that are difficult/impossible to checkpoint.  Even TCP sockets -- we'll drain them so that there's nothing in flight (since in-flight data obviously can't be checkpointed).
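
As one simple illustration of getting a stream socket to a "nothing in flight" state, the sketch below half-closes the socket and reads until the peer's end-of-file. It uses only standard POSIX calls, but it is not how Open MPI's TCP support is implemented: `drain_stream_socket` is a hypothetical helper, it assumes the peer also shuts down its write side, and a real middleware would deliver the drained bytes to the upper layer instead of just counting them.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Half-close fd, then read until the peer signals EOF.
 * Returns the number of bytes that were still in flight, or -1 on error.
 */
long drain_stream_socket(int fd)
{
    char buf[4096];
    long total = 0;

    assert(fd >= 0);

    /* Tell the peer we will send nothing more. */
    if (shutdown(fd, SHUT_WR) != 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0) {
            total += n;      /* data that was in flight */
            continue;
        }
        if (n == 0)
            return total;    /* peer closed its side: link is empty */
        if (errno == EINTR)
            continue;        /* interrupted by a signal, retry */
        return -1;
    }
}
```

Once both sides have done this, the descriptor can be closed and the connection re-established after the checkpoint, in the same spirit as the IB teardown.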

This all applies to the MPI middleware layer, however.  If the application has something uncheckpointable open, we can't do much about that.  :-\

-- 
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/