[CRIU] CRIU <---> OpenMPI

Pavel Emelyanov xemul at parallels.com
Wed Nov 6 08:57:43 PST 2013


On 11/06/2013 07:55 PM, Jeff Squyres (jsquyres) wrote:
> On Nov 6, 2013, at 1:16 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> 
>>> Can I ask who you talked to? 
>>
>> Sure. These are guys from CompCenter. I've added Denis in Cc.
> 
> Greetings Denis.  Can you tell me the status of what you're working on w.r.t. CRIU and Open MPI?
> 
>>> I forget what version introduced general checkpoint/restart support for parallel Open MPI jobs, but 
>>> it's been available for quite a while.  
>>
>> Does this support include handling the hardware state you've mentioned above?
> 
> Yes.  What we actually do for those cases is shut down the network support (since there's no way for any software to capture the hardware state) before the checkpoint.  For example, with InfiniBand networks, we drain all network links, shut down all IB QPs, release all registered memory, etc.  When we're done, there are no kernel or hardware IB resources being used by the process, and therefore the process is checkpointable.
> 
> When the process resumes or restarts after checkpoint, all the IB state is built up again from scratch.

OK.
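
The sequence Jeff describes (quiesce the network, checkpoint, rebuild from scratch) can be sketched roughly as below. This is only an illustration under stated assumptions: `struct net_state`, `net_prepare_checkpoint`, `net_rebuild`, `checkpoint_with_quiesce` and `fake_checkpoint` are hypothetical names, not Open MPI or CRIU interfaces, and the "network" is just a few counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-process network state; not an Open MPI type. */
struct net_state {
    int    in_flight;   /* messages not yet delivered */
    int    open_qps;    /* open IB queue pairs */
    size_t registered;  /* bytes of registered (pinned) memory */
};

/* Before the checkpoint: drain links, close QPs, release registered memory. */
static void net_prepare_checkpoint(struct net_state *n)
{
    n->in_flight  = 0;
    n->open_qps   = 0;
    n->registered = 0;
}

/* The process is checkpointable once no network resources remain. */
static bool net_is_checkpointable(const struct net_state *n)
{
    return n->in_flight == 0 && n->open_qps == 0 && n->registered == 0;
}

/* After resume or restart: build the network state up again from scratch. */
static void net_rebuild(struct net_state *n, int qps, size_t bytes)
{
    n->open_qps   = qps;
    n->registered = bytes;
}

/* Hypothetical checkpointer callback; returns 0 on success. */
typedef int (*checkpoint_fn)(void *arg);

static int checkpoint_with_quiesce(struct net_state *n,
                                   checkpoint_fn do_checkpoint, void *arg)
{
    int    want_qps   = n->open_qps;    /* remember what to rebuild */
    size_t want_bytes = n->registered;

    net_prepare_checkpoint(n);
    assert(net_is_checkpointable(n));

    int rc = do_checkpoint(arg);

    net_rebuild(n, want_qps, want_bytes);
    return rc;
}

/* Demo checkpointer: succeeds only if the network state was quiesced. */
static int fake_checkpoint(void *arg)
{
    const struct net_state *n = arg;
    return net_is_checkpointable(n) ? 0 : -1;
}
```

The point of the ordering is that the checkpointer never sees any hardware-backed resources; whatever it cannot capture has already been released by the middleware.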

>> Are there any requirements for what CRIU's API should look like to make this work smoothly?
>> Right now CRIU supports two APIs -- CLI and RPC service. Would either of them be suitable?
> 
> Some type of C library interface would be preferable -- i.e., some library/header file OMPI's configure script 
> can look for; if it is found, OMPI will build the CRIU plugin.  If not found, the CRIU plugin will be skipped
> (this is how the vast majority of our plugins are built).  

I see. 

Well, making the whole of CRIU available as a library won't work. That would limit its use
to programs running as root and written in C/C++. Would a library that is just a wrapper
around the RPC client be suitable? The libvirt people are also asking for a library, and I'm
now trying to figure out whether such a library would suit them too.

> How the library works under the covers doesn't matter (too) much, but two things would be nice:
> 
> 1. Ability to query whether CRIU support is available in the current process (e.g., have the ability to
>  ask "if I want to checkpoint later, can I?" -- i.e., if CRIU support is present and enabled in the kernel, etc.).

Hm... But a "yes" answer at any given time wouldn't guarantee that the task is still
checkpointable some time later. Is that OK?
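
To illustrate the concern: a query can only report the state at the moment it is asked, so the dump itself still has to re-check and be allowed to fail. The sketch below is a toy model under assumptions; `struct task_model`, `can_dump_now` and `dump_task` are hypothetical names and do not reflect how CRIU inspects a task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; a real checkpointer inspects kernel state. */
struct task_model {
    int unsupported_fds;  /* descriptors the checkpointer cannot handle */
};

/* Advisory query: "yes" only means the task is dumpable right now. */
static bool can_dump_now(const struct task_model *t)
{
    assert(t != NULL);
    return t->unsupported_fds == 0;
}

/* The dump re-validates; an earlier "yes" from can_dump_now() is not trusted.
 * Returns 0 on success, -1 if the task holds something unsupported. */
static int dump_task(const struct task_model *t)
{
    assert(t != NULL);
    if (t->unsupported_fds != 0)
        return -1;
    return 0;
}
```

A caller would therefore treat the query as a hint (for example, to decide whether to enable checkpointing at all) and still handle an error from the dump.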

> 2. Not needing to fork/exec to run a CLI command would be desirable.

A wrapper over the RPC client would meet that requirement.

>>> To be clear: all the other infrastructure for saving and restoring the MPI state is already provided 
>>> by Open MPI (even for hardware-offload networks).  That infrastructure basically calls the CRS plugin 
>>> to actually do the checkpoint, come back from the restore, ...and a few other miscellaneous things.
>>
>> How do applications talk to the MPI hardware? I mean -- when we try to checkpoint
>> a process using CRIU, we may run into something held by the task that CRIU does not support,
>> e.g. a socket of an unknown family, a file descriptor for an unknown device, a memory mapping
>> of an unsupported file, etc.
> 
> We try to close / shut down all access to things that are difficult/impossible to checkpoint.  Even TCP sockets -- we'll drain them so that there's nothing in flight (which obviously can't be checkpointed).
> 
> This all applies to the MPI middleware layer, however.  If the application has something uncheckpointable open, we can't do much about that.  :-\

Cool. So, provided the above suggestion about an RPC client wrapper is OK, on the CRIU side
we'd need two things:

1. the library itself
2. ability to check "can I dump $pid"
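
As a rough sketch of what these two items could look like from the caller's side, assuming a thin C wrapper that forwards requests to the RPC service instead of fork/exec-ing the CLI. Every name here (`struct client`, `client_can_dump`, `client_dump`, the request types, `fake_service`) is hypothetical, and CRIU's actual RPC messages are not reproduced; the transport is a callback so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request types; not CRIU's real RPC protocol. */
enum req_type { REQ_CHECK = 1, REQ_DUMP = 2 };

struct request  { enum req_type type; int pid; };
struct response { int success; };

/* Transport callback: delivers one request to the service and fills in the
 * reply. A real wrapper would talk to the RPC service over a socket here. */
typedef int (*transport_fn)(const struct request *req,
                            struct response *resp, void *ctx);

struct client {
    transport_fn send;
    void        *ctx;
};

/* Send one request; 0 if the service reports success, -1 otherwise. */
static int client_call(struct client *c, enum req_type type, int pid)
{
    assert(c != NULL && c->send != NULL);

    struct request  req  = { type, pid };
    struct response resp = { 0 };

    if (c->send(&req, &resp, c->ctx) != 0)
        return -1;                    /* transport failure */
    return resp.success ? 0 : -1;
}

/* Item 2: "can I dump $pid" -- returns 1 for yes, 0 for no. */
static int client_can_dump(struct client *c, int pid)
{
    return client_call(c, REQ_CHECK, pid) == 0;
}

/* Item 1: the library entry point that requests the dump itself. */
static int client_dump(struct client *c, int pid)
{
    return client_call(c, REQ_DUMP, pid);
}

/* Fake in-process service for demonstration only: accepts any pid > 0. */
static int fake_service(const struct request *req,
                        struct response *resp, void *ctx)
{
    (void)ctx;
    resp->success = req->pid > 0;
    return 0;
}
```

With a header and library of this shape, a configure script has something to look for, and the plugin can call `client_can_dump()` early and `client_dump()` at checkpoint time without spawning a process.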

Thanks,
Pavel

