[CRIU] CRIU <---> OpenMPI
Pavel Emelyanov
xemul at parallels.com
Wed Nov 6 01:16:48 PST 2013
On 11/05/2013 02:16 AM, Jeff Squyres (jsquyres) wrote:
>> From: Pavel Emelyanov [mailto:xemul at parallels.com]
>>
>>> "Support one of MPI implementations - worth starting with OpenMPI"
>>> Is anyone actively working on this?
>>
>> Right now -- no :( At least in Parallels.
>
> Greetings. I'm one of the developers of Open MPI.
Cool! Nice to meet you :)
> FWIW: a user asked about CRIU integration on the Open MPI mailing lists a
> while ago, but I am not aware of anyone working on it.
Yes, I saw that some time ago.
>> I've only heard from several people who work with OpenMPI that they plan to participate in CRIU development on this, but nothing more.
>
> Can I ask who you talked to?
Sure. These are guys from CompCenter. I've added Denis in Cc.
> As I said, I'm unaware of anyone working on CRIU integration with Open MPI, but my knowledge is certainly not absolute. :-)
>
>>> In the recent LinuxCon/Plumbers conferences Pavel mentioned a few
>>> times the HPC periodic snapshot use case. Is this use case related to
>>> the above TODO item?
>>
>> As I see it -- yes. When we do a periodic snapshot we would have to take the MPI context with us. But how that should look is yet to be found out.
>
> Let me give a little background for checkpoint/restart support in Open MPI...
>
> Capturing the MPI state is pretty difficult, especially in the presence of hardware-offload networks.
> Meaning: the state is not completely discoverable in software; there's a non-trivial amount of state
> in hardware, too. (...skipping a much larger discussion about this kind of stuff; I can explain if
> anyone cares...)
>
> I forget what version introduced general checkpoint/restart support for parallel Open MPI jobs, but
> it's been available for quite a while.
Does this support include handling the hardware state you've mentioned above?
> Open MPI is fundamentally based on plugins, and support for the underlying checkpoint service is no
> exception. Open MPI has two existing plugins (with a third close to completion): for the BLCR
> checkpointer (http://crd.lbl.gov/groups-depts/ftg/projects/current-projects/BLCR), "self" (basically
> for applications that can do self-checkpointing), and DMTCP (http://dmtcp.sourceforge.net/).
>
> Adding support for CRIU to Open MPI is hypothetically pretty simple: just add a "crs" plugin for CRIU
> to Open MPI. The API that a CRS plugin has to support is relatively simple -- the main work is
> basically calling the underlying "checkpoint me now" functionality, and then providing a hook for when
> the process is restarted.
Are there any requirements on what CRIU's API should look like to make this work smoothly?
Right now CRIU supports two APIs -- CLI and RPC service. Would either of those be suitable?
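
To make the question concrete, here is a minimal sketch of how a checkpointer plugin could sit on top of the CLI. It is only an illustration under stated assumptions: the class CriuCrsSketch, its checkpoint/restart methods and the on_restart callback are hypothetical names, not Open MPI's real CRS interface. The criu options used (dump, restore, --tree, --images-dir, --leave-running, --restore-detached) are the CLI's own. The restart hook is simplified: here the launcher calls it after criu returns, whereas a real plugin would presumably run it inside the restored process.

```python
import subprocess
from typing import Callable, List, Optional


def criu_dump_cmd(pid: int, images_dir: str, leave_running: bool = True) -> List[str]:
    """Build a `criu dump` command for the process tree rooted at pid."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the job running after the snapshot (periodic-snapshot use case).
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir: str) -> List[str]:
    """Build a `criu restore` command that returns once the tree is restored."""
    return ["criu", "restore", "--images-dir", images_dir, "--restore-detached"]


class CriuCrsSketch:
    """Hypothetical plugin shape: 'checkpoint me now' plus a restart hook."""

    def __init__(
        self,
        images_dir: str,
        on_restart: Optional[Callable[[], None]] = None,
        runner: Callable[[List[str]], int] = subprocess.call,
    ) -> None:
        self.images_dir = images_dir
        self.on_restart = on_restart
        self.runner = runner  # injectable so the sketch can be tried without criu

    def checkpoint(self, pid: int) -> bool:
        # Exit status 0 from the runner is treated as success.
        return self.runner(criu_dump_cmd(pid, self.images_dir)) == 0

    def restart(self) -> bool:
        ok = self.runner(criu_restore_cmd(self.images_dir)) == 0
        if ok and self.on_restart is not None:
            self.on_restart()  # simplified: called from the launcher side
        return ok
```

Passing a fake runner lets the command lines be inspected without root privileges or a criu binary. The same two entry points could be backed by the RPC service instead of the CLI.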
> To be clear: all the other infrastructure for saving and restoring the MPI state is already provided
> by Open MPI (even for hardware-offload networks). That infrastructure basically calls the CRS plugin
> to actually do the checkpoint, come back from the restore, ...and a few other miscellaneous things.
How do applications talk to the MPI hardware? I mean -- when we try to checkpoint
a process using CRIU, we may run into something held by the task that CRIU does not support,
e.g. a socket of an unknown family, a file descriptor for an unknown device, a memory mapping
of an unsupported file, etc.
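
One way to see what an MPI process actually holds is to group its open descriptors by kind before attempting a dump. The sketch below is a rough heuristic, assuming a Linux /proc filesystem; the function names are made up for illustration. It only classifies the /proc/<pid>/fd link targets. It does not know which kinds a given CRIU version can dump, and it cannot tell the socket family from the link alone.

```python
import os
from typing import Dict, List


def classify_fd_target(target: str) -> str:
    """Classify a /proc/<pid>/fd symlink target into a coarse category."""
    if target.startswith("socket:["):
        return "socket"       # family is not visible here; needs a further lookup
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"   # eventfd, epoll, timerfd and similar
    if target.startswith("/dev/shm/"):
        return "file"         # shared-memory files live under /dev but are not devices
    if target.startswith("/dev/"):
        return "device"       # candidates for "file descriptor for unknown device"
    return "file"


def list_open_fds(pid: int) -> Dict[str, List[str]]:
    """Group the open descriptors of pid by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    groups: Dict[str, List[str]] = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        groups.setdefault(classify_fd_target(target), []).append(target)
    return groups
```

Calling list_open_fds(os.getpid()) on a running MPI rank and looking at the "device" and "socket" groups would show which interconnect devices are held. Memory mappings would need a similar pass over /proc/<pid>/maps.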
> All that being said, we did a fairly major architecture revamp in our SVN development trunk (and
> v1.7 release branch) recently for one of Open MPI's main subsystems, and the checkpoint/restart
> infrastructure is currently flat-out broken. :-( It's on the to-do list to fix, but it's going
> to take a little while.
>
> Hence, adding a new CRIU CRS plugin would need to be done on an older version of Open MPI -- the
> 1.6.x series. But the good news is that forward porting a CRIU CRS plugin from v1.6 to Open MPI's
> dev trunk/v1.7 branch is pretty straightforward (I'd even volunteer to help with that).
That's great! I'm ready to help with anything required on CRIU side to make this happen.
Thanks,
Pavel