[CRIU] CRIU <---> OpenMPI

Wed Nov 6 10:49:05 PST 2013

On Nov 6, 2013, at 8:57 AM, Pavel Emelyanov <xemul at parallels.com> wrote:

>> Some type of C library interface would be preferable -- i.e., some library/header file OMPI's configure script 
>> can look for; if it is found, OMPI will build the CRIU plugin.  If not found, the CRIU plugin will be skipped
>> (this is how the vast majority of our plugins are built).  
> 
> I see. 
> 
> Well, making the whole of CRIU available as a library won't work. This would limit its usage to programs running as root and written in C/C++. Would a library that is just a wrapper around an RPC client be suitable? The libvirt people are also requesting a library; I'm now trying to figure out whether such a library would suit them too.

That's not a problem.

To clarify: I wasn't asking for all of CRIU to be a library -- just userspace access to it.  So if this library uses some type of IPC under the covers (e.g., RPC) to access the real back-end CRIU functionality, that's fine from my perspective.  

Indeed, providing a userspace library to do this stuff lets you hide whatever flavor of IPC you want to use (and even change it someday without creating any backwards compatibility issues).
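
To illustrate the point about hiding the IPC flavor, here is a minimal sketch in C. Every name in it (ckpt_client, ckpt_transport_fn, ckpt_can_checkpoint, ckpt_fake_transport) is hypothetical and invented for illustration; none of it is an actual CRIU interface. The caller only sees a handle and one query call, while the transport sits behind a function pointer and can be swapped without touching the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport hook: any IPC flavor (RPC, socket, pipe) could
 * sit behind this one function pointer. */
typedef int (*ckpt_transport_fn)(const char *request, char *reply, size_t reply_len);

/* Hypothetical client handle; callers never look inside it. */
typedef struct {
    ckpt_transport_fn transport;
} ckpt_client;

/* Stand-in transport for illustration only: answers "ok" with no real IPC. */
static int ckpt_fake_transport(const char *request, char *reply, size_t reply_len)
{
    (void)request;
    if (reply_len < 3)
        return -1;
    memcpy(reply, "ok", 3);   /* copies 'o', 'k' and the terminating NUL */
    return 0;
}

static void ckpt_client_init(ckpt_client *c, ckpt_transport_fn transport)
{
    assert(c != NULL && transport != NULL);
    c->transport = transport;
}

/* Public call: 1 if the back end answered "ok", 0 if it answered anything
 * else, -1 if the transport itself failed. */
static int ckpt_can_checkpoint(const ckpt_client *c)
{
    char reply[16];

    assert(c != NULL);
    if (c->transport("check", reply, sizeof reply) != 0)
        return -1;
    return strcmp(reply, "ok") == 0;
}
```

A caller would initialize the handle once with ckpt_client_init() and then call ckpt_can_checkpoint(); replacing ckpt_fake_transport with a real IPC routine would not change either call.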

>> 1. Ability to query whether CRIU support is available in the current process (e.g., have the ability to
>> ask "if I want to checkpoint later, can I?" -- i.e., if CRIU support is present and enabled in the kernel, etc.).
> 
> Hm... But the answer "yes" at any given time wouldn't guarantee that the task is checkpoint-able
> some time later. Is that OK?

Sounds perfect.  

All we essentially need to know is: is the system set up to enable CRIU checkpointing such that if we invoke a CRIU checkpoint on this process later, it'll *probably* succeed... with all appropriate disclaimers about the app only having "checkpointable" state when the checkpoint is actually invoked, yadda yadda yadda.

The reason we want this is because when an Open MPI job launches, it basically opens all available OMPI plugins and queries them, asking "do you want to run in this job?".  Plugins that provide services that are not enabled on the system will say "no", and then quietly allow themselves to be dlclosed and excluded from the rest of the run.  

Meaning: even if the OMPI CRIU plugin is built/installed, a given MPI process may or may not be running on a CRIU-enabled server.  Hence, the CRIU plugin will query the CRIU library when a job starts to see if it's running on a CRIU-enabled machine.  If CRIU support is not available on this machine, it'll tell the OMPI core "nope, I don't want to run", and then be dlclosed.
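
The open-and-query pattern described above can be sketched roughly as follows. This is a simplified illustration, not Open MPI's real plugin interface: plugin_desc, select_plugins, and the two stub query functions are hypothetical names, and the CRIU stub simply hard-codes the "not available" answer that a real plugin would obtain from the CRIU library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical query hook: "do you want to run in this job?" */
typedef bool (*plugin_query_fn)(void);

/* Hypothetical plugin descriptor (not the real Open MPI structure). */
typedef struct {
    const char     *name;
    plugin_query_fn wants_to_run;
} plugin_desc;

/* Stub: pretends the CRIU library reported that checkpointing is not
 * available on this machine, so the plugin declines. */
static bool criu_plugin_query_stub(void)
{
    return false;
}

/* Stub: some other plugin whose service is present on this machine. */
static bool other_plugin_query_stub(void)
{
    return true;
}

/* Asks every plugin whether it wants to run and records the answers.
 * Returns how many plugins said "yes"; the rest would be closed. */
static size_t select_plugins(const plugin_desc *plugins, size_t n, bool *selected)
{
    size_t kept = 0;

    assert(plugins != NULL && selected != NULL);
    for (size_t i = 0; i < n; i++) {
        selected[i] = plugins[i].wants_to_run();
        if (selected[i])
            kept++;
    }
    return kept;
}
```

In this sketch a plugin that answers "no" is just marked unselected; in the real flow that is the point at which it would be dlclosed and excluded from the rest of the run.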

One sidenote: we do these kinds of queries in *every* MPI process.  The common usage pattern is to run one MPI process per physical core on a server.  High-end Intel Ivy Bridge processors, for example, have 10 cores per socket.  So a user might start 10, 20, or 40 MPI processes on a single server (if they have 1, 2, or 4 sockets, respectively).  My point: the query method should be scalable enough to handle a bunch of simultaneous queries on a single server quickly.  I don't know offhand how RPC to localhost is implemented, but if it's lossy (e.g., RPC requests may get dropped and have to try again), that may or may not be scalable enough for large-core-count servers in the MPI use case.
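
For a rough feel of what "a bunch of simultaneous queries on a single server" looks like over a reliable local transport, here is a small sketch. It assumes a Unix-domain stream socket, which is my assumption only; nothing above says which transport CRIU's RPC uses. The function name run_local_queries and the "chk"/"yes" messages are made up for illustration. Each forked child stands in for one MPI process sending one query; the parent stands in for the service.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CLIENTS 64

/* Forks nclients children that each send one query over a Unix-domain
 * stream socket; the parent answers each one in turn.  Returns the number
 * of children that got a "yes" reply, or -1 if setup fails (a sketch:
 * the error paths do not clean up children that were already started). */
static int run_local_queries(int nclients)
{
    int   fds[MAX_CLIENTS];
    pid_t pids[MAX_CLIENTS];
    int   answered = 0;

    assert(nclients > 0 && nclients <= MAX_CLIENTS);

    for (int i = 0; i < nclients; i++) {
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return -1;
        pids[i] = fork();
        if (pids[i] < 0)
            return -1;
        if (pids[i] == 0) {
            /* Child: one "MPI process" asking whether checkpointing works. */
            char reply[3];

            close(sv[0]);
            if (write(sv[1], "chk", 3) != 3)
                _exit(1);
            if (read(sv[1], reply, 3) != 3)
                _exit(1);
            _exit(memcmp(reply, "yes", 3) == 0 ? 0 : 1);
        }
        close(sv[1]);
        fds[i] = sv[0];
    }

    /* Parent: the "service" answers every pending query, then reaps. */
    for (int i = 0; i < nclients; i++) {
        char req[3];
        int  status;

        if (read(fds[i], req, 3) == 3 && memcmp(req, "chk", 3) == 0) {
            if (write(fds[i], "yes", 3) != 3)
                status = 0;   /* reply failed; the child will exit nonzero */
        }
        close(fds[i]);
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            answered++;
    }
    return answered;
}
```

Calling run_local_queries(40) models the 4-socket case from the paragraph above. Stream sockets do not drop requests, so no retry logic is needed in this sketch; a lossy transport would need retries on top, which is exactly the scalability concern.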

> Cool. So, provided the above suggestion about the RPC client wrapper is OK, on the CRIU side we'd need
> two things:
> 
> 1. the library itself
> 2. ability to check "can I dump $pid"

Yep!

-- 
Jeff Squyres
jsquyres at cisco.com