[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Louis Rilling
Louis.Rilling at kerlabs.com
Thu Aug 7 02:25:28 PDT 2008
On Wed, Aug 06, 2008 at 08:41:10AM -0700, Joseph Ruscio wrote:
>
> On Aug 5, 2008, at 9:20 AM, Oren Laadan wrote:
>> Eh... and, yes, live migration :)
>
> User-space live migration of a "batch" process e.g. one taking place in
> an MPI job is quite trivial. User-space live migration of something like
> a database is not that hard assuming you have a cooperative load
> balancer or proxy on the front end.
Hm, this means modifying the MPI run-time, right? Especially those relying on
daemons on each node (like the LAM implementation and, IIRC, the MPI-2 specification).
Anyway, this is probably not an issue, since most high-end HPC systems come with
their own customized MPI implementation.
>
> I'm not advocating for implementing this in user-space. I am in complete
> agreement that this effort should result in code that completely
> checkpoints a Container in the kernel. My question was whether there are
> situations where it would be advantageous for user-space to have the
> option of instructing/hinting the kernel to ignore certain resources that
> it would handle itself. Most of the use-cases I'm thinking of come from
> the different styles of implementations I've seen in the HPC space, where
> our implementation (and a lot of others) are focused.
>
> MPI codes require coordination between all the different processes
> taking part to ensure that the checkpoints are globally consistent. MPI
> implementations that run on hardware such as Infiniband would most
> likely want the container checkpointing to ignore all of the pinned
> memory associated with the RDMA operations so that the coordination and
> recreation of MPI communicator state could be handled in user-space. When
> working with inflexible process checkpointers, MPI coordination routines
> often must completely tear down all communicator state prior to invoking
> the checkpoint, and then recreate all the communicators after the
> checkpoint. On very large scale jobs, this is expensive.
>
> As another example, HPC applications can create local scratch files of
> several GB in /tmp. It may not be necessary to migrate these files, but
> if user-space has no way to mark a particular file, "local files", or
> files in general as being ignored, then we'll have to copy these during a
> migration or a checkpoint.
Definitely agree with you here. This is the kind of use-case we will study in
Kerrighed. (Actually, the project is centered on supporting a petaflop-scale
application, with help from Kerrighed to tolerate failures.)
>
> I don't suppose anyone is attending Linuxworld in San Francisco this
> week? I'd be more than happy to grab a coffee and talk about some of
> this. I stopped by the OpenVZ booth but none of the devs are around.
Not me, sorry :) However, any requirement you can describe is of interest
to us. Such requirements can surely help in designing a most useful
checkpoint/restart mechanism.
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes