[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Joseph Ruscio
jruscio at evergrid.com
Wed Aug 6 08:41:10 PDT 2008
On Aug 5, 2008, at 9:20 AM, Oren Laadan wrote:
>
>
> Louis Rilling wrote:
>> On Mon, Aug 04, 2008 at 08:51:37PM -0700, Joseph Ruscio wrote:
>>> As somewhat of a tangent to this discussion, I've been giving some
>>> thought to the general strategy we talked about during the summit. The
>>> checkpointing solution we built at Evergrid sits completely in userspace
>>> and is solely focused on checkpointing parallel codes (e.g. MPI). That
>>> approach required us to virtualize a whole slew of resources (e.g. PIDs)
>>> that will be far better supported in the kernel through this effort. On
>>> the other hand, there isn't anything inherent to checkpointing the memory
>>> in a process that requires it to be in a kernel. During a restart, you
>>> can map and load the memory from the checkpoint file in userspace as
>>> easily as in the kernel. Since the cost of checkpointing HPC codes is
>>
>> Hmm, for unusual mappings this may not be so easy to reproduce from
>> userspace if binaries are statically linked. I agree that with
>> dynamically linked applications, LD_PRELOAD allows one to record the
>> actual memory mappings and restore them at restart.
>
> I second that: unusual mappings can be hard to reproduce.
>
> Besides, several important optimizations are difficult to do in
> user-space, if at all possible:
>
> * detecting sharing (unless the application itself gives the OS
> advice; more on this below); in the kernel, this is detected easily
> using the inode that represents a shared memory region in SHMFS
>
> * detecting (and restoring) COW sharing: process A forks process B, so
> at least initially the private memory of both is the same via COW; this
> can be optimized to save the memory of only one instead of both, and
> restore this COW relationship on restart.
Both of these are possible from userspace, but admittedly more
complicated. I also agree that handling statically linked binaries is
not really feasible in user-space.
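
As a rough illustration of the first case only (explicitly shared mappings), the sketch below groups mappings by device and inode using text in the /proc/<pid>/maps format. It is not the Evergrid implementation; the function names `parse_maps` and `shared_regions` are made up for this example, and it does not detect COW sharing after fork(), which is not visible in the maps file.

```python
from collections import defaultdict


def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        addr, perms, offset, dev, inode = fields[:5]
        start, end = (int(x, 16) for x in addr.split("-"))
        entries.append({
            "start": start,
            "end": end,
            "perms": perms,
            "offset": int(offset, 16),
            "dev": dev,
            "inode": int(inode),
            "path": fields[5].strip() if len(fields) > 5 else "",
        })
    return entries


def shared_regions(maps_by_pid):
    """Return {(dev, inode): [pids]} for shared mappings seen in more than one process."""
    owners = defaultdict(set)
    for pid, text in maps_by_pid.items():
        for entry in parse_maps(text):
            # The fourth permission character is 's' for shared, 'p' for private.
            if entry["perms"][3] == "s" and entry["inode"] != 0:
                owners[(entry["dev"], entry["inode"])].add(pid)
    return {key: sorted(pids) for key, pids in owners.items() if len(pids) > 1}
```

In practice the text for each PID would be read from /proc/<pid>/maps while the processes are stopped, so that the mappings cannot change between reads.
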
> * reducing checkpoint downtime using the COW technique that I described
> at the summit: when processes are frozen, mark all dirty pages COW and
> keep a reference, and write-back the contents only after the container
> is unfrozen.
Our user-space implementation already has a complete concurrent (i.e.
COW) checkpointing implementation where the "freeze" period lasts only
the length of time it takes to mprotect() the allocated memory
regions. So I don't necessarily agree that these optimizations require
kernel access.
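
To make the "freeze" step concrete, here is a minimal sketch that write-protects a set of anonymous regions with mprotect() through ctypes and later restores write access. It only shows the protect/unprotect calls; a real concurrent checkpointer would also need a SIGSEGV handler (written in C) that copies a page aside before unprotecting it, which is omitted here. The helper names `set_protection`, `freeze` and `thaw` are assumptions for this example, not part of any existing tool.

```python
import ctypes
import mmap
import os

# Load the C library of the running process to reach mprotect().
libc = ctypes.CDLL(None, use_errno=True)
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.mprotect.restype = ctypes.c_int


def set_protection(region, prot):
    """Apply mprotect() with the given protection to a whole mmap object."""
    view = ctypes.c_char.from_buffer(region)
    try:
        addr = ctypes.addressof(view)
        if libc.mprotect(addr, len(region), prot) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        # Release the exported buffer so the mmap can be closed later.
        del view


def freeze(regions):
    """Write-protect every region; writes would now fault until thaw()."""
    for region in regions:
        set_protection(region, mmap.PROT_READ)


def thaw(regions):
    """Restore read/write access to every region."""
    for region in regions:
        set_protection(region, mmap.PROT_READ | mmap.PROT_WRITE)
```

Between freeze() and thaw() the regions can still be read, so a background writer could stream them to the checkpoint file; nothing in this sketch may write to them in that window, since there is no fault handler.
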
> Eh... and, yes, live migration :)
User-space live migration of a "batch" process, e.g. one taking part
in an MPI job, is quite trivial. User-space live migration of something
like a database is not that hard, assuming you have a cooperative load
balancer or proxy on the front end.

I'm not advocating for implementing this in user-space. I am in
complete agreement that this effort should result in code that
completely checkpoints a Container in the kernel. My question was
whether there are situations where it would be advantageous for
user-space to have the option of instructing/hinting the kernel to
ignore certain resources that user-space would handle itself. Most of
the use-cases I'm thinking of come from the different styles of
implementations I've seen in the HPC space, where our implementation
(and a lot of others) is focused.

MPI codes require coordination between all the different processes
taking part to ensure that the checkpoints are globally consistent.
MPI implementations that run on hardware such as InfiniBand would most
likely want the container checkpointing to ignore all of the pinned
memory associated with the RDMA operations so that the coordination
and recreation of MPI communicator state could be handled in
user-space. When working with inflexible process checkpointers, MPI
coordination routines often must completely tear down all communicator
state prior to invoking the checkpoint, and then recreate all the
communicators after the checkpoint. On very large scale jobs, this is
expensive.

As another example, HPC applications can create local scratch files of
several GB in /tmp. It may not be necessary to migrate these files,
but if user-space has no way to mark a particular file, "local files",
or files in general as being ignored, then we'll have to copy these
during a migration or a checkpoint.
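
One way such a hint could look, purely as an illustration: user-space hands the checkpointer a list of glob patterns, and any open file matching a pattern is skipped. No such interface exists in the patches under discussion; the function name `files_to_save` and the pattern list are hypothetical.

```python
import fnmatch


def files_to_save(open_files, ignore_patterns):
    """Return the paths that do not match any user-supplied ignore pattern."""
    return [
        path
        for path in open_files
        if not any(fnmatch.fnmatchcase(path, pat) for pat in ignore_patterns)
    ]


if __name__ == "__main__":
    # Hypothetical file list for one process in the container.
    paths = ["/tmp/scratch.0", "/home/user/results.dat"]
    print(files_to_save(paths, ["/tmp/*"]))
```

Note that fnmatch patterns let `*` match `/` as well, so "/tmp/*" also covers files in subdirectories of /tmp.
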

I don't suppose anyone is attending LinuxWorld in San Francisco this
week? I'd be more than happy to grab a coffee and talk about some of
this. I stopped by the OpenVZ booth but none of the devs are around.

thanks,
Joe