[Devel] Re: C/R: File substitution at restart

Louis Rilling Louis.Rilling at kerlabs.com
Thu Sep 9 04:34:11 PDT 2010


On 09/09/10  4:02 -0700, Matt Helsley wrote:
> On Thu, Sep 09, 2010 at 12:37:20PM +0200, Louis Rilling wrote:
> > On 08/09/10 21:06 -0700, Matt Helsley wrote:
> > > On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > > > Quoting Matt Helsley (matthltc at us.ibm.com):
> > > > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > > > I think it can be split into two composable pieces which may also be
> > > > > useful independently.
> > > > > 
> > > > > The first uses the fcntl() interface to add a flag like
> > > > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > > > restart. That way we don't have to specify an fd number and a "source"
> > > > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > > > be opened and dup2'd via userspace. This is useful without the
> > > > > second piece if we want to simply add rather than replace an fd.
> > > > 
> > > > Can you think of any other use for this flag other than restart?
> > > 
> > > <joking>
> > > I can't think of any other uses for O_CLOEXEC.
> > > </joking>
> > > 
> > > Seriously though, restart will be used _much_ less often than exec so yes
> > > it does seem like a waste of a valuable bit and something that wouldn't
> > > quite belong in an fcntl interface.
> > > 
> > > However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> > > Right now restart closes all file descriptors and pays absolutely
> > > no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> > > too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> > > want to keep we do not mark with O_CLOEXEC.
> > 
> > This would also be useful at checkpoint, to tell sys_checkpoint() which fds
> > should be ignored, either because they are not supported or because the
> > application has a better way to deal with them.
> 
> True. Though unlike restart I don't think we can just (ab|re)use O_CLOEXEC
> for that purpose.
> 
> > 
> > > 
> > > 
> > > Here's another idea which I haven't fully thought out yet.
> > > 
> > > We could introduce the concept of object id substitutions in the image.
> > > So the image would look like (going from file pos 0 at the top..):
> > > 
> > > 0 +-------------------------------+
> > >   |                               |
> > >                 .....
> > >   +-------------------------------+
> > >   |     <substitute object>       | <--- object with id == <substitute id>
> > >                 .....
> > >   +---------------+---------------+
> > >   |  <object id>  |<substitute id>|
> > >   +---------------+---------------+
> > >                 .....
> > >   +---------------+---------------+
> > >   |     <object to ignore>        | <-- object with id == <object id>
> > >                 .....
> > > 
> > > (The above is ignoring the ckpt_hdr fields..)
> > > 
> > > When we read the image during restart we use the substitute ids to
> > > create indirect objhash entries. When we encounter an obj id and
> > > it refers to an indirect entry we first parse the object (ignoring
> > > errors and dropping references on new objhash insertions), flip
> > > a bit on the indirect entry (indicating the object has been parsed),
> > > and then lookup the substitute id and return whatever that resolved to.
> > > 
> > > We can ignore the new objhash objects by making the objhash have its
> > > own operation struct. When we're parsing an object that's been
> > > substituted we just temporarily set the objhash add/lookup operations
> > > to something suitable for properly dropping references to the new
> > > object(s). This way we don't have to add checks for this peculiar
> > > need all over the checkpoint/restart code. Sure it'll be slower...
> > 
> > If at checkpoint we can take care to ignore files that we know will be
> > substituted, this should not be that much slower.
> 
> So, would you say typically it's the application developer who knows
> what to ignore? Are we expecting distros/packagers to be able to set
> that up? Admins? These specific optimizations seem like they would be a
> bit fragile unless the application developer is involved.

If you look at OpenMPI's C/R framework, the policy to ignore/substitute fds
(mostly sockets and log files) is programmed in the C/R plugin. In that case,
the middleware knows.

Otherwise, for such optimization cases I expect applications to come with
dedicated C/R helpers. So, yes, the application developer is involved.

However, in some special cases like stdio redirection of containers,
administrators should be able to do it, or even users. Imagine a user feeding
some file to the checkpointed app, and wanting the app to work on a different
file at restart.  With a parsing tool and enough info in the fd table of the
checkpoint image, the user could easily know which fd should be substituted
because its path matches the file that was fed to the app.
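
As a rough illustration of that last case, here is a minimal userspace sketch
in Python. It assumes the restart tool has already parsed the image into a
hypothetical fd-number-to-path table (fd_table), and that restart would honour
the close-on-exec convention proposed earlier in this thread, which is only a
proposal and not existing kernel behaviour. All function names are made up for
the example.

```python
import fcntl
import os


def find_fds_by_path(fd_table, path):
    """Return the fd numbers whose recorded path equals `path`.

    `fd_table` is a hypothetical {fd_number: path} mapping that a parsing
    tool would extract from the checkpoint image.
    """
    return sorted(fd for fd, recorded in fd_table.items() if recorded == path)


def substitute_fd(target_fd, new_path, flags=os.O_RDONLY):
    """Open `new_path` and install it at `target_fd` with close-on-exec clear.

    Under the proposed convention (an assumption), an fd left without
    close-on-exec is one that restart should keep.
    """
    src = os.open(new_path, flags)
    if src == target_fd:
        # Already at the right number; Python opens fds close-on-exec,
        # so clear the flag explicitly.
        os.set_inheritable(target_fd, True)
    else:
        os.dup2(src, target_fd, inheritable=True)
        os.close(src)
    return target_fd


def mark_unwanted_cloexec(keep, candidates):
    """Set FD_CLOEXEC on every open fd in `candidates` that is not in `keep`."""
    for fd in candidates:
        if fd in keep:
            continue
        try:
            fd_flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # fd is not open; nothing to mark
        fcntl.fcntl(fd, fcntl.F_SETFD, fd_flags | fcntl.FD_CLOEXEC)
```

A user would look up the fd with find_fds_by_path(), call substitute_fd() on
each match with the new file, and then call mark_unwanted_cloexec() on the
remaining fds before invoking restart.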

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes