[Devel] Re: [BIG RFC] Filesystem-based checkpoint

Thu Oct 30 17:09:17 PDT 2008

On Thu, 2008-10-30 at 16:33 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave at linux.vnet.ibm.com> writes:
> > I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
> > I said it.  Here's an alternative.  It still uses the syscall to
> > initiate things, but it uses debugfs to transport the data instead.
> > This is just a concept demonstration.  It doesn't actually work, and I
> > wouldn't be using debugfs in practice.
> 
> A syscall is a very linux-y way to do it.

Darn, I thought I'd be able to sneak that one by.

> If you called it a core dump instead of a checkpoint you have exactly the same set
> of issues.

I completely agree with you that there's a lot of common ground here
between coredumps and checkpoints.  But I'm not aware of any
applications, let's say Oracle, that use coredumps in the course of
normal execution.  Checkpoints must be more scalable and have lower
overhead than coredumps.

> Why we are doing vfs_write instead of file->f_op->write I don't understand.

That's an excellent question.  I assume you're asking because at least
the elf core dump code uses it, right?
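
For context, here is a rough userspace sketch of the distinction being
asked about.  It assumes only the general shape of the two paths: a
vfs_write()-style wrapper validates the file and does some accounting
around the call, while file->f_op->write goes straight to the method.
Every toy_* name below is hypothetical and is not a kernel API; the real
vfs_write() does more than this (user-buffer and area checks, notification).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Toy stand-ins for kernel structures; none of these are real kernel APIs. */
#define TOY_FMODE_WRITE 0x2

struct toy_file;

struct toy_file_operations {
	ssize_t (*write)(struct toy_file *f, const char *buf, size_t len);
};

struct toy_file {
	unsigned int f_mode;                  /* access mode bits */
	const struct toy_file_operations *f_op;
	char data[64];                        /* backing store for the toy file */
	size_t pos;                           /* current write position */
	unsigned long writes_accounted;       /* bumped only by the wrapper */
};

/* A driver-style write method: copies bytes and does no checks of its own. */
static ssize_t toy_mem_write(struct toy_file *f, const char *buf, size_t len)
{
	size_t room = sizeof(f->data) - f->pos;

	if (len > room)
		len = room;
	memcpy(f->data + f->pos, buf, len);
	f->pos += len;
	return (ssize_t)len;
}

static const struct toy_file_operations toy_mem_fops = {
	.write = toy_mem_write,
};

/* Wrapper in the spirit of vfs_write(): validate, dispatch, then account. */
static ssize_t toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
	ssize_t ret;

	if (!(f->f_mode & TOY_FMODE_WRITE))
		return -EBADF;                /* file not opened for writing */
	if (!f->f_op || !f->f_op->write)
		return -EINVAL;               /* no write method to call */

	ret = f->f_op->write(f, buf, len);
	if (ret > 0)
		f->writes_accounted++;        /* stand-in for I/O accounting */
	return ret;
}
```

In this toy model, calling f->f_op->write directly on a file that was not
opened for writing succeeds and is never accounted, while toy_vfs_write()
refuses it with -EBADF.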

> > System calls in Linux are fast.  Doing lots of them is not a problem.
> > If it becomes one, we can always export a condensed version of this
> > format next to the expanded one, kinda like ftrace does.  Atomicity with
> > this approach is also not a problem.  The system call in this approach
> > doesn't return until the checkpoint is completely written out.
> 
> Extra copies for something (memory) you want to transfer quickly
> and efficiently is a problem.

That's definitely true.  But, as I said, this approach isn't bound to
copying everything.  We have the flexibility to choose what we do.

> Reading the memory of another process is a problem, to the point
> that the /proc/<pid>/mem interface has been removed from the kernel.

Yes, this is certainly true.  All of the ptrace-related security issues
surely tell us something.  But, I'm not sure of your point here.  Are
you saying that using sys_checkpoint() to dump a process's pages is
inherently safer than an approach that uses a filesystem in order to do the
same?

> > Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
> > store it in-kernel.  Dump that out when the filesystem is accessed.
> > Destroy it when userspace asks.
> 
> > So, why not?
> 
> Besides the part of creating a bunch of questionable interfaces
> that we need to support forever.
> 
> Ultimately the question is how do you do checkpoint restore and I just
> don't see that happening with a filesystem interface.  Way way way too many
> dangerous syscalls that are only needed for one thing.

I completely understand what you're saying here.  But, could you
distinguish how this differs from the current way that sys_checkpoint()
does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
with many, many constituent structures.  The filesystem approach is an
ABI with many, many ways of reading simple data.  Seems like just
slicing up the problem differently to me.

-- Dave
