[Devel] Re: [BIG RFC] Filesystem-based checkpoint

Eric W. Biederman ebiederm at xmission.com
Thu Oct 30 20:12:28 PDT 2008


Dave Hansen <dave at linux.vnet.ibm.com> writes:

> On Thu, 2008-10-30 at 16:33 -0700, Eric W. Biederman wrote:

>> If you called it a core dump instead of a checkpoint you have exactly the
>> same set of issues.
>
> I completely agree with you that there's a lot of common ground here
> between coredumps and checkpoints.  I'm not aware of any applications
> like, let's say Oracle, that use coredumps in the process of normal
> execution.  Checkpoints must be more scalable and lower overhead than
> coredumps are.

Checkpoints certainly need to be as lightweight as we can make them.

Checkpoints as a backup of where you are in case the machine crashes,
I'm not certain I believe in.  Checkpoints for saving state over
a kernel upgrade or for migrating to a different machine make a lot
of sense to me.

>> Why we are doing vfs_write instead of file->f_op->write I don't understand.
>
> That's an excellent question.  I assume you're asking because at least
> the elf core dump code uses it, right?

Yes.

>> > System calls in Linux are fast.  Doing lots of them is not a problem.
>> > If it becomes one, we can always export a condensed version of this
>> > format next to the expanded one, kinda like ftrace does.  Atomicity with
>> > this approach is also not a problem.  The system call in this approach
>> > doesn't return until the checkpoint is completely written out.
>> 
>> Extra copies for something (memory) you want to transfer quickly
>> and efficiently is a problem.
>
> That's definitely true.  But, as I said, this approach isn't bound to
> copying everything.  We have the flexibility to choose what we do.

With a file descriptor I can push the data onto a network socket while
the receiving process is on another computer: 0 copies, 0 trips
to user space.  I'm not certain how you would achieve that with a
filesystem approach.

>> Reading the memory of another process is a problem, to the point
>> that the /proc/<pid>/mem interface has been removed from the kernel.
>
> Yes, this is certainly true.  All of the ptrace-related security issues
> surely tell us something.  But, I'm not sure of your point here.  Are
> you saying that using sys_checkpoint() to dump a process's pages is
> inherently safer than an approach that uses a filesystem in order to do the
> same?

I'm saying inspecting another process is a very racy operation, so it is
something we need to be especially careful with.

>> > Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
>> > store it in-kernel.  Dump that out when the filesystem is accessed.
>> > Destroy it when userspace asks.
>> 
>> > So, why not?
>> 
>> Besides the part of creating a bunch of questionable interfaces
>> that we need to support forever.
>> 
>> Ultimately the question is how do you do checkpoint restore and I just
>> don't see that happening with a filesystem interface.  Way way way too many
>> dangerous syscalls that are only needed for one thing.
>
> I completely understand what you're saying here.  But, could you
> distinguish how this differs from the current way that sys_checkpoint()
> does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
> with many, many constituent structures.  This is an ABI with many, many,
> ways of reading simple data.  Seems like just slicing up the problem
> differently to me.

I was thinking about restore.  Creating objects with a certain id can
easily be a security risk if you are not creating the namespace those
objects live in at the same time.  There is currently the downside
that we can't create namespaces as unprivileged users (the
implementation of suid is so annoying).  But the general concept still
applies, and if we ever get the uid namespace correct we will be able
to create namespaces as unprivileged users.

Eric

_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers