[Devel] Re: [BIG RFC] Filesystem-based checkpoint

Fri Oct 31 07:21:42 PDT 2008

On Thu, 2008-10-30 at 20:12 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave at linux.vnet.ibm.com> writes:
> >> > System calls in Linux are fast.  Doing lots of them is not a problem.
> >> > If it becomes one, we can always export a condensed version of this
> >> > format next to the expanded one, kinda like ftrace does.  Atomicity with
> >> > this approach is also not a problem.  The system call in this approach
> >> > doesn't return until the checkpoint is completely written out.
> >> 
> >> Extra copies for something (memory) you want to transfer quickly
> >> and efficiently is a problem.
> >
> > That's definitely true.  But, as I said, this approach isn't bound to
> > copying everything.  We have the flexibility to choose what we do.
> 
> With a file descriptor I can push the data onto a network socket and
> the receiving process is on another computer.  0 copies, 0 trips
> to user space.  I'm not certain how you would achieve that with the
> filesystem approach.

sys_checkpoint() does:
	1. copy from task_struct (or whatever kernel struct) into buffer
	2. run vfs_write() with that buffer and the user fd
	3. fd target reads from that buffer

The fs approach would:
	1. user calls read()
	2. fs fills data directly into the *userspace* buffer
	3. user does sendfile, etc...

See?  sys_checkpoint() *does* a copy.  It just does it into a kernel
buffer.  That's why we need to call vfs_write().
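
To make the fs steps concrete, here is a minimal userspace sketch.  It
assumes the checkpoint data shows up as an ordinary readable file
descriptor; no such filesystem exists yet, and relay_fd() and write_all()
are hypothetical names used only for illustration.  The loop covers
steps 1 and 2 (read() into a userspace buffer) followed by a plain
write() to the destination, which could be a socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Relay everything readable from in_fd to out_fd through one userspace
 * buffer.  in_fd stands in for a (hypothetical) checkpoint file; out_fd
 * could be a network socket.  Returns the byte count, or -1 on error.
 */
static long relay_fd(int in_fd, int out_fd)
{
	char buf[4096];
	long total = 0;

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)	/* EOF: nothing left to read */
			break;
		if (write_all(out_fd, buf, (size_t)n) < 0)
			return -1;
		total += n;
	}
	return total;
}
```

A caller would open the checkpoint file and a connected socket and then
call relay_fd(ckpt_fd, sock_fd).  Whether step 3 could use sendfile() to
skip the userspace buffer depends on what the filesystem supports, so the
sketch sticks to read() and write().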

> I'm saying inspecting another process is a very racy operation so something
> we need to be especially careful with. 

No disagreement from me on that one.  

> >> Ultimately the question is how do you do checkpoint restore and I just
> >> don't see that happening with a filesystem interface.  Way way way too many
> >> dangerous syscalls that are only needed for one thing.
> >
> > I completely understand what you're saying here.  But, could you
> > distinguish how this differs from the current way that sys_checkpoint()
> > does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
> > with many, many constituent structures.  This is an ABI with many, many
> > ways of reading simple data.  Seems like just slicing up the problem
> > differently to me.
> 
> I was thinking about restore.  Creating objects with a certain id can
> easily be a security risk if you are not creating the namespace those
> objects live in at the same time.  There is currently the downside
> that we can't create namespaces as unprivileged users (the
> implementation of suid is so annoying).  But the general concept still
> applies, and if we ever get the uid namespace correct we will be able
> to create namespaces as unprivileged users.

Eric, you were saying that my interface had way too many "dangerous
syscalls".  How does this relate to user namespaces and creating objects
with particular ids?  Surely if the true problem with my suggested
approach has to do with creating empty namespaces, the same problem
exists with the sys_checkpoint() approach.

-- Dave

_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers