[CRIU] overmount confusion

Wed Apr 1 07:23:21 PDT 2015

On Wed, Apr 01, 2015 at 10:16:01AM +0300, Pavel Emelyanov wrote:
> 
> >>> In particular the lines:
> >>>
> >>> 122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
> >>> 128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
> >>>
> >>> 123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> >>> 129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> >>>
> >>> are interesting. If I try to dump this container, criu tells me:
> >>>
> >>> (00.003931) Error (mount.c:636): 123:./sys/fs/cgroup/cgmanager is overmounted
> >>
> >> Hm... Looks like yes. It's overmounted by /sys/fs/cgroup itself, isn't it?
> > 
> > I /think/ so :)
> 
> :D
> 
> >>> I patched it to allow overmounts (i.e. skip this warning if a flag is
> >>> passed), but then it fails to open mount 122 with:
> >>>
> >>> (00.139107) Error (mount.c:762): The file system 0x29 (0x2a) tmpfs ./sys/fs/cgroup is inaccessible
> >>>
> >>> so it seems that the current overmount detection code is not
> >>> aggressive enough, since it only checks the sibling mounts instead of
> >>> the whole mount tree.
> >>
> >> I think the code is correct. We look for overmounts on m's parent only,
> >> because if m is overmounted by something higher up, then m's parent
> >> will be overmounted too, and CRIU will detect this when checking m's
> >> parent itself.
> > 
> > But shouldn't it detect the /sys/fs/cgroup case above?
> 
> Well, I believe it does. You get the "is overmounted" message on unmodified
> CRIU sources, don't you?

Only for /sys/fs/cgroup/cgmanager (and I set a flag there to avoid
trying to do any work when dumping it later). For this one it doesn't
detect that it is overmounted, so that flag isn't set and my code
doesn't mount it.
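
For illustration, here is a rough sketch in Python (not CRIU's actual C code)
of the two checks being discussed, run against the four mountinfo lines quoted
above. The names parse_mountinfo, is_under and find_overmounted are made up for
this example. It assumes that a mount is hidden either by a sibling whose
mountpoint is a path prefix of its own, or by a child mounted exactly on its
own mountpoint. It does not handle octal escapes such as \040 in paths.

```python
from collections import namedtuple

# Only the mountinfo fields needed here: mount ID, parent ID, mountpoint.
Mount = namedtuple("Mount", "mnt_id parent_id mountpoint")

SAMPLE = """\
122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
"""


def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo-style text into Mount tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mounts.append(Mount(int(fields[0]), int(fields[1]), fields[4]))
    return mounts


def is_under(path, prefix):
    """True if path equals prefix or lies below it, on component boundaries."""
    if path == prefix:
        return True
    return path.startswith(prefix.rstrip("/") + "/")


def find_overmounted(mounts):
    """Map each hidden mount ID to the ID of a mount that hides it."""
    hidden = {}
    for m in mounts:
        for t in mounts:
            if t.mnt_id == m.mnt_id:
                continue
            if t.parent_id == m.parent_id and is_under(m.mountpoint, t.mountpoint):
                # Sibling check: t sits at or above m's mountpoint.
                hidden[m.mnt_id] = t.mnt_id
            elif t.parent_id == m.mnt_id and t.mountpoint == m.mountpoint:
                # Child check (assumed extension): t covers m's own root.
                hidden[m.mnt_id] = t.mnt_id
    return hidden


if __name__ == "__main__":
    for victim, cover in sorted(find_overmounted(parse_mountinfo(SAMPLE)).items()):
        print(f"{victim} is overmounted by {cover}")
```

With only the sibling check, 123 is reported and 122 is not, which matches the
behaviour described in this thread. The child check is the extra case that
would flag 122.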

> >>> Two questions:
> >>>
> >>> 1. Should the overmount code actually check the whole tree? If so, I
> >>>    can send a patch.
> >>> 2. What can we do in the overmount case? As I understand it, the only
> >>>    answer is "add an --allow-overmounts flag and trust the user to
> >>>    know what she's doing". Is there something better?
> >>
> >> I'm not sure that hoping the user knows what he's doing would work.
> >> Upon restore we will have to do something with overmounted mounts
> >> anyway and ignoring them is probably not an option.
> > 
> > I guess it works for my case (because I know that the stuff was
> > mounted before the container started, so there won't be any open
> > files in the undermount), but yeah, that doesn't work generally.
> > 
> >> We actually have a big issue with overmounts. An open file can also be
> >> overmounted, and we don't dump this case either. Mapped files, cwd-s and
> >> unix sockets also complicate things, especially the sockets. In a
> >> perfect world we would have some way to look up a directory by path
> >> with an ability to "dive" under certain mount points along this path.
> >> Then we can open() such things and mount() to them. But this is not an
> >> API kernel guys would allow us to have :) and I understand why. When
> >> thinking how to overcome this with existing kernel APIs, I found only
> >> two ways to deal with overmounts.
> >>
> >> The first is to temporarily move the required mountpoints out of the
> >> way, do what we need (open/mount/bind/chdir), then move them back. The
> >> second would be to open() all the mountpoints we create at the time we
> >> create them, and then fix all the path resolution code to use openat()
> >> instead of open() (and mountat() instead of mount()).
> > 
> > Sorry, I didn't understand the second way; that is a strategy on
> > restore, but what happens on dump? How do you get at the underlying
> > mountpoint?
> 
> Yes, that's the strategy for restore. As far as the dump is concerned -- we
> only need to "get" one in terms of reading the information from /proc. One
> exception to this rule would be tmpfs, which we need to tar. For such cases
> we can use the former technique -- move the overmounted mountpoints aside
> temporarily.
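
As a small, unprivileged illustration of the second (restore-side) idea, the
Python sketch below opens a directory once and then resolves names relative to
that descriptor instead of by path. No real mounts are involved: renaming the
directory stands in for "the path no longer leads here", which is only an
analogy for an overmount. The helper names are invented for this example, and
the mountat() mentioned in the thread is not used here.

```python
import os
import shutil
import tempfile


def open_mountpoint(path):
    """Open a directory and return a descriptor that outlives the path."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def write_relative(dir_fd, name, data):
    """Create or overwrite name inside the directory behind dir_fd."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(name, flags, 0o600, dir_fd=dir_fd)  # openat() under the hood
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def read_relative(dir_fd, name):
    """Read name inside the directory behind dir_fd."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()


def demo():
    base = tempfile.mkdtemp()
    try:
        old = os.path.join(base, "mnt")
        os.mkdir(old)
        dir_fd = open_mountpoint(old)  # taken "at creation time"
        try:
            # The original path stops resolving to this directory.
            os.rename(old, os.path.join(base, "moved"))
            write_relative(dir_fd, "state", b"hello")
            return read_relative(dir_fd, "state")
        finally:
            os.close(dir_fd)
    finally:
        shutil.rmtree(base)


if __name__ == "__main__":
    print(demo())
```

The point is only that a descriptor taken when the directory is created keeps
working after path lookup stops reaching it. Whether that carries over cleanly
to every overmount case on restore is exactly what is being discussed here.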

By reading the information from /proc here you mean reading things like
open/ghost fds as well as the actual order in which things are
mounted, right?

Tycho

> -- Pavel
>