[CRIU] overmount confusion

Wed Apr 1 10:50:25 PDT 2015

Hi Pavel,

On Wed, Apr 01, 2015 at 06:00:33PM +0300, Pavel Emelyanov wrote:
> 
> >>>>> I patched it to allow overmounts (i.e. skip this warning if
> >>>>> a flag is passed), but then it fails to open mount 122 with:
> >>>>>
> >>>>> (00.139107) Error (mount.c:762): The file system 0x29 (0x2a) tmpfs ./sys/fs/cgroup is inaccessible
> >>>>>
> >>>>> so it seems that the current overmount detection code is not
> >>>>> aggressive enough, since it only checks the sibling mounts instead of
> >>>>> the whole mount tree.
> >>>>
> >>>> I think the code is correct. We look for overmounts on m's parent only
> >>>> because if m is overmounted by something higher, then m's parent
> >>>> will be overmounted too and CRIU will detect this when checking m's
> >>>> parent itself.
> >>>
> >>> But shouldn't it detect the /sys/fs/cgroup case above?
> >>
> >> Well, I believe it does. You get the "is overmounted" message on unmodified
> >> CRIU sources, don't you?
> > 
> > Only for /sys/fs/cgroup/cgmanager (and I set a flag there to avoid
> > trying to do any work when dumping it later). For this one it doesn't
> > detect that it is overmounted, so that flag isn't set and my code
> > doesn't mount it.
> 
> Hm... The 22:/sys/fs/cgroup is not overmounted according to mountinfo.
> The directory itself has another mount on top of it (128'th one), but
> it doesn't count as overmount.

Oh, ok. I thought these were the same thing. Doesn't it at least have
the same problem (i.e. the underlying mount being inaccessible) and
potential set of solutions?
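
For reference, here is a rough sketch of a sibling-only check like the one described above, in Python rather than CRIU's C. It is not CRIU's actual code: the function names and the mount IDs in the usage note are made up for illustration, and the parser reads only the first five mountinfo fields and ignores octal escapes in paths.

```python
def parse_mountinfo(text):
    """Parse mount ID, parent ID and mount point from mountinfo-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mounts.append({
            "id": int(fields[0]),
            "parent": int(fields[1]),
            "mountpoint": fields[4],
        })
    return mounts


def covers(upper, lower):
    """True if path `upper` equals `lower` or is an ancestor directory of it."""
    if upper == lower:
        return True
    return lower.startswith(upper.rstrip("/") + "/")


def overmounted_by_sibling(mounts):
    """Return (hidden_id, hiding_id) pairs, comparing only mounts that share a parent."""
    pairs = []
    for m in mounts:
        for t in mounts:
            if t is m or t["parent"] != m["parent"]:
                continue
            if covers(t["mountpoint"], m["mountpoint"]):
                pairs.append((m["id"], t["id"]))
    return pairs
```

With a hypothetical tree where mounts 2 (/a/b) and 3 (/a) are both children of the root mount 1, the check reports that 2 is hidden by 3. Note that two siblings with identical mount points would each report the other, since this sketch does not look at mount order.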

> >>>>> Two questions:
> >>>>>
> >>>>> 1. Should the overmount code actually check the whole tree? If so, I
> >>>>>    can send a patch.
> >>>>> 2. What can we do in the overmount case? As I understand it, the only
> >>>>>    answer is "add an --allow-overmounts flag and trust the user to
> >>>>>    know what she's doing". Is there something better?
> >>>>
> >>>> I'm not sure that hoping the user knows what he's doing would work.
> >>>> Upon restore we will have to do something with overmounted mounts
> >>>> anyway and ignoring them is probably not an option.
> >>>
> >>> I guess it works for my case (because I know that the stuff was
> >>> mounted before the container started, so there won't be any open
> >>> files in the undermount), but yeah, that doesn't work generally.
> >>>
> >>>> We actually have a big issue with overmounts. An open file can also be
> >>>> overmounted, and we don't dump this case either. Mapped files, cwd-s and
> >>>> unix sockets also complicate things, especially the sockets. In the
> >>>> perfect world we should have some way to look up a directory by a path
> >>>> with an ability to "dive" under certain mount points along this path.
> >>>> Then we can open() such things and mount() to them. But this is not an
> >>>> API kernel guys would allow us to have :) and I understand why. When
> >>>> thinking how to overcome this with existing kernel APIs, I found only
> >>>> two ways to deal with overmounts.
> >>>>
> >>>> First is to temporarily move the required mountpoints out of the way,
> >>>> do what we need (open/mount/bind/chdir), then move them back. The
> >>>> second way would be to open() all the mountpoints we create at the
> >>>> time we create them and then fix all the path resolution code to
> >>>> use openat() instead of open() (and mountat() instead of mount()).
> >>>
> >>> Sorry, I didn't understand the second way; that is a strategy on
> >>> restore, but what happens on dump? How do you get at the underlying
> >>> mountpoint?
> >>
> >> Yes, that's the strategy for restore. As far as the dump is concerned -- we
> >> only need to "get" one in terms of reading the information from /proc. One
> >> exception to this rule would be tmpfs, which we need to tar. For such cases
> >> we can use the former technique -- move the overmounted mountpoints aside
> >> temporarily.
> > 
> > By reading the information from proc here you mean reading things like
> > open/ghost fds as well as the actual order in which things are
> > mounted, right?
> 
> Well, I mean on dump we only need the mount tree, which can be obtained from
> /proc/pid/mountinfo. We don't need to mess with the FS-s themselves. Even
> if we dump a file that is opened and then overmounted we don't need to
> "dive under" the hiding mountpoint or move it aside -- we just check that
> the path we think this file has (by readlink-ing the /proc/pid/fd link)
> resolves to the wrong file (by stat()-ing this path), then we compare the
> mount-id of this file (got from /proc/pid/fdinfo/fd) with the mount tree
> we have and see that the file is indeed overmounted. That's
> (should be) enough for dump. On restore we'll have to open this file "under"
> the overmounting mount and for this we would need to "dive under" or move
> mounts.
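
A minimal sketch of the dump-side check described above, in Python rather than CRIU's C. The helper names are hypothetical. It assumes the kernel exposes `mnt_id` in fdinfo and that the caller resolves paths in the same mount namespace as the target process. It also ignores details such as the " (deleted)" suffix on unlinked files.

```python
import os


def parse_mnt_id(fdinfo_text):
    """Return the mnt_id value from the text of /proc/<pid>/fdinfo/<fd>, or None."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "mnt_id":
            return int(value)
    return None


def path_is_hidden(fd_stat, path_stat):
    """True if the path no longer resolves to the object the fd refers to."""
    if path_stat is None:
        return True  # the path does not resolve at all
    return (fd_stat.st_dev, fd_stat.st_ino) != (path_stat.st_dev, path_stat.st_ino)


def check_fd(pid, fd):
    """Return (path, mnt_id, hidden) for one open file descriptor of a process."""
    link = f"/proc/{pid}/fd/{fd}"
    path = os.readlink(link)   # the path we think the file has
    fd_stat = os.stat(link)    # follows the link to the open file itself
    try:
        path_stat = os.stat(path)  # what the path resolves to right now
    except OSError:
        path_stat = None
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        mnt_id = parse_mnt_id(f.read())
    return path, mnt_id, path_is_hidden(fd_stat, path_stat)
```

If `hidden` comes back true, the returned `mnt_id` can then be looked up in the mount tree parsed from mountinfo to confirm that the file lives on a mount that something else now covers.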

I see (unless the underlying mount is a tmpfs, I guess, then we still
have to "dive under" to tar it, right?). It does seem nicer to do
*at() everywhere. Do we need a mount/bind-at system call if we mount
things in the right order (but do an open(O_DIRECTORY) before mounting
things on top of it, as you suggested above)?
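
To make the question concrete, here is a small sketch of the open-before-overmount idea, in Python rather than CRIU's C (`os.open` with `dir_fd` is Python's wrapper for openat()). The helper names are hypothetical, and the demo mounts nothing, so it only shows the descriptor-relative lookup. The assumption is that a directory fd taken before the overmount keeps referring to the underlying directory afterwards.

```python
import os
import tempfile


def open_dir(path):
    """Open a directory and return its fd, to be kept for later *at() lookups."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def read_under(dir_fd, name):
    """Read a file relative to dir_fd instead of resolving an absolute path."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "data"), "wb") as f:
            f.write(b"hello")
        dir_fd = open_dir(tmp)
        try:
            # A real restore would mount something over `tmp` at this point;
            # the lookup below starts from dir_fd, not from the path.
            return read_under(dir_fd, "data")
        finally:
            os.close(dir_fd)
```

This covers the open() side only. Whether mounting onto such a descriptor needs a new system call is exactly the open question above.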

Thanks,

Tycho

> -- Pavel
>