[CRIU] overmount confusion

Pavel Emelyanov xemul at parallels.com
Wed Apr 1 12:58:10 PDT 2015


On 04/01/2015 08:50 PM, Tycho Andersen wrote:
> Hi Pavel,
> 
> On Wed, Apr 01, 2015 at 06:00:33PM +0300, Pavel Emelyanov wrote:
>>
>>>>>>> I patched it to have a flag to allow overmounts (i.e. skip this
>>>>>>> warning if the flag is passed), but then it fails to open mount 122 with:
>>>>>>>
>>>>>>> (00.139107) Error (mount.c:762): The file system 0x29 (0x2a) tmpfs ./sys/fs/cgroup is inaccessible
>>>>>>>
>>>>>>> so it seems that the current overmount detection code is not
>>>>>>> aggressive enough, since it only checks the sibling mounts instead of
>>>>>>> the whole mount tree.
>>>>>>
>>>>>> I think the code is correct. We seek for overmounts on m's parent only
>>>>>> because if m is overmounted by something higher, then m's parent
>>>>>> will be overmounted too and CRIU will detect this when checking m's
>>>>>> parent itself.
>>>>>
>>>>> But shouldn't it detect the /sys/fs/cgroup case above?
>>>>
>>>> Well, I believe it is. You get the "is overmounted" message on unmodified
>>>> CRIU sources, don't you?
>>>
>>> Only for /sys/fs/cgroup/cgmanager (and I set a flag there to avoid
>>> trying to do any work when dumping it later). For this one it doesn't
>>> detect that it is overmounted, so that flag isn't set and my code
>>> doesn't mount it.
>>
>> Hm... The 22:/sys/fs/cgroup is not overmounted according to mountinfo.
>> The directory itself has another mount on top of it (128'th one), but
>> it doesn't count as an overmount.
> 
> Oh, ok. I thought these were the same thing. Doesn't it at least have
> the same problem (i.e. the underlying mount being inaccessible) and
> potential set of solutions?

Yes, it has :) If someone opens /sys/fs/cgroup/mem/tasks, then mounts a
tmpfs on top of /sys/fs/cgroup, then the opened file becomes overmounted
and we have to do the moving/diving tricks we're discussing below.

>>>>>>> Two questions:
>>>>>>>
>>>>>>> 1. Should the overmount code actually check the whole tree? If so, I
>>>>>>>    can send a patch.
>>>>>>> 2. What can we do in the overmount case? As I understand it, the only
>>>>>>>    answer is "add an --allow-overmounts flag and trust the user to
>>>>>>>    know what she's doing". Is there something better?
>>>>>>
>>>>>> I'm not sure that hoping that the user knows what he's doing would work.
>>>>>> Upon restore we will have to do something with overmounted mounts
>>>>>> anyway and ignoring them is probably not an option.
>>>>>
>>>>> I guess it works for my case (because I know that the stuff was
>>>>> mounted before the container started, so there won't be any open
>>>>> files in the undermount), but yeah, that doesn't work generally.
>>>>>
>>>>>> We actually have a big issue with overmounts. An opened file can also
>>>>>> be overmounted, and we don't dump this case either. Mapped files, cwd-s and
>>>>>> unix sockets also complicate things, especially the sockets. In the
>>>>>> perfect world we should have some way to look-up a directory by a path
>>>>>> with an ability to "dive" under certain mount points along this path.
>>>>>> Then we can open() such things and mount() to them. But this is not an
>>>>>> API kernel guys would allow us to have :) and I understand why. When
>>>>>> thinking how to overcome this with existing kernel APIs, I found only
>>>>>> two ways to deal with overmounts.
>>>>>>
>>>>>> First is to temporarily move the required mountpoints out of the way,
>>>>>> do what we need (open/mount/bind/chdir), then move them back. The
>>>>>> second way would be to open() all the mountpoints we create at the
>>>>>> time we create them and then fix all the path resolution code to
>>>>>> use openat() instead of open() (and mountat() instead of mount()).
>>>>>
>>>>> Sorry, I didn't understand the second way; that is a strategy on
>>>>> restore, but what happens on dump? How do you get at the underlying
>>>>> mountpoint?
>>>>
>>>> Yes, that's the strategy for restore. As far as the dump is concerned -- we
>>>> only need to "get" one in terms of reading the information from /proc. One
>>>> exception to this rule would be tmpfs, which we need to tar. For such cases
>>>> we can use the former technique -- move the overmounted mountpoints aside
>>>> temporarily.
>>>
>>> By reading the information from proc here you mean reading things like
>>> open/ghost fds as well as the actual order in which things are
>>> mounted, right?
>>
>> Well, I mean on dump we only need the mounts tree, which can be obtained
>> from /proc/pid/mountinfo. We don't need to mess with the FS-s themselves. Even
>> if we dump a file that is opened and then overmounted, we don't need to
>> "dive under" the hiding mountpoint or move it aside -- we just check that
>> the path we think this file has (by readlink-ing the /proc/pid/fd link)
>> resolves into the wrong one (by stat()-ing this path), then we compare the
>> mount-id of this file (obtained from /proc/pid/fdinfo/fd) with the
>> mounts tree information we have and see that the file is indeed overmounted.
>> That's (should be) enough for dump. On restore we'll have to open this file
>> "under" the overmounting mount and for this we would need to "dive under"
>> or move mounts.
> 
> I see (unless the underlying mount is a tmpfs, I guess, then we still
> have to "dive under" to tar it, right?). 

Yes, to tar a tmpfs we have to somehow get at the whole tree. For a
non-overmounted tmpfs we bind-mount its root to a temporary location and tar
it. For an overmounted tmpfs we can't simply do that.
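To make the non-overmounted recipe concrete: bind-mount the tmpfs root to a
private location, tar that location, unmount. A hedged Python sketch (the
helper names and the use of mount(8)/tarfile are mine; CRIU does its own
packing in C, and the bind/umount steps need CAP_SYS_ADMIN):

```python
import os
import subprocess
import tarfile
import tempfile

def tar_tree(src_dir, out_tar):
    """Pack a directory tree into a tar archive."""
    with tarfile.open(out_tar, "w") as tar:
        tar.add(src_dir, arcname=".")

def dump_tmpfs(mountpoint, out_tar):
    """Tar a tmpfs from a fresh, non-recursive bind mount of its root.

    The bind mount exposes the tmpfs content without the child mounts
    that may cover parts of the original mountpoint. Needs CAP_SYS_ADMIN.
    """
    tmp = tempfile.mkdtemp()
    try:
        subprocess.check_call(["mount", "--bind", mountpoint, tmp])
        try:
            tar_tree(tmp, out_tar)
        finally:
            subprocess.check_call(["umount", tmp])
    finally:
        os.rmdir(tmp)
```

If the tmpfs root itself is overmounted, the path-based bind grabs the
topmost mount instead of the tmpfs, which is exactly why this simple recipe
stops working in the overmounted case.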

> It does seem nicer to do
> *at() everywhere. Do we need a mount/bind-at system call if we mount
> things in the right order (but do an open(O_DIR) before mounting
> things on top of it, as you suggested above)?

Almost. If we have a / mountpoint and a /foo mountpoint and want to do something
with the /foo/bar file which is not on /foo's fs, but on the fs that used to be
on / before we mounted /foo, then we need an fd pointing to /foo _before_ mounting
on it, to openat() on it (and effectively dive under /foo's fs). Having an fd on
/foo's fs root (i.e. -- /foo after the mountpoint was created) wouldn't help much :)

-- Pavel
