[CRIU] overmount confusion

Tycho Andersen tycho.andersen at canonical.com
Tue Mar 31 13:27:04 PDT 2015


Hi Pavel,

On Tue, Mar 31, 2015 at 10:02:51PM +0300, Pavel Emelyanov wrote:
> On 03/31/2015 06:02 PM, Tycho Andersen wrote:
> > Hi all,
> > 
> > I'm trying to dump a container with the following mount info:
> > 
> > 71 72 253:1 /var/lib/lxd/lxc/mig/rootfs / rw,relatime - ext4 /dev/disk/by-uuid/6c5a78e0-95fa-49a8-aa91-a8093d295e58 rw,data=ordered
> > 102 71 0:37 / /dev rw,relatime - tmpfs none rw,size=100k,mode=755
> > 103 71 0:39 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> > 104 105 0:39 /sys/net /proc/sys/net rw,nosuid,nodev,noexec,relatime - proc proc rw
> > 105 103 0:39 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
> > 106 103 0:39 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
> > 107 71 0:40 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> > 108 107 0:40 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> > 109 108 0:40 / /sys/devices/virtual/net rw,relatime - sysfs sysfs rw
> > 110 109 0:40 /devices/virtual/net /sys/devices/virtual/net rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> > 111 108 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
> > 112 108 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
> > 113 108 0:23 / /sys/fs/pstore rw,relatime - pstore none rw
> > 114 102 0:5 /console /dev/console rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 115 102 0:5 /full /dev/full rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 116 102 0:5 /null /dev/null rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 117 102 0:5 /random /dev/random rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 118 102 0:5 /tty /dev/tty rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 119 102 0:5 /urandom /dev/urandom rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 120 102 0:5 /zero /dev/zero rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> > 121 105 0:3 /sys/fs/binfmt_misc /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime - proc proc rw
> > 122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
> > 123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> > 124 103 0:36 /proc/cpuinfo /proc/cpuinfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 125 103 0:36 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 126 103 0:36 /proc/stat /proc/stat rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 127 103 0:36 /proc/uptime /proc/uptime rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
> > 129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> > 130 128 0:36 /cgroup/blkio /sys/fs/cgroup/blkio rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 131 128 0:36 /cgroup/cpu /sys/fs/cgroup/cpu rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 132 128 0:36 /cgroup/cpuacct /sys/fs/cgroup/cpuacct rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 133 128 0:36 /cgroup/cpuset /sys/fs/cgroup/cpuset rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 134 128 0:36 /cgroup/devices /sys/fs/cgroup/devices rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 135 128 0:36 /cgroup/freezer /sys/fs/cgroup/freezer rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 136 128 0:36 /cgroup/hugetlb /sys/fs/cgroup/hugetlb rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 137 128 0:36 /cgroup/memory /sys/fs/cgroup/memory rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 138 128 0:36 /cgroup/name=systemd /sys/fs/cgroup/systemd rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 139 128 0:36 /cgroup/net_cls /sys/fs/cgroup/net_cls rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 140 128 0:36 /cgroup/net_prio /sys/fs/cgroup/net_prio rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 141 128 0:36 /cgroup/perf_event /sys/fs/cgroup/perf_event rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> > 73 102 0:43 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
> 
> OMG! Just curious -- what kind of container is it? %)

Yeah, all the bind-mount dancing everywhere is for systemd, as I
understand it. Serge can probably tell you more of the details.

> > In particular the lines:
> > 
> > 122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
> > 128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
> > 
> > 123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> > 129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> > 
> > are interesting. If I try to dump this container, criu tells me:
> > 
> > (00.003931) Error (mount.c:636): 123:./sys/fs/cgroup/cgmanager is overmounted
> 
> Hm... Looks like yes. It's overmounted by /sys/fs/cgroup itself, isn't it?

I /think/ so :)
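For concreteness, the shape of mounts 122/123/128 can be reproduced
with something like the sketch below (sources and options are lifted
from the mountinfo dump above; the real lxc/cgmanager sequence is more
involved, so treat this as an assumed approximation):

    /* Sketch: reproduce the 122/123/128 shape from the dump above.
     * Error handling omitted; run in a scratch mount namespace. */
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* mount 122: tmpfs on /sys/fs/cgroup */
        mount("cgroup", "/sys/fs/cgroup", "tmpfs", 0, "size=12k,mode=755");

        /* mount 123: stands in here for the cgmanager bind mount */
        mkdir("/sys/fs/cgroup/cgmanager", 0755);
        mount("none", "/sys/fs/cgroup/cgmanager", "tmpfs", 0, "size=4k,mode=755");

        /* mount 128: another tmpfs over 122's own root, hiding 122 and 123 */
        mount("none", "/sys/fs/cgroup", "tmpfs", 0, "size=4k,mode=755");

        return 0;
    }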

> > I patched it to allow overmounts (i.e. to skip this warning when a
> > flag is passed), but then it fails to open mount 122 with:
> > 
> > (00.139107) Error (mount.c:762): The file system 0x29 (0x2a) tmpfs ./sys/fs/cgroup is inaccessible
> > 
> > so it seems that the current overmount detection code is not
> > aggressive enough, since it only checks the sibling mounts instead of
> > the whole mount tree.
> 
> I think the code is correct. We look for overmounts only among m's
> siblings (i.e. on m's parent) because if m is overmounted by something
> higher up, then m's parent will be overmounted too, and CRIU will
> detect that when it checks m's parent itself.

But shouldn't it detect the /sys/fs/cgroup case above?
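(For concreteness, a sketch of the sibling-only check as described
above, not CRIU's actual mount.c code: it flags mount 123, which is
covered by its later sibling 128, but not mount 122, which is covered
by its own child 128 mounted over its root and so is never seen by a
sibling scan.)

    /* Sketch only -- not CRIU's actual mount.c code. A mount m counts
     * as overmounted when a sibling mounted after it sits at a path
     * equal to, or above, m's mountpoint. */
    #include <stdbool.h>
    #include <string.h>

    struct mnt {
        int mnt_id;
        const char *mountpoint;   /* e.g. "/sys/fs/cgroup" */
        struct mnt *parent;
        struct mnt *next;         /* next sibling, in mount order */
    };

    /* true if sub is path itself or lives underneath it */
    static bool issubpath(const char *sub, const char *path)
    {
        size_t len = strlen(path);

        if (strcmp(path, "/") == 0)
            return true;
        return strncmp(sub, path, len) == 0 &&
               (sub[len] == '\0' || sub[len] == '/');
    }

    static bool is_overmounted(struct mnt *m)
    {
        struct mnt *s;

        /* only siblings mounted after m can cover it; a child of m
         * mounted at m's root (the 128-over-122 case) is never seen */
        for (s = m->next; s; s = s->next)
            if (issubpath(m->mountpoint, s->mountpoint))
                return true;
        return false;
    }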

> > Two questions:
> > 
> > 1. Should the overmount code actually check the whole tree? If so, I
> >    can send a patch.
> > 2. What can we do in the overmount case? As I understand it, the only
> >    answer is "add an --allow-overmounts flag and trust the user to
> >    know what she's doing". Is there something better?
> 
> I'm not sure that hoping the user knows what he's doing would work.
> Upon restore we will have to do something with the overmounted mounts
> anyway, and ignoring them is probably not an option.

I guess it works in my case (I know this stuff was mounted before the
container started, so there won't be any open files in the undermount),
but yeah, that doesn't work in general.

> We actually have a big issue with overmounts. An opened file can also
> be overmounted, and we don't dump that case either. Mapped files, cwds
> and unix sockets complicate things too, especially the sockets. In a
> perfect world we would have some way to look up a directory by path,
> with the ability to "dive" under certain mount points along that path.
> Then we could open() such things and mount() onto them. But this is
> not an API the kernel guys would let us have :) and I understand why.
> Thinking about how to overcome this with existing kernel APIs, I found
> only two ways to deal with overmounts.
> 
> The first is to temporarily move the mountpoints in question out of
> the way, do what we need (open/mount/bind/chdir), and then move them
> back. The second would be to open() every mountpoint we create at the
> time we create it, and then fix all the path resolution code to use
> openat() instead of open() (and mountat() instead of mount()).

Sorry, I didn't understand the second way; that is a strategy for
restore, but what happens on dump? How do you get at the underlying
mountpoint?

Tycho

> The first way is more time consuming, as each path resolution may
> result in two extra mount moves. And it's not clear how propagation
> would work in this case :(
> 
> The second way looks more promising to me, but we don't have a
> bindat() system call :( It's also not quite clear where to keep that
> many open fds during the whole restore, but that is minor.
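(And a sketch of the second way on the restore side: an O_PATH fd taken
when a mount is created keeps the undermount reachable even after
something is mounted over the same path. The missing half is mount(),
since there is no mountat()/bindat(), as noted above.)

    /* Sketch of way two, restore side: keep an fd to each mountpoint
     * as we create it, and resolve later paths with openat() relative
     * to that fd instead of re-walking from "/". */
    #include <fcntl.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int fd_survives_overmount(void)
    {
        int cgroup_fd, f;

        mount("cgroup", "/sys/fs/cgroup", "tmpfs", 0, NULL);

        /* remember the mount we just created */
        cgroup_fd = open("/sys/fs/cgroup", O_PATH | O_DIRECTORY);
        mkdirat(cgroup_fd, "cgmanager", 0755);

        /* someone overmounts the same path... */
        mount("none", "/sys/fs/cgroup", "tmpfs", 0, NULL);

        /* ...the path now resolves into the new tmpfs, but the saved
         * fd still reaches the covered one */
        f = openat(cgroup_fd, "cgmanager", O_DIRECTORY | O_RDONLY);

        return f;
    }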
>
> -- Pavel
> 

