[CRIU] overmount confusion

Pavel Emelyanov xemul at parallels.com
Tue Mar 31 12:02:51 PDT 2015


On 03/31/2015 06:02 PM, Tycho Andersen wrote:
> Hi all,
> 
> I'm trying to dump a container with the following mount info:
> 
> 71 72 253:1 /var/lib/lxd/lxc/mig/rootfs / rw,relatime - ext4 /dev/disk/by-uuid/6c5a78e0-95fa-49a8-aa91-a8093d295e58 rw,data=ordered
> 102 71 0:37 / /dev rw,relatime - tmpfs none rw,size=100k,mode=755
> 103 71 0:39 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 104 105 0:39 /sys/net /proc/sys/net rw,nosuid,nodev,noexec,relatime - proc proc rw
> 105 103 0:39 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
> 106 103 0:39 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
> 107 71 0:40 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 108 107 0:40 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 109 108 0:40 / /sys/devices/virtual/net rw,relatime - sysfs sysfs rw
> 110 109 0:40 /devices/virtual/net /sys/devices/virtual/net rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 111 108 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
> 112 108 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
> 113 108 0:23 / /sys/fs/pstore rw,relatime - pstore none rw
> 114 102 0:5 /console /dev/console rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 115 102 0:5 /full /dev/full rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 116 102 0:5 /null /dev/null rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 117 102 0:5 /random /dev/random rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 118 102 0:5 /tty /dev/tty rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 119 102 0:5 /urandom /dev/urandom rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 120 102 0:5 /zero /dev/zero rw,relatime - devtmpfs udev rw,size=989132k,nr_inodes=247283,mode=755
> 121 105 0:3 /sys/fs/binfmt_misc /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
> 123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> 124 103 0:36 /proc/cpuinfo /proc/cpuinfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 125 103 0:36 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 126 103 0:36 /proc/stat /proc/stat rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 127 103 0:36 /proc/uptime /proc/uptime rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
> 129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> 130 128 0:36 /cgroup/blkio /sys/fs/cgroup/blkio rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 131 128 0:36 /cgroup/cpu /sys/fs/cgroup/cpu rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 132 128 0:36 /cgroup/cpuacct /sys/fs/cgroup/cpuacct rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 133 128 0:36 /cgroup/cpuset /sys/fs/cgroup/cpuset rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 134 128 0:36 /cgroup/devices /sys/fs/cgroup/devices rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 135 128 0:36 /cgroup/freezer /sys/fs/cgroup/freezer rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 136 128 0:36 /cgroup/hugetlb /sys/fs/cgroup/hugetlb rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 137 128 0:36 /cgroup/memory /sys/fs/cgroup/memory rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 138 128 0:36 /cgroup/name=systemd /sys/fs/cgroup/systemd rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 139 128 0:36 /cgroup/net_cls /sys/fs/cgroup/net_cls rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 140 128 0:36 /cgroup/net_prio /sys/fs/cgroup/net_prio rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 141 128 0:36 /cgroup/perf_event /sys/fs/cgroup/perf_event rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
> 73 102 0:43 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666

OMG! Just curious -- what kind of container is it? %)

> In particular the lines:
> 
> 122 108 0:41 / /sys/fs/cgroup rw,relatime - tmpfs cgroup rw,size=12k,mode=755
> 128 122 0:42 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
> 
> 123 122 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> 129 128 0:18 /cgmanager /sys/fs/cgroup/cgmanager rw,relatime - tmpfs none rw,size=4k,mode=755
> 
> are interesting. If I try to dump this container, criu tells me:
> 
> (00.003931) Error (mount.c:636): 123:./sys/fs/cgroup/cgmanager is overmounted

Hm... Looks like it, yes. It's overmounted by /sys/fs/cgroup itself, isn't it?

> I patched it to allow overmounts (i.e. skip this warning if a flag
> is passed), but then it fails to open mount 122 with:
> 
> (00.139107) Error (mount.c:762): The file system 0x29 (0x2a) tmpfs ./sys/fs/cgroup is inaccessible
> 
> so it seems that the current overmount detection code is not
> aggressive enough, since it only checks the sibling mounts instead of
> the whole mount tree.

I think the code is correct. We look for overmounts only among the
siblings of m (the children of m's parent) because, if m is overmounted
by something higher up, then m's parent will be overmounted too and
CRIU will detect that when checking the parent itself.
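
Just to illustrate the idea (this is a made-up sketch, not the actual
mount.c code): a mount only needs to be compared against the siblings
mounted after it under the same parent, since anything mounted higher
up hides the parent as well and is caught when the parent is checked.

#include <stdbool.h>
#include <string.h>

struct mnt_info {
        char *mountpoint;               /* path inside the mount namespace */
        struct mnt_info *parent;
        struct mnt_info *next_sibling;  /* parent's children, in mount order */
};

/* True if a mount at @upper hides the path @lower. */
static bool path_covers(const char *upper, const char *lower)
{
        size_t n = strlen(upper);

        return strncmp(upper, lower, n) == 0 &&
               (lower[n] == '\0' || lower[n] == '/' || n == 1);
}

bool mnt_is_overmounted(struct mnt_info *m)
{
        struct mnt_info *s;

        /* Only later siblings can hide m without also hiding m's parent. */
        for (s = m->next_sibling; s; s = s->next_sibling)
                if (path_covers(s->mountpoint, m->mountpoint))
                        return true;

        return false;
}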

> Two questions:
> 
> 1. Should the overmount code actually check the whole tree? If so, I
>    can send a patch.
> 2. What can we do in the overmount case? As I understand it, the only
>    answer is "add an --allow-overmounts flag and trust the user to
>    know what she's doing". Is there something better?

I'm not sure that hoping the user knows what they're doing would work.
Upon restore we will have to do something with overmounted mounts
anyway, and ignoring them is probably not an option.

We actually have a big issue with overmounts. An opened file can also
be overmounted, and we don't dump that case either. Mapped files, cwd-s
and unix sockets complicate things further, especially the sockets. In
a perfect world we would have some way to look up a directory by path
with the ability to "dive" under certain mount points along that path.
Then we could open() such things and mount() onto them. But this is not
an API the kernel guys would let us have :) and I understand why. When
thinking about how to overcome this with existing kernel APIs, I found
only two ways to deal with overmounts.

The first is to temporarily move the offending mounts out of the way,
do what we need (open/mount/bind/chdir), and then move them back. The
second way would be to open() all the mountpoints we create at the time
we create them, and then fix all the path resolution code to use
openat() instead of open() (and mountat() instead of mount()).

The first way is more time consuming, as each path resolution may
result in two mount moves. And it's not clear how mount propagation
would behave in this case :(
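
To make that option concrete, here is a rough sketch of the move-away
dance (the scratch path and helper name are invented for the example,
this is not CRIU code):

#include <stdio.h>
#include <sys/mount.h>

/*
 * Temporarily relocate the mount that hides the path we need, run the
 * real open/mount/bind/chdir, then put it back. With shared
 * propagation the MS_MOVE itself may fail or get replicated elsewhere,
 * which is exactly the open question above.
 */
int with_mount_moved_away(const char *overmount, const char *scratch,
                          int (*work)(void *), void *arg)
{
        int ret;

        if (mount(overmount, scratch, NULL, MS_MOVE, NULL) < 0) {
                perror("move overmount away");
                return -1;
        }

        ret = work(arg);

        if (mount(scratch, overmount, NULL, MS_MOVE, NULL) < 0) {
                perror("move overmount back");
                return -1;
        }

        return ret;
}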

The second way looks more promising to me, but we don't have a bindat()
system call :( Also it's not quite clear where to keep that many open
fds during the whole restore, but this is minor.
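
And a tiny illustration of the fd-pinning idea behind the second
option (the paths are just examples): an fd opened on a mountpoint
right after we create the mount should keep pointing into that mount,
so openat() relative to it still resolves under a later overmount on
the same path. What's missing is the bindat()/mountat() counterpart.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Taken right after mounting the tmpfs on /sys/fs/cgroup,
         * before anything can be stacked on top of it. */
        int mnt_fd = open("/sys/fs/cgroup", O_RDONLY | O_DIRECTORY);
        if (mnt_fd < 0) {
                perror("open mountpoint");
                return 1;
        }

        /* Even if /sys/fs/cgroup is overmounted later, this lookup
         * starts from the pinned mount, not from the topmost one. */
        int fd = openat(mnt_fd, "cgmanager", O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                perror("openat under the overmount");
        else
                close(fd);

        close(mnt_fd);
        return 0;
}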

-- Pavel


