[CRIU] [PATCH] mnt: allow to add a mount into an existing group

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Wed Mar 24 17:52:08 MSK 2021


Adding Andrew to CC with the right email.

On 3/23/21 3:59 PM, Pavel Tikhomirov wrote:
> Hi! Can we restart the discussion on this topic?
> 
> In CRIU we need to be able to dump/restore all mount trees of system 
> container (CT). CT can have anything inside - users which create their 
> custom mounts configuration, systemd with custom mount namespaces for 
> it's services, nested application containers inside the CT with their 
> own mount namespaces, and all mounts in CT mount trees can be grouped by 
> sharing groupes (e.g. same shared_id + master_id pair), and those groups 
> can depend one from another forming a tree structure of sharing groups.
> 
> 1) Imagine that we have this sharing group tree (in format (shared_id, 
> master_id), 0 means no sharing, we don't care about actual mounts for 
> now only master-slave dependencies between sharing groups):
> 
> (1,0)
>    |- (2,1)
>    |- (3,1)
>         |- (4,3)
>              |- (0,4)
> 
> The main problem of restoring mounts is the fact that sharing groups 
> currently can be only inherited, e.g. if you have one mount (first) with 
> shared_id = x, master_id = y, the only way to get another mount with 
> (x,y) is to create a bindmount from the first mount. Also to create 
> mount (y,z) from mount (x,y) one should also first inherit (x,y) via 
> bindmount and than change to (y,z).
> 
> This means that mentioned above tree puts restriction on the mounts 
> creation order, one need to have at least one mount for each of sharing 
> groups (1,0), (3,1) and (4,3) before creating the first mount of the 
> sharing group (0,4).
> 
> But what if we want to mount (restore) actual mounts in this mount tree 
> "reverse" order:
> 
> mntid    parent    mountpoint    (shared_id, master_id)
> 101    0    /tmp        (0,4)
> 102    101    /tmp        (4,3)
> 103    102    /tmp        (3,1)
> 104    103    /tmp        (1,0)
> 
> Mount 104's sharing group should be created before mount 101, 102 and 
> 103 sharing groups, but mount 104 should be created after those mounts. 
> One can actually prepare this setup (on mainstream kernel) by 
> pre-creating sharing groups elsewhere and then binding to /tmp in proper 
> order with careful unmounting of propagations (see test.sh attached):
> 
> [root at snorch propagation-tests]# bash ../test.sh
> ------------
> 960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - 
> tmpfs propagation-tests-src rw,inode64
> 958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 
> master:451 - tmpfs propagation-tests-src rw,inode64
> 961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 
> master:433 - tmpfs propagation-tests-src rw,inode64
> 963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime 
> shared:433 - tmpfs propagation-tests-src rw,inode64
> ------------
> 
> But this "pre-creating" from test.sh is not universal at all and only 
> works for this simple case. CRIU does not know anything about the 
> history of mount creation for system container, it also does not know 
> anything about any temporary mounts which were used and then removed. So 
> understanding the proper order is almost impossible like Andrew says.
> 
> I've also prepared a presentation on Linux Plumbers last year about how 
> much problems propagation brings to mounts restore in CRIU, you can take 
> a look here https://www.linuxplumbersconf.org/event/7/contributions/640/
> 
> 2) Propagation creates tons of mounts
> 3) Mount reparenting
> 4) "Mount trap"
> 5) "Non-uniform" propagation
> 6) “Cross-namespace” sharing groups
> 
> Allowing to create mounts private first and create sharing groups later 
> and copy sharing groups later instead of inheriting them resolves all 
> the problems with propagation at once.
> 
> One can take a look on the implementation of sharing group restore in 
> CRIU if we have this (mnt: allow to add a mount into an existing group) 
> patch applied: 
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898 
> 
> 
> Obviously this does not solve all the problems with mounts I know about 
> but it's a big step forward in properly supporting them in CRIU. We 
> already have this tested in Virtuozzo for almost a year and it works nice.
> 
> Notes:
> 
> - There is another idea, but I should say early that I don't like it, 
> because with it restoring mounts with criu would be still super complex. 
> We can add extra flag to mount/move_mount syscall to disable propagation 
> temporary so that CRIU can restore the mount tree without problems 2-5, 
> also we can now create cross-namespace bindmounts with 
> (copy_tree+move_mount) to solve 6. But this solution does not help much 
> with problem 1 - ordering and the need of temporary mounts. As you can 
> see in test.sh you would still need to think hard to solve different 
> similar configurations of reverse order between mounts and sharing groups.
> 
> - We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If 
> both namespaces are non abstract.) We can use open_tree to create a copy 
> of the mount with the same sharing group and only then copy sharing from 
> the copy while being in proper mountns.
> 
> - We still need it:
> 
>  > this code might be made unnecessary by allowing bind mounts between
>  > mount namespaces.
> 
> No, because of problem 1. Guessing right order would be still to complex.
> 
> - This approach does not allow creation of any "bad" trees.
> 
>  > Can they create loops in mount propagation trees that we can not 
> create today?
> 
> There would be no loops in "sharing groups tree" for sure, as this new 
> MS_SET_GROUP only adds one _private_ mount to one group (without moving 
> between groups), the tree itself is unchanged after mount(MS_SET_GROUP).
> 
> - Probably mount syscall is not the right place for MS_SET_GROUP. I see 
> new syscall mount_setattr, first I thought reworking MS_SET_GROUP to be 
> a part of it, but interface of mount_setattr for copying is not 
> convenient. Probably we can add MS_SET_GROUP flag to move_mount which 
> has exactly what we want path to source and destination relative to fd:
> 
> static inline int move_mount(int from_dfd, const char *from_pathname,
>                               int to_dfd, const char *to_pathname,
>                               unsigned int flags)
> 
> As in mount-v2 now I had to use proc hacks to access mounts at dfd:
> 
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L923 
> 
> 
> - I hope that we still have a chance for MS_SET_GROUP, this way we can 
> port mount-v2 to mainstream CRIU.
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


More information about the CRIU mailing list