[Devel] [PATCH RHEL COMMIT] mnt: allow to add a mount into an existing group

Konstantin Khorenko khorenko at virtuozzo.com
Mon Oct 4 17:02:19 MSK 2021


The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit 5bf218e2fbe4d3345d31b56df02564cd8cbb38da
Author: Andrei Vagin <avagin at openvz.org>
Date:   Mon Oct 4 17:02:19 2021 +0300

    mnt: allow to add a mount into an existing group
    
    Currently a shared group can only be inherited from a source mount.
    This patch adds the ability to add a mount to an existing shared
    group.
    
    mount(source, target, NULL, MS_SET_GROUP, NULL)
    
    mount() with the MS_SET_GROUP flag adds the "target" mount to the group
    of the "source" mount. The calling process must have the CAP_SYS_ADMIN
    capability in the namespaces of these mounts. The source and the target
    mounts must have the same super block.
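    
    A minimal userspace sketch of the call above, assuming a kernel that
    carries this patch. The wrapper name set_mount_group() is hypothetical,
    and the fallback definition of MS_SET_GROUP simply mirrors the value
    this patch adds to include/uapi/linux/mount.h:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Assumption: libc headers do not know this flag, so mirror the value
 * that this patch adds to include/uapi/linux/mount.h. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1UL << 26)
#endif

/*
 * Hypothetical helper: ask the kernel to add the mount at "target" to the
 * group of the mount at "source". Returns 0 on success or a negative errno
 * value. Only meaningful on a kernel that carries this patch.
 */
int set_mount_group(const char *source, const char *target)
{
	/* Reject missing or empty paths before entering the kernel. */
	if (!source || !*source || !target || !*target)
		return -EINVAL;

	/* No filesystem type and no data: MS_SET_GROUP is a command. */
	if (mount(source, target, NULL, MS_SET_GROUP, NULL) < 0)
		return -errno;

	return 0;
}
```

    For example, set_mount_group("/mnt/a", "/mnt/b") would request that the
    mount at /mnt/b join the group of the mount at /mnt/a; both paths are
    placeholders.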
    
    This new functionality together with "mnt: Tuck mounts under others
    instead of creating shadow/side mounts." allows CRIU to dump and restore
    any set of mount namespaces.
    
    We currently have a lot of issues with dumping and restoring mount
    namespaces. The biggest problem is that we can't construct mount trees
    directly, for several reasons:
    * groups can't be set, they can be only inherited
    * file systems have to be mounted from the specified user namespaces
    * the mount() syscall doesn't just create one mount -- the mount is
      also propagated to all members of a parent group
    * umount() doesn't detach mounts from all members of a group
      (mounts with children are not umounted)
    * mounts are propagated underneath existing mounts
    * mount() doesn't allow bind-mounts between two namespaces
    * processes can have open file descriptors to overmounted files
    
    All these operations are non-trivial, making the task of restoring
    a mount namespace practically unsolvable in reasonable time. The
    proposed change allows restoring a mount namespace directly,
    without any overly complex logic.
    
    Cc: Eric W. Biederman <ebiederm at xmission.com>
    Cc: Alexander Viro <viro at zeniv.linux.org.uk>
    Signed-off-by: Andrei Vagin <avagin at openvz.org>
    
    The patch has been sitting on LKML for a long time without much review:
    https://patchwork.kernel.org/patch/9703885/
    
    But with it we can implement correct mount restore in vzcriu much
    more easily.
    
    Add some restrictions: a) prohibit setting a group on a non-mnt_root
    dentry; b) prohibit a destination mount in a non-current mntns; c) allow
    only a super or pseudosuper ve to set a group.
    
    https://jira.sw.ru/browse/PSBM-58617
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    > Is it OK to have the flag's semantics overloaded?
    1) do_mount is only called from syscalls:
    
    sys_osf_mount
       osf_ufs_mount
       osf_procfs_mount
       osf_cdfs_mount
         do_mount
    
    compat_sys_mount
       do_mount
    
    sys_mount
       do_mount
    
    2) previously MS_SUBMOUNT was explicitly ignored in vz7 in do_mount
    because it is a kernel-internal flag:
    
    flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
       MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
       MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
    
    In ms and vz8 it is a bit more complex but still ignored, because it is
    a kernel-internal flag and userspace can't set it.
    
    If we add MS_SET_GROUP with the same number as MS_SUBMOUNT but only
    check it in do_mount, where it was previously ignored, it looks OK to me.
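    
    The reasoning above can be illustrated with a toy userspace model (not
    kernel code). The enum and classify_mount() are hypothetical names; the
    flag values are the ones from the uapi header, and only the MS_MOVE and
    MS_SET_GROUP checks of the real dispatch are modelled:

```c
/* Flag values as in include/uapi/linux/mount.h with this patch applied. */
#define MS_MOVE      8192UL
#define MS_SET_GROUP (1UL << 26) /* command, visible to userspace */
#define MS_SUBMOUNT  (1UL << 26) /* sb flag, internal to the kernel */

/* Hypothetical result type for this model only. */
enum mount_op { OP_MOVE, OP_SET_GROUP, OP_NEW_MOUNT };

/*
 * Toy model of the command dispatch at the mount syscall entry point.
 * Bit 26 arriving here can only come from userspace, so at this point it
 * is read as the MS_SET_GROUP command; the internal MS_SUBMOUNT meaning
 * of the same bit is never consulted in this path.
 */
enum mount_op classify_mount(unsigned long flags)
{
	if (flags & MS_MOVE)
		return OP_MOVE;
	if (flags & MS_SET_GROUP)
		return OP_SET_GROUP;
	return OP_NEW_MOUNT;
}
```

    Because both names map to the same bit, the overload is only safe as
    long as the command is checked solely in this syscall path.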
    
    +++
    mnt: relax the restrictions of MS_SET_GROUP
    
    At first glance it looked nice to check that the source path from
    which we want to copy sharing is the root of its mount, to make the
    interface more predictable. But it turned out to be a pain for external
    mount restore and for ct root mount restore to look up the actual mount
    path in the host mount namespace instead of just relying on a path to a
    subdirectory on this mount, which is already given to us by the user.
    
    For instance, when we do bind-mounts for these root and external mounts
    we use a subdirectory as the source and it's OK.
    
    Also, at first glance it looked nice to only allow setting sharing
    for a mount in the current mntns. But that is also a pain for criu,
    because we can have many mounts with the same shared_id and master_id
    (from the same sharing group) in different mount namespaces, and in the
    worst case we would need to do an extra setns for each mount, which is a
    pure waste of resources. So let's allow copying sharing options even if
    all namespaces are different (current mntns != source mntns !=
    destination mntns)
    (note: mounts from an alien mntns can be accessed through /proc/pid/fd/id).
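    
    A rough sketch of how the /proc/pid/fd/id note could be used from
    userspace, assuming a kernel with this patch. Both helper names are
    hypothetical, and the fallback definition of MS_SET_GROUP mirrors the
    value from this patch:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Assumption: mirror the flag value added by this patch. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1UL << 26)
#endif

/*
 * Hypothetical helper: format "/proc/<pid>/fd/<fd>" into buf.
 * Returns the path length, or -1 if buf is too small.
 */
int proc_fd_path(char *buf, size_t size, pid_t pid, int fd)
{
	int n = snprintf(buf, size, "/proc/%ld/fd/%d", (long)pid, fd);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}

/*
 * Hypothetical helper: copy sharing from the mount behind src_fd to the
 * mount behind dst_fd, where both descriptors are held by process "pid"
 * and may refer to mounts in mount namespaces other than the caller's.
 * No setns() is needed because both mounts are reached through procfs.
 */
int set_group_by_fds(pid_t pid, int src_fd, int dst_fd)
{
	char src[64], dst[64];

	if (proc_fd_path(src, sizeof(src), pid, src_fd) < 0 ||
	    proc_fd_path(dst, sizeof(dst), pid, dst_fd) < 0)
		return -ENAMETOOLONG;

	if (mount(src, dst, NULL, MS_SET_GROUP, NULL) < 0)
		return -errno;
	return 0;
}
```

    The descriptors would typically be opened on the mount roots while the
    process is still inside the respective mount namespace.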
    
    https://jira.sw.ru/browse/PSBM-58617
    
    note: applies both to vz7 and vz8.
    
    Fixes: ("mnt: allow to add a mount into an existing group")
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    (cherry picked from vz8 commit 5bbdf2e271b788470197439732ffd6983b7294c8)
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
 fs/namespace.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/mount.h |  6 +++++
 2 files changed, 64 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index bc2819c49315..813e3a172ecb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2853,6 +2853,62 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *path, const char *sibling_name)
+{
+	struct ve_struct *ve = get_exec_env();
+	struct mount *sibling, *mnt;
+	struct path sibling_path;
+	int err;
+
+	if (!ve_is_super(ve) && !ve->is_pseudosuper)
+		return -EPERM;
+
+	if (!sibling_name || !*sibling_name)
+		return -EINVAL;
+
+	if (path->dentry != path->mnt->mnt_root)
+		return -EINVAL;
+
+	err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
+	if (err)
+		return err;
+
+	err = -EINVAL;
+	sibling = real_mount(sibling_path.mnt);
+	mnt = real_mount(path->mnt);
+
+	namespace_lock();
+
+	err = -EPERM;
+	if (!sibling->mnt_ns ||
+	    !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out_unlock;
+
+	err = -EINVAL;
+	if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
+		goto out_unlock;
+
+	if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
+		goto out_unlock;
+
+	if (IS_MNT_SLAVE(sibling)) {
+		list_add(&mnt->mnt_slave, &sibling->mnt_slave);
+		mnt->mnt_master = sibling->mnt_master;
+	}
+
+	if (IS_MNT_SHARED(sibling)) {
+		mnt->mnt_group_id = sibling->mnt_group_id;
+		list_add(&mnt->mnt_share, &sibling->mnt_share);
+		set_mnt_shared(mnt);
+	}
+
+	err = 0;
+out_unlock:
+	namespace_unlock();
+	path_put(&sibling_path);
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3400,6 +3456,8 @@ int path_mount(const char *dev_name, struct path *path,
 		return do_change_type(path, flags);
 	if (flags & MS_MOVE)
 		return do_move_mount_old(path, dev_name);
+	if (flags & MS_SET_GROUP)
+		return do_set_group(path, dev_name);
 
 	return do_new_mount(path, type_page, sb_flags, mnt_flags, dev_name,
 			    data_page);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd7a166fdf9c..9a63206be317 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -38,6 +38,12 @@
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
+/*
+ * Here are commands and flags. Commands are handled in do_mount()
+ * and can intersect with kernel internal flags.
+ */
+#define MS_SET_GROUP   (1<<26) /* Add a mount into a shared group */
+
 /* These sb flags are internal to the kernel */
 #define MS_SUBMOUNT     (1<<26)
 #define MS_NOREMOTELOCK	(1<<27)

