[Devel] [PATCH RH9 01/14] mnt: allow to add a mount into an existing group

Andrey Zhadchenko andrey.zhadchenko at virtuozzo.com
Mon Oct 4 12:17:16 MSK 2021


From: Andrei Vagin <avagin at openvz.org>

Currently a shared group can only be inherited from a source mount.
This patch adds the ability to add a mount to an existing shared
group.

mount(source, target, NULL, MS_SET_GROUP, NULL)

mount() with the MS_SET_GROUP flag adds the "target" mount into a group
of the "source" mount. The calling process has to have the CAP_SYS_ADMIN
capability in namespaces of these mounts. The source and the target
mounts have to have the same super block.
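
A minimal userspace sketch of the call described above. It assumes a
kernel carrying this patch; the MS_SET_GROUP value is copied from the
patch and is not defined by mainline headers, and set_mount_group() is
a hypothetical helper name used only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Assumption: value taken from this patch, absent from mainline headers. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1UL << 26)
#endif

/*
 * Hypothetical helper: ask the kernel to add the mount at `target` to the
 * group of the mount containing `source`. Returns 0 or a negative errno.
 * Empty or NULL paths are rejected here, before the syscall is made.
 */
static int set_mount_group(const char *source, const char *target)
{
	if (!source || !*source || !target || !*target)
		return -EINVAL;

	/* Filesystem type and data are unused for this command. */
	if (mount(source, target, NULL, MS_SET_GROUP, NULL) < 0)
		return -errno;

	return 0;
}
```

A caller would use it as set_mount_group("/path/in/shared/mount",
"/path/of/target/mount"); both paths here are placeholders. On a kernel
without this patch the flag does not have this meaning.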

This new functionality together with "mnt: Tuck mounts under others
instead of creating shadow/side mounts." allows CRIU to dump and restore
any set of mount namespaces.

Currently we have a lot of issues with dumping and restoring mount
namespaces. The biggest problem is that we can't construct mount trees
directly, for several reasons:
* groups can't be set, they can be only inherited
* file systems have to be mounted from the specified user namespaces
* the mount() syscall doesn't just create one mount -- the mount is
  also propagated to all members of a parent group
* umount() doesn't detach mounts from all members of a group
  (mounts with children are not umounted)
* mounts are propagated underneath existing mounts
* mount() doesn't allow making bind-mounts between two namespaces
* processes can have open file descriptors to overmounted files

All these operations are non-trivial, making the task of restoring
a mount namespace practically unsolvable in reasonable time. The
proposed change allows restoring a mount namespace in a direct
manner, without any overly complex logic.

Cc: Eric W. Biederman <ebiederm at xmission.com>
Cc: Alexander Viro <viro at zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin at openvz.org>

The patch has been sitting on LKML for a long time without much review:
https://patchwork.kernel.org/patch/9703885/

With it, however, we can implement correct mount restore in vzcriu much
more easily.

Add some restrictions: a) prohibit setting group on non-mnt_root dentry;
b) prohibit destination mount to be in non-current mntns; c) only super
or pseudosuper ve can set group.

https://jira.sw.ru/browse/PSBM-58617

Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

> Is it OK to have the flag's semantics overloaded?
1) do_mount is only called from syscall:

sys_osf_mount
   osf_ufs_mount
   osf_procfs_mount
   osf_cdfs_mount
     do_mount

compat_sys_mount
   do_mount

sys_mount
   do_mount

2) previously MS_SUBMOUNT was explicitly ignored in vz7 in do_mount
because it is a kernel-internal flag:

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
   MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);

In ms and vz8 it is a bit more complex, but the flag is still ignored,
because it is a kernel-internal flag and userspace can't set it.

If we add MS_SET_GROUP with the same number as MS_SUBMOUNT but only check
it in do_mount, where it was previously ignored, it looks OK to me.
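
The overloading argument can be illustrated with a small userspace
model. This is only a sketch: enum mount_cmd and pick_command() are
hypothetical names, and the function mirrors just the tail of the
dispatch visible in the path_mount() hunk of this patch (MS_MOVE, then
MS_SET_GROUP, then a new mount). The earlier remount, bind and
change-type checks are left out.

```c
#include <sys/mount.h> /* MS_MOVE */

/* Assumption: both values are copied from this patch's uapi header hunk. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1UL << 26) /* command, checked on the syscall path */
#endif
#ifndef MS_SUBMOUNT
#define MS_SUBMOUNT (1UL << 26)  /* kernel-internal sb flag, same bit */
#endif

/* Hypothetical labels for the handlers named in the diff. */
enum mount_cmd { CMD_MOVE, CMD_SET_GROUP, CMD_NEW_MOUNT };

/*
 * Model of the order of checks at the end of path_mount(): the shared
 * bit is read as a command only here, on the userspace entry path.
 */
static enum mount_cmd pick_command(unsigned long flags)
{
	if (flags & MS_MOVE)
		return CMD_MOVE;
	if (flags & MS_SET_GROUP)
		return CMD_SET_GROUP;
	return CMD_NEW_MOUNT;
}
```

Because the two macros share one bit, the command meaning is safe only
as long as userspace-supplied flags never reach code that treats the bit
as MS_SUBMOUNT, which is the point made above.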

+++
mnt: relax the restrictions of MS_SET_GROUP

At first glance it looked nice to check that the source path from
which we want to copy sharing is the root of its mount, to make the
interface more predictable. But it turned out to be a pain for external
mount restore and for ct root mount restore to look up the actual mount
path in the host mount namespace instead of just relying on a path to a
subdirectory on this mount, which is already given to us by the user.

For instance, when we do bind-mounts for these root and external mounts
we use a subdirectory as the source and it's ok.

Also, at first glance it looked nice to only allow setting sharing
for a mount in the current mntns. But this is also a pain for criu,
because we can have many mounts with the same shared_id and master_id
(from the same sharing group) in different mount namespaces, and in the
worst case we would need to do an extra setns for each mount, which is a
pure waste of resources. So let's allow copying sharing options even if
all namespaces are different (current mntns != source mntns !=
destination mntns). Note: mounts from an alien mntns can be accessed
through /proc/pid/fd/id.
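
As a sketch of the /proc/pid/fd/id access mentioned above: a path to a
mount in another mount namespace can be formed from a process that holds
an open descriptor on it. proc_fd_path() is a hypothetical helper, not
something taken from criu.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format "/proc/<pid>/fd/<fd>" into buf.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails.
 */
static int proc_fd_path(char *buf, size_t len, pid_t pid, int fd)
{
	int n = snprintf(buf, len, "/proc/%ld/fd/%d", (long)pid, fd);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

The resulting string could then be passed as the source or target path
of the MS_SET_GROUP mount() call, so no setns is needed to reach the
mount.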

https://jira.sw.ru/browse/PSBM-58617

note: applies both to vz7 and vz8.

mFixes: ("mnt: allow to add a mount into an existing group")

Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

(cherry picked from vz8 commit 5bbdf2e271b788470197439732ffd6983b7294c8)
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
 fs/namespace.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/mount.h |  6 +++++
 2 files changed, 64 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 954238c..1725434 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2705,6 +2705,62 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *path, const char *sibling_name)
+{
+	struct ve_struct *ve = get_exec_env();
+	struct mount *sibling, *mnt;
+	struct path sibling_path;
+	int err;
+
+	if (!ve_is_super(ve) && !ve->is_pseudosuper)
+		return -EPERM;
+
+	if (!sibling_name || !*sibling_name)
+		return -EINVAL;
+
+	if (path->dentry != path->mnt->mnt_root)
+		return -EINVAL;
+
+	err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
+	if (err)
+		return err;
+
+	err = -EINVAL;
+	sibling = real_mount(sibling_path.mnt);
+	mnt = real_mount(path->mnt);
+
+	namespace_lock();
+
+	err = -EPERM;
+	if (!sibling->mnt_ns ||
+	    !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out_unlock;
+
+	err = -EINVAL;
+	if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
+		goto out_unlock;
+
+	if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
+		goto out_unlock;
+
+	if (IS_MNT_SLAVE(sibling)) {
+		list_add(&mnt->mnt_slave, &sibling->mnt_slave);
+		mnt->mnt_master = sibling->mnt_master;
+	}
+
+	if (IS_MNT_SHARED(sibling)) {
+		mnt->mnt_group_id = sibling->mnt_group_id;
+		list_add(&mnt->mnt_share, &sibling->mnt_share);
+		set_mnt_shared(mnt);
+	}
+
+	err = 0;
+out_unlock:
+	namespace_unlock();
+	path_put(&sibling_path);
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3252,6 +3308,8 @@ int path_mount(const char *dev_name, struct path *path,
 		return do_change_type(path, flags);
 	if (flags & MS_MOVE)
 		return do_move_mount_old(path, dev_name);
+	if (flags & MS_SET_GROUP)
+		return do_set_group(path, dev_name);
 
 	return do_new_mount(path, type_page, sb_flags, mnt_flags, dev_name,
 			    data_page);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd7a166..9a63206 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -38,6 +38,12 @@
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
+/*
+ * Here are commands and flags. Commands are handled in do_mount()
+ * and can intersect with kernel internal flags.
+ */
+#define MS_SET_GROUP   (1<<26) /* Add a mount into a shared group */
+
 /* These sb flags are internal to the kernel */
 #define MS_SUBMOUNT     (1<<26)
 #define MS_NOREMOTELOCK	(1<<27)
-- 
1.8.3.1


