[Devel] [PATCH RHEL7 COMMIT] mnt: allow to add a mount into an existing group
Konstantin Khorenko
khorenko at virtuozzo.com
Fri May 8 17:05:34 MSK 2020
The commit is pushed to "branch-rh7-3.10.0-1127.vz7.160.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.vz7.150.9
------>
commit a85d8068227eaf473e58a2161e7dd449239b7c24
Author: Andrei Vagin <avagin at openvz.org>
Date: Fri May 8 17:05:34 2020 +0300
mnt: allow to add a mount into an existing group
Currently a shared group can only be inherited from a source mount.
This patch adds the ability to add a mount to an existing shared
group.
mount(source, target, NULL, MS_SET_GROUP, NULL)
mount() with the MS_SET_GROUP flag adds the "target" mount to the group
of the "source" mount. The calling process must have the CAP_SYS_ADMIN
capability in the namespaces of these mounts. The source and the target
mounts must share the same super block.
This new functionality together with "mnt: Tuck mounts under others
instead of creating shadow/side mounts." allows CRIU to dump and restore
any set of mount namespaces.
Currently we have a lot of issues with dumping and restoring mount
namespaces. The biggest problem is that we can't construct mount trees
directly, for several reasons:
* groups can't be set, they can only be inherited
* file systems have to be mounted from the specified user namespaces
* the mount() syscall doesn't just create one mount -- the mount is
also propagated to all members of a parent group
* umount() doesn't detach mounts from all members of a group
(mounts with children are not umounted)
* mounts are propagated underneath existing mounts
* mount() doesn't allow making bind-mounts between two namespaces
* processes can have open file descriptors to overmounted files
All these operations are non-trivial, making the task of restoring
a mount namespace practically unsolvable in reasonable time. The
proposed change makes it possible to restore a mount namespace in a
direct manner, without any overly complex logic.
Cc: Eric W. Biederman <ebiederm at xmission.com>
Cc: Alexander Viro <viro at zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin at openvz.org>
The patch has been sitting on LKML for a long time without much review:
https://patchwork.kernel.org/patch/9703885/
But with it we can implement correct mount restore in vzcriu much
more easily.
Add some restrictions: a) prohibit setting a group on a non-mnt_root
dentry; b) prohibit a destination mount outside the current mntns;
c) allow only a super or pseudosuper ve to set a group.
https://jira.sw.ru/browse/PSBM-58617
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> Is it OK to have the flag's semantics overloaded?
1) do_mount is only called from syscalls:
  sys_osf_mount
    osf_ufs_mount
    osf_procfs_mount
    osf_cdfs_mount
      do_mount
  compat_sys_mount
    do_mount
  sys_mount
    do_mount
2) previously MS_SUBMOUNT was explicitly ignored in vz7 in do_mount
because it is a kernel-internal flag:
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
In ms and vz8 it is a bit more complex, but the flag is still ignored
because it is a kernel-internal flag and userspace can't set it.
If we add MS_SET_GROUP with the same number as MS_SUBMOUNT but only check
it in do_mount, where it was previously ignored, it looks OK to me.
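The reasoning above can be sketched as a small standalone model: the command bits are captured before the kernel-internal bits are masked off, so a bit shared with MS_SUBMOUNT can still select a command. This is a simplified illustration, not the kernel code; pick_command() and the enum are hypothetical names, and only two commands are modelled.

```c
/* Flag values as used by the kernel headers and this patch. */
#define MS_MOVE      (1UL << 13)
#define MS_SET_GROUP (1UL << 26)	/* command added by this patch */
#define MS_SUBMOUNT  (1UL << 26)	/* kernel-internal, same bit */

/* Hypothetical result type for the model below. */
enum mount_cmd { CMD_NEW_MOUNT, CMD_MOVE, CMD_SET_GROUP };

/*
 * Hypothetical model of the dispatch order in do_mount(): remember the
 * command bits first, then strip the internal flag from what is passed on.
 */
static enum mount_cmd pick_command(unsigned long flags,
				   unsigned long *remaining)
{
	/* Captured before masking, so the shared bit is still visible. */
	unsigned long cmd = flags & (MS_MOVE | MS_SET_GROUP);

	/* Internal flags never reach the lower layers. */
	flags &= ~MS_SUBMOUNT;
	*remaining = flags;

	/* Earlier commands win, mirroring the else-if chain. */
	if (cmd & MS_MOVE)
		return CMD_MOVE;
	if (cmd & MS_SET_GROUP)
		return CMD_SET_GROUP;
	return CMD_NEW_MOUNT;
}
```

In this model a call with only the shared bit set selects CMD_SET_GROUP while the remaining flags come out as zero, which matches the idea that the overloaded bit is visible to the dispatcher and nowhere else.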
---
fs/namespace.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++---
include/uapi/linux/fs.h | 6 ++++
2 files changed, 80 insertions(+), 4 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 306d6b61ffdd5..46cda75f0b99a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2684,6 +2684,69 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
}
+static int do_set_group(struct path *path, const char *sibling_name)
+{
+ struct ve_struct *ve = get_exec_env();
+ struct mount *sibling, *mnt;
+ struct path sibling_path;
+ int err;
+
+ if (!ve_is_super(ve) && !ve->is_pseudosuper)
+ return -EPERM;
+
+ if (!sibling_name || !*sibling_name)
+ return -EINVAL;
+
+ if (path->dentry != path->mnt->mnt_root)
+ return -EINVAL;
+
+ err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
+ if (err)
+ return err;
+
+ err = -EINVAL;
+ if (sibling_path.dentry != sibling_path.mnt->mnt_root)
+ goto out_put;
+
+ sibling = real_mount(sibling_path.mnt);
+ mnt = real_mount(path->mnt);
+
+ if (!check_mnt(mnt))
+ goto out_put;
+
+ namespace_lock();
+
+ err = -EPERM;
+ if (!sibling->mnt_ns ||
+ !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ goto out_unlock;
+
+ err = -EINVAL;
+ if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
+ goto out_unlock;
+
+ if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
+ goto out_unlock;
+
+ if (IS_MNT_SLAVE(sibling)) {
+ list_add(&mnt->mnt_slave, &sibling->mnt_slave);
+ mnt->mnt_master = sibling->mnt_master;
+ }
+
+ if (IS_MNT_SHARED(sibling)) {
+ mnt->mnt_group_id = sibling->mnt_group_id;
+ list_add(&mnt->mnt_share, &sibling->mnt_share);
+ set_mnt_shared(mnt);
+ }
+
+ err = 0;
+out_unlock:
+ namespace_unlock();
+out_put:
+ path_put(&sibling_path);
+ return err;
+}
+
static int do_move_mount(struct path *path, const char *old_name)
{
struct path old_path, parent_path;
@@ -3098,6 +3161,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
struct path path;
int retval = 0;
int mnt_flags = 0;
+ unsigned long cmd;
/* Discard magic */
if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
@@ -3149,19 +3213,25 @@ long do_mount(const char *dev_name, const char __user *dir_name,
mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
}
+ cmd = flags & (MS_REMOUNT | MS_BIND |
+ MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE |
+ MS_MOVE | MS_SET_GROUP);
+
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
- if (flags & MS_REMOUNT)
+ if (cmd & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
data_page);
- else if (flags & MS_BIND)
+ else if (cmd & MS_BIND)
retval = do_loopback(&path, dev_name, flags & MS_REC);
- else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
+ else if (cmd & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&path, flags);
- else if (flags & MS_MOVE)
+ else if (cmd & MS_MOVE)
retval = do_move_mount(&path, dev_name);
+ else if (cmd & MS_SET_GROUP)
+ retval = do_set_group(&path, dev_name);
else
retval = do_new_mount(&path, type_page, flags, mnt_flags,
dev_name, data_page);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a4ca54304220f..45e2752a85c9c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -99,6 +99,12 @@ struct inodes_stat_t {
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
+/*
+ * Here are commands and flags. Commands are handled in do_mount()
+ * and can intersect with kernel internal flags.
+ */
+#define MS_SET_GROUP (1<<26) /* Add a mount into a shared group */
+
/* These sb flags are internal to the kernel */
#define MS_SUBMOUNT (1<<26)
#define MS_NOREMOTELOCK (1<<27)