[Devel] [PATCH RHEL7 COMMIT] ms/fs: add mount_setattr()

Konstantin Khorenko khorenko at virtuozzo.com
Thu Apr 20 21:00:50 MSK 2023


The commit is pushed to "branch-rh7-3.10.0-1160.88.1.vz7.195.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.88.1.vz7.195.2
------>
commit 8963de1e27323792a9a3c801d8151c326f3f0b59
Author: Christian Brauner <christian.brauner at ubuntu.com>
Date:   Thu Apr 13 18:47:23 2023 +0800

    ms/fs: add mount_setattr()
    
    This implements the missing mount_setattr() syscall. While the new mount
    API allows changing the properties of a superblock, there is currently
    no way to change the properties of a mount or a mount tree using file
    descriptors, which the new mount API is based on. In addition, the old
    mount API has the restriction that mount options cannot be applied
    recursively. This hasn't changed since changing mount options on a
    per-mount basis was implemented in [1], and it has been a frequent
    request not just for convenience but also for security reasons. The
    legacy mount syscall is unable to accommodate this behavior without
    introducing a whole new set of flags because MS_REC | MS_REMOUNT |
    MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only applies the mount options
    to the topmost mount. Changing MS_REC to apply to the whole mount tree
    would mean introducing a significant uapi change and would likely cause
    significant regressions.
    
    The new mount_setattr() syscall makes it possible to recursively clear
    and set mount options in one shot. Repeated calls requesting the same
    changes are idempotent:
    
    int mount_setattr(int dfd, const char *path, unsigned flags,
                      struct mount_attr *uattr, size_t usize);
    
    Flags to modify path resolution behavior are specified in the @flags
    argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
    and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
    restrict path resolution as introduced with openat2() might be supported
    in the future.
    
    The mount_setattr() syscall can be expected to grow over time and is
    designed with extensibility in mind. It follows the extensible syscall
    pattern we have used with other syscalls such as openat2(), clone3(),
    sched_{set,get}attr(), and others.
    
    The set of mount options is passed in the uapi struct mount_attr which
    currently has the following layout:
    
    struct mount_attr {
            __u64 attr_set;
            __u64 attr_clr;
            __u64 propagation;
            __u64 userns_fd;
    };
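
The size handling behind this extensible layout can be modeled in plain C. The sketch below is a hypothetical userspace approximation of the checks the syscall entry point performs around copy_struct_from_user(); all ex_* names are invented for illustration, and 4096 merely stands in for PAGE_SIZE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_ATTR_SIZE_VER0 32u   /* sizeof the first published struct */
#define EX_MAX_SIZE       4096u /* assumption: stands in for PAGE_SIZE */

/* Local copy of the layout shown in the commit message. */
struct ex_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/*
 * Hypothetical model of the size checks: returns 0 and fills *dst,
 * or a negative errno value.
 */
static int ex_copy_attr(struct ex_mount_attr *dst, const void *src,
			size_t usize)
{
	const unsigned char *p = src;
	size_t i;

	/* Mirrors the BUILD_BUG_ON() in the syscall, checked at run time. */
	assert(sizeof(*dst) == EX_ATTR_SIZE_VER0);

	if (usize > EX_MAX_SIZE)
		return -E2BIG;
	if (usize < EX_ATTR_SIZE_VER0)
		return -EINVAL;

	/* A larger (newer) struct is accepted only if its unknown tail is zero. */
	for (i = sizeof(*dst); i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	/* Copy the part we understand; anything not supplied stays zero. */
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, usize < sizeof(*dst) ? usize : sizeof(*dst));
	return 0;
}
```

With this scheme an older binary keeps working against a newer kernel, and a newer binary works against an older kernel as long as it leaves the fields that kernel does not know about set to zero.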
    
    The @attr_set and @attr_clr members are used to set and clear mount
    options, respectively. This way a user can, for example, request that
    one set of flags be raised, such as turning mounts read-only by raising
    MOUNT_ATTR_RDONLY in @attr_set, while at the same time requesting that
    another set of flags be lowered, such as removing noexec from a mount
    tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr.
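
A minimal userspace sketch of such a request follows. It assumes no libc wrapper is available and goes through syscall(2) with the x86 number 442 that this patch adds to the syscall tables; the ex_* names and the local struct copy are illustrative, not part of the patch. According to the code in this patch, the path must be the root of a mount and the caller must pass may_mount().

```c
#define _GNU_SOURCE
#include <fcntl.h>   /* AT_FDCWD */
#include <stdint.h>
#include <unistd.h>  /* syscall() */

#define EX_NR_MOUNT_SETATTR 442        /* x86 number added by this patch */
#define EX_AT_RECURSIVE     0x8000     /* value of AT_RECURSIVE */
#define EX_ATTR_RDONLY      0x00000001 /* MOUNT_ATTR_RDONLY */
#define EX_ATTR_NOEXEC      0x00000008 /* MOUNT_ATTR_NOEXEC */

/* Local copy of struct mount_attr, renamed to avoid header clashes. */
struct ex_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/* Build a request: raise read-only, lower noexec, leave the rest alone. */
static struct ex_mount_attr ex_ro_exec_request(void)
{
	struct ex_mount_attr attr = {
		.attr_set = EX_ATTR_RDONLY,
		.attr_clr = EX_ATTR_NOEXEC,
	};
	return attr;
}

/* Apply the request to the mount at @path and every mount below it. */
static long ex_apply_recursive(const char *path, struct ex_mount_attr *attr)
{
	return syscall(EX_NR_MOUNT_SETATTR, AT_FDCWD, path,
		       EX_AT_RECURSIVE, attr, sizeof(*attr));
}
```

A caller would build the request once with ex_ro_exec_request() and pass it to ex_apply_recursive() with a mount point path; on failure syscall() returns -1 and sets errno.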
    
    Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
    not a bitmap, users wanting to transition to a different atime setting
    cannot simply specify the atime setting in @attr_set, but must also
    specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
    MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
    can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
    @attr_clr.
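
The atime rule can be restated as a small pure function. This is a sketch that mirrors only the validation part of build_mount_kattr() from this patch; the ex_* names are made up for the example and the constants are copied from the MOUNT_ATTR_* definitions in the diff.

```c
#include <errno.h>
#include <stdint.h>

#define EX_ATTR__ATIME      0x00000070 /* mask covering the atime enum */
#define EX_ATTR_RELATIME    0x00000000
#define EX_ATTR_NOATIME     0x00000010
#define EX_ATTR_STRICTATIME 0x00000020

/* Returns 0 if the atime part of the request is acceptable, else -EINVAL. */
static int ex_check_atime(uint64_t attr_set, uint64_t attr_clr)
{
	if (attr_clr & EX_ATTR__ATIME) {
		/* The whole atime field must be cleared, never part of it. */
		if ((attr_clr & EX_ATTR__ATIME) != EX_ATTR__ATIME)
			return -EINVAL;

		/* The new value must be exactly one of the enum members. */
		switch (attr_set & EX_ATTR__ATIME) {
		case EX_ATTR_RELATIME:
		case EX_ATTR_NOATIME:
		case EX_ATTR_STRICTATIME:
			return 0;
		default:
			return -EINVAL;
		}
	}

	/* Without clearing the field first, no atime bits may be set. */
	return (attr_set & EX_ATTR__ATIME) ? -EINVAL : 0;
}
```

For instance, switching to noatime requires attr_set = MOUNT_ATTR_NOATIME together with attr_clr = MOUNT_ATTR__ATIME; passing MOUNT_ATTR_NOATIME in attr_set alone is rejected.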
    
    The @propagation field lets callers specify the propagation type of a
    mount tree. Propagation is a single property that has four different
    settings and as such is not really a flag argument but an enum.
    Specifically, it would be unclear what setting and clearing propagation
    settings in combination would amount to. The legacy mount() syscall thus
    forbids the combination of multiple propagation settings too. The goal
    is to keep the semantics of mount propagation somewhat simple as they
    are overly complex as it is.
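
The "enum, not flags" rule for @propagation amounts to allowing at most one of the four propagation bits. The sketch below models that check in userspace; the ex_* names are hypothetical, the bit values are those of the MS_* propagation flags, and the single-bit test uses x & (x - 1) in place of the kernel's hweight32().

```c
#include <errno.h>
#include <stdint.h>

#define EX_MS_UNBINDABLE (1u << 17)
#define EX_MS_PRIVATE    (1u << 18)
#define EX_MS_SLAVE      (1u << 19)
#define EX_MS_SHARED     (1u << 20)
#define EX_PROPAGATION_FLAGS \
	(EX_MS_UNBINDABLE | EX_MS_PRIVATE | EX_MS_SLAVE | EX_MS_SHARED)

/* Returns 0 if @propagation is zero or exactly one known type, else -EINVAL. */
static int ex_check_propagation(uint64_t propagation)
{
	/* Reject bits that are not propagation types at all. */
	if (propagation & ~(uint64_t)EX_PROPAGATION_FLAGS)
		return -EINVAL;

	/* x & (x - 1) clears the lowest set bit; nonzero means more than one. */
	if (propagation & (propagation - 1))
		return -EINVAL;

	return 0;
}
```

A value of zero passes the check and means "leave propagation unchanged", which matches how the patch only takes namespace_lock() when @propagation is nonzero.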
    
    The @userns_fd field lets the user specify a user namespace whose
    idmapping becomes the idmapping of the mount. This is implemented and
    explained in detail in the next patch.
    
    [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
    
    Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
    Cc: David Howells <dhowells at redhat.com>
    Cc: Aleksa Sarai <cyphar at cyphar.com>
    Cc: Al Viro <viro at zeniv.linux.org.uk>
    Cc: linux-fsdevel at vger.kernel.org
    Cc: linux-api at vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch at lst.de>
    Signed-off-by: Christian Brauner <christian.brauner at ubuntu.com>
    
    Changes: port syscall for x86 only, drop uapi hunks and ignore
    MNT_NOSYMFOLLOW as it is not yet supported
    (cherry picked from commit 2a1867219c7b27f928e2545782b86daaf9ad50bd)
    https://jira.sw.ru/browse/PSBM-144416
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    =================
    Patchset description:
    mount: Port move_mount_set_group and mount_setattr
    
    We need this because Virtuozzo criu, after its rebase onto mainstream
    criu in u20, will switch to this new API for setting sharing groups
    across mounts.
    
    https://jira.vzint.dev/browse/PSBM-144416
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c                   | 312 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 314 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 978f07cb0ea1..7978137648ed 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -373,6 +373,7 @@
 
 428	i386	open_tree		sys_open_tree
 429	i386	move_mount		sys_move_mount
+442	i386	mount_setattr		sys_mount_setattr
 
 510	i386	getluid			sys_getluid
 511	i386	setluid			sys_setluid
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 3c86abac9a50..7175e94309fd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -338,6 +338,7 @@
 
 428	common	open_tree		sys_open_tree
 429	common	move_mount		sys_move_mount
+442	common	mount_setattr		sys_mount_setattr
 
 500	64	getluid			sys_getluid
 501	64	setluid			sys_setluid
diff --git a/fs/namespace.c b/fs/namespace.c
index a40a217f9871..f37cae055dbf 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4094,6 +4094,318 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	return error;
 }
 
+/*
+ * mount_setattr()
+ */
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u64 propagation;
+	__u64 userns_fd;
+};
+
+struct mount_kattr {
+	unsigned int attr_set;
+	unsigned int attr_clr;
+	unsigned int propagation;
+	unsigned int lookup_flags;
+	bool recurse;
+};
+
+/* List of all mount_attr versions. */
+#define MOUNT_ATTR_SIZE_VER0   32 /* sizeof first published struct */
+
+/*
+ * Mount attributes.
+ */
+#define MOUNT_ATTR_RDONLY       0x00000001 /* Mount read-only */
+#define MOUNT_ATTR_NOSUID       0x00000002 /* Ignore suid and sgid bits */
+#define MOUNT_ATTR_NODEV        0x00000004 /* Disallow access to device special files */
+#define MOUNT_ATTR_NOEXEC       0x00000008 /* Disallow program execution */
+#define MOUNT_ATTR__ATIME       0x00000070 /* Setting on how atime should be updated */
+#define MOUNT_ATTR_RELATIME     0x00000000 /* - Update atime relative to mtime/ctime. */
+#define MOUNT_ATTR_NOATIME      0x00000010 /* - Do not update access times. */
+#define MOUNT_ATTR_STRICTATIME  0x00000020 /* - Always perform atime updates */
+#define MOUNT_ATTR_NODIRATIME   0x00000080 /* Do not update directory access times */
+#define MOUNT_ATTR_IDMAP        0x00100000 /* Idmap mount to @userns_fd in struct mount_attr. */
+#define MOUNT_ATTR_NOSYMFOLLOW  0x00200000 /* Do not follow symlinks */
+
+#define MOUNT_SETATTR_VALID_FLAGS \
+(MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | \
+ MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME | MOUNT_ATTR_NODIRATIME | \
+ MOUNT_ATTR_NOSYMFOLLOW)
+
+#define MOUNT_SETATTR_PROPAGATION_FLAGS \
+(MS_UNBINDABLE | MS_PRIVATE | MS_SLAVE | MS_SHARED)
+
+static unsigned int attr_flags_to_mnt_flags(u64 attr_flags)
+{
+	unsigned int mnt_flags = 0;
+
+	if (attr_flags & MOUNT_ATTR_RDONLY)
+		mnt_flags |= MNT_READONLY;
+	if (attr_flags & MOUNT_ATTR_NOSUID)
+		mnt_flags |= MNT_NOSUID;
+	if (attr_flags & MOUNT_ATTR_NODEV)
+		mnt_flags |= MNT_NODEV;
+	if (attr_flags & MOUNT_ATTR_NOEXEC)
+		mnt_flags |= MNT_NOEXEC;
+	if (attr_flags & MOUNT_ATTR_NODIRATIME)
+		mnt_flags |= MNT_NODIRATIME;
+
+	return mnt_flags;
+}
+
+static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt)
+{
+	unsigned int flags = mnt->mnt.mnt_flags;
+
+	/*  flags to clear */
+	flags &= ~kattr->attr_clr;
+	/* flags to raise */
+	flags |= kattr->attr_set;
+
+	return flags;
+}
+
+static struct mount *mount_setattr_prepare(struct mount_kattr *kattr,
+					   struct mount *mnt, int *err)
+{
+	struct mount *m = mnt, *last = NULL;
+
+	if (!is_mounted(&m->mnt)) {
+		*err = -EINVAL;
+		goto out;
+	}
+
+	if (!(mnt_has_parent(m) ? check_mnt(m) : is_anon_ns(m->mnt_ns))) {
+		*err = -EINVAL;
+		goto out;
+	}
+
+	do {
+		unsigned int flags;
+
+		flags = recalc_flags(kattr, m);
+		if (!can_change_locked_flags(m, flags)) {
+			*err = -EPERM;
+			goto out;
+		}
+
+		last = m;
+
+		if ((kattr->attr_set & MNT_READONLY) &&
+		    !(m->mnt.mnt_flags & MNT_READONLY)) {
+			*err = mnt_hold_writers(m);
+			if (*err)
+				goto out;
+		}
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+out:
+	return last;
+}
+
+static void mount_setattr_commit(struct mount_kattr *kattr,
+				 struct mount *mnt, struct mount *last,
+				 int err)
+{
+	struct mount *m = mnt;
+
+	do {
+		if (!err) {
+			unsigned int flags;
+
+			flags = recalc_flags(kattr, m);
+			WRITE_ONCE(m->mnt.mnt_flags, flags);
+		}
+
+		/*
+		 * We either set MNT_READONLY above so make it visible
+		 * before ~MNT_WRITE_HOLD or we failed to recursively
+		 * apply mount options.
+		 */
+		if ((kattr->attr_set & MNT_READONLY) &&
+		    (m->mnt.mnt_flags & MNT_WRITE_HOLD))
+			mnt_unhold_writers(m);
+
+		if (!err && kattr->propagation)
+			change_mnt_propagation(m, kattr->propagation);
+
+		/*
+		 * On failure, only cleanup until we found the first mount
+		 * we failed to handle.
+		 */
+		if (err && m == last)
+			break;
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+	if (!err)
+		touch_mnt_namespace(mnt->mnt_ns);
+}
+
+static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
+{
+	struct mount *mnt = real_mount(path->mnt), *last = NULL;
+	int err = 0;
+
+	if (path->dentry != mnt->mnt.mnt_root)
+		return -EINVAL;
+
+	if (kattr->propagation) {
+		/*
+		 * Only take namespace_lock() if we're actually changing
+		 * propagation.
+		 */
+		namespace_lock();
+		if (kattr->propagation == MS_SHARED) {
+			err = invent_group_ids(mnt, kattr->recurse);
+			if (err) {
+				namespace_unlock();
+				return err;
+			}
+		}
+	}
+
+	lock_mount_hash();
+
+	/*
+	 * Get the mount tree in a shape where we can change mount
+	 * properties without failure.
+	 */
+	last = mount_setattr_prepare(kattr, mnt, &err);
+	if (last) /* Commit all changes or revert to the old state. */
+		mount_setattr_commit(kattr, mnt, last, err);
+
+	unlock_mount_hash();
+
+	if (kattr->propagation) {
+		namespace_unlock();
+		if (err)
+			cleanup_group_ids(mnt, NULL);
+	}
+
+	return err;
+}
+
+static int build_mount_kattr(const struct mount_attr *attr,
+			     struct mount_kattr *kattr, unsigned int flags)
+{
+	unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	*kattr = (struct mount_kattr) {
+		.lookup_flags	= lookup_flags,
+		.recurse	= !!(flags & AT_RECURSIVE),
+	};
+
+	if (attr->propagation & ~MOUNT_SETATTR_PROPAGATION_FLAGS)
+		return -EINVAL;
+	if (hweight32(attr->propagation & MOUNT_SETATTR_PROPAGATION_FLAGS) > 1)
+		return -EINVAL;
+	kattr->propagation = attr->propagation;
+
+	if ((attr->attr_set | attr->attr_clr) & ~MOUNT_SETATTR_VALID_FLAGS)
+		return -EINVAL;
+
+	if (attr->userns_fd)
+		return -EINVAL;
+
+	kattr->attr_set = attr_flags_to_mnt_flags(attr->attr_set);
+	kattr->attr_clr = attr_flags_to_mnt_flags(attr->attr_clr);
+
+	/*
+	 * Since the MOUNT_ATTR_<atime> values are an enum, not a bitmap,
+	 * users wanting to transition to a different atime setting cannot
+	 * simply specify the atime setting in @attr_set, but must also
+	 * specify MOUNT_ATTR__ATIME in the @attr_clr field.
+	 * So ensure that MOUNT_ATTR__ATIME can't be partially set in
+	 * @attr_clr and that @attr_set can't have any atime bits set if
+	 * MOUNT_ATTR__ATIME isn't set in @attr_clr.
+	 */
+	if (attr->attr_clr & MOUNT_ATTR__ATIME) {
+		if ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME)
+			return -EINVAL;
+
+		/*
+		 * Clear all previous time settings as they are mutually
+		 * exclusive.
+		 */
+		kattr->attr_clr |= MNT_RELATIME | MNT_NOATIME;
+		switch (attr->attr_set & MOUNT_ATTR__ATIME) {
+		case MOUNT_ATTR_RELATIME:
+			kattr->attr_set |= MNT_RELATIME;
+			break;
+		case MOUNT_ATTR_NOATIME:
+			kattr->attr_set |= MNT_NOATIME;
+			break;
+		case MOUNT_ATTR_STRICTATIME:
+			break;
+		default:
+			return -EINVAL;
+		}
+	} else {
+		if (attr->attr_set & MOUNT_ATTR__ATIME)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path,
+		unsigned int, flags, struct mount_attr __user *, uattr,
+		size_t, usize)
+{
+	int err;
+	struct path target;
+	struct mount_attr attr;
+	struct mount_kattr kattr;
+
+	BUILD_BUG_ON(sizeof(struct mount_attr) != MOUNT_ATTR_SIZE_VER0);
+
+	if (flags & ~(AT_EMPTY_PATH |
+		      AT_RECURSIVE |
+		      AT_SYMLINK_NOFOLLOW |
+		      AT_NO_AUTOMOUNT))
+		return -EINVAL;
+
+	if (unlikely(usize > PAGE_SIZE))
+		return -E2BIG;
+	if (unlikely(usize < MOUNT_ATTR_SIZE_VER0))
+		return -EINVAL;
+
+	if (!may_mount())
+		return -EPERM;
+
+	err = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
+	if (err)
+		return err;
+
+	/* Don't bother walking through the mounts if this is a nop. */
+	if (attr.attr_set == 0 &&
+	    attr.attr_clr == 0 &&
+	    attr.propagation == 0)
+		return 0;
+
+	err = build_mount_kattr(&attr, &kattr, flags);
+	if (err)
+		return err;
+
+	err = user_path_at(dfd, path, kattr.lookup_flags, &target);
+	if (err)
+		return err;
+
+	err = do_mount_setattr(&target, &kattr);
+	path_put(&target);
+	return err;
+}
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;

