[Devel] [PATCH RHEL9 COMMIT] fsopen/devmnt/fs_context: Add vz devmnt feature support to fs_context

Mon Jun 13 13:27:42 MSK 2022

The commit is pushed to "branch-rh9-5.14.0-70.13.1.vz9.16.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh9-5.14.0-70.13.1.vz9.16.3
------>
commit 3d4ed583f88ad757a162ad278811712f19e33a9c
Author: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
Date:   Wed Jun 8 19:24:56 2022 +0300

    fsopen/devmnt/fs_context: Add vz devmnt feature support to fs_context
    
    Some background first. The devmnt feature lets the host control (by
    writing to /sys/fs/cgroup/ve/CTID/ve.mount_opts) which mount options
    are allowed inside a CT. It can also define hidden options that are
    automatically appended to the user-supplied mount options.
    
    It is used primarily to provide the balloon_ino option for ext4 and,
    since recently, xfs filesystems. A user may mount these filesystems
    from inside a VE and may omit the balloon_ino option; in that case
    devmnt appends it automatically on the kernel side.
    
    Some time ago a set of new syscalls was added:
    fsopen, fspick, fsconfig, fsmount, move_mount, open_tree [1].
    
    This new kernel API allows userspace to work with detached
    VFS trees and to attach them to a mount namespace
    (using the move_mount() syscall).
    Our current devmnt implementation hooks into fs/super.c and
    fs/namespace.c to cover the two paths a user may take: mounting
    and remounting.
    
    But the new fsopen subsystem brought major changes to the in-kernel
    filesystem API: filesystems may now define their own
    fc->ops->get_tree() callback, which has to set the root dentry
    (fc->root) of the future mount.
    
    If a filesystem is not converted to the new API, the default
    get_tree callback implementation (legacy_get_tree) is used.
    
     1. Before the RHEL9 5.14.0-70.13.1.el9_0 kernel, ext4 used
        legacy_get_tree, but it has now switched to the fs_context API.
     2. xfs uses its own ->get_tree implementation
        (built on the generic get_tree_bdev helper).
    
    When legacy_get_tree is used, the well-known mount_bdev()
    function is called (and we have a corresponding devmnt hook
    inside it). But if a filesystem has its own get_tree implementation,
    it uses the generic get_tree_bdev helper (in most cases).
    This helper does not contain our hooks.
    
    You may ask: why not just add our hook inside this function?
    That does not work. By the time get_tree_bdev() runs it is too late
    to check mount options, because all options have already been
    stored in fs-specific structures and we have no generic way to
    extract and check them.
    
    Another serious difficulty is that a user may mount a filesystem like this:
    
        fd = fsopen("ext4", FSOPEN_CLOEXEC);
        fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
        fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
        fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        mfd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
    
    The problem here is that the source block device arrives as an
    ordinary option, so a user may, for instance, specify all mount
    options first and provide the source block device path only at the
    end of the configuration stage.
    
    But the devmnt subsystem bases its checks on the block device major
    and minor numbers. This means we can't simply hook fsconfig() and
    check each option as it arrives. We have to postpone the check to a
    "final stage" at which the source blkdev is known.
    
    So what do we do?
     1. Modify the fsconfig() syscall implementation to catch the
        filesystems we are interested in (backed by a block device +
        !ve_is_super). For these, do not pass parameters through to the
        filesystem directly; instead save all user-specified parameters
        in a temporary buffer, fc->lazy_opts.
     2. At the FSCONFIG_CMD_CREATE stage, take the accumulated mount
        options string fc->lazy_opts and check all parameters using
        the ve_devmnt_process() function.
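
    As an illustration of step 1, here is a minimal user-space sketch of
    the accumulation into a page-sized buffer. It is not the kernel code:
    lazy_opts_append(), enum lazy_param_type and LAZY_BUF_SIZE are
    hypothetical names, and unlike the in-kernel helper it checks the
    whole "key=value," length before writing anything, so a failed append
    leaves the buffer untouched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LAZY_BUF_SIZE 4096 /* stands in for PAGE_SIZE (assumption) */

/* Hypothetical mirror of the two parameter kinds that can be serialized. */
enum lazy_param_type { LAZY_FLAG, LAZY_STRING };

/*
 * Append "key," or "key=value," to buf, which must already hold a
 * NUL-terminated string inside a LAZY_BUF_SIZE-byte buffer.
 * Returns 0 on success or -ENOMEM if the option does not fit;
 * on failure buf is left unchanged.
 */
int lazy_opts_append(char *buf, enum lazy_param_type type,
                     const char *key, const char *value)
{
    size_t used, need;

    assert(buf && key);

    used = strlen(buf);
    need = strlen(key) + 1;            /* key + ',' */
    if (type == LAZY_STRING) {
        assert(value);
        need += strlen(value) + 1;     /* '=' + value */
    }

    /* Keep one byte for the terminating NUL. */
    if (need > LAZY_BUF_SIZE - used - 1)
        return -ENOMEM;

    strcat(buf, key);
    if (type == LAZY_STRING) {
        strcat(buf, "=");
        strcat(buf, value);
    }
    strcat(buf, ",");

    return 0;
}
```

    The trailing comma is harmless for a strsep()-style parser that skips
    empty keys, which is how the accumulated string is consumed at the
    create stage in this patch.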
    
    Is this an ideal solution? No, for two reasons:
     1. fsconfig gives filesystem developers a way to accept non-standard
        forms of mount options. For instance, a filesystem may take a file
        descriptor, a path, or a binary blob. In such cases we can't simply
        concatenate the data into fc->lazy_opts. This approach works only if
        the filesystem uses classical parameters such as flags and strings.
        Fortunately, xfs and ext4 belong to this set of filesystems.
     2. The changes to the fsopen code are not simple.
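
    The first limitation can be sketched as a small filter. This is a
    user-space illustration with hypothetical names (enum param_kind,
    lazy_param_check()), not the kernel's own types: only flag and string
    parameters can be turned into "key" or "key=value" text, and anything
    else is refused, matching the -EPERM returned by the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kinds of value fsconfig() can carry. */
enum param_kind { PARAM_FLAG, PARAM_STRING, PARAM_BLOB, PARAM_PATH, PARAM_FD };

/* Only flags and strings have a natural "key" / "key=value" text form. */
static bool param_is_serializable(enum param_kind kind)
{
    return kind == PARAM_FLAG || kind == PARAM_STRING;
}

/* Returns 0 if the parameter can be deferred as text, -EPERM otherwise. */
int lazy_param_check(enum param_kind kind)
{
    return param_is_serializable(kind) ? 0 : -EPERM;
}
```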
    
    v2: improved init/free logic, fixed a small issue
    v3: small fixes. Thanks to Pavel Tikhomirov for review
    v4: added memcg accounting
    
    https://jira.sw.ru/browse/PSBM-133521
    
    [1] http://kernsec.org/pipermail/linux-security-module-archive/2019-February/011442.html
    
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
    Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    Cc: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    Feature: fs/mnt: hidden and allowed mount options in CT
---
 fs/fs_context.c            | 229 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fsopen.c                |  32 ++++++-
 include/linux/fs_context.h |  11 +++
 3 files changed, 271 insertions(+), 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 24ce12f0db32..e61da59c7ac0 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
 #include <linux/fs.h>
+#include <linux/blkdev.h>
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
 #include <linux/slab.h>
@@ -19,6 +20,7 @@
 #include <linux/mnt_namespace.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/ve.h>
 #include <net/net_namespace.h>
 #include <asm/sections.h>
 #include "mount.h"
@@ -160,6 +162,171 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param);
 
+#ifdef CONFIG_VE
+int __vfs_add_monolithic_fs_param(struct fs_context *fc,
+				  struct fs_parameter *param)
+{
+	char *opts = fc->lazy_opts;
+	size_t free_size;
+	size_t estimated_size = 0;
+
+	BUG_ON(!opts);
+
+	/* lazy_opts length + place for 0-byte */
+	free_size = PAGE_SIZE - (strlen(opts) + 1);
+
+	/* param->key length + sizeof(',') */
+	estimated_size += strlen(param->key) + 1;
+	if (free_size < estimated_size)
+		return -ENOMEM;
+
+	strcat(opts, param->key);
+
+	if (param->type == fs_value_is_string) {
+		/* param->string length + sizeof('=') */
+		estimated_size += strlen(param->string) + 1;
+		if (free_size < estimated_size)
+			return -ENOMEM;
+
+		strcat(opts, "=");
+		strcat(opts, param->string);
+	}
+
+	strcat(opts, ",");
+
+	return 0;
+}
+
+static inline int fscontext_lookup_bdev(struct fs_context *fc, dev_t *s_dev)
+{
+	dev_t bd_dev;
+	int ret;
+
+	/* can we reach fc->root->d_sb->s_dev ? */
+	if (fc->root && fc->root->d_sb) {
+		if (s_dev)
+			*s_dev = fc->root->d_sb->s_dev;
+
+		return 0;
+	}
+
+	if (fc->source) {
+		ret = lookup_bdev(fc->source, &bd_dev);
+		if (ret)
+			return ret;
+
+		if (s_dev)
+			*s_dev = bd_dev;
+
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
+static int fscontext_init_lazy_opts(struct fs_context *fc)
+{
+	struct ve_struct *ve = get_exec_env();
+	int fs_flags;
+	char *opts;
+
+	fc->lazy_opts = NULL;
+
+	if (!(fc->purpose == FS_CONTEXT_FOR_MOUNT ||
+	      fc->purpose == FS_CONTEXT_FOR_RECONFIGURE))
+		return 0;
+
+	if (ve_is_super(ve))
+		return 0;
+
+	/*
+	 * Currently, fc->fs_type is filled in fsopen(),
+	 * so this situation is impossible, but let's
+	 * be cautious as fs_context implementation may
+	 * be changed in the future.
+	 */
+	if (!fc->fs_type) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	fs_flags = fc->fs_type->fs_flags;
+
+	/*
+	 * If we already know block device then we can do devmnt checks,
+	 * if fs doesn't require block dev we have to skip it
+	 */
+	if (!(fs_flags & FS_REQUIRES_DEV) || !fscontext_lookup_bdev(fc, NULL))
+		return 0;
+
+	/* we are interested in filesystems which can be mounted from inside VE */
+	if (!(fs_flags & FS_VIRTUALIZED))
+		return 0;
+
+	fc->lazy_opts = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!fc->lazy_opts)
+		return -ENOMEM;
+
+	opts = fc->lazy_opts;
+
+	/* we will use fc->lazy_opts as a string */
+	*opts = 0;
+
+	return 0;
+}
+
+int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct ve_struct *ve = get_exec_env();
+
+	/* it means that we didn't turn on lazy mode */
+	if (!fc->lazy_opts)
+		goto non_lazy_way;
+
+	/*
+	 * source parameter should bypass lazy opts, as it
+	 * describes source bdev for fs
+	 */
+	if (!strcmp(param->key, "source"))
+		goto non_lazy_way;
+
+	if (!(param->type == fs_value_is_flag ||
+	      param->type == fs_value_is_string)) {
+		/*
+		 * In case when we can't serialize fs parameters into
+		 * a string it means that this fs uses blob parameters,
+		 * or fd parameters, or something similar. We can't use
+		 * devmnt to control such mounts.
+		 */
+		ve_pr_warn_ratelimited(VE0_LOG, "VE%s: can't do devmnt checks "
+			"for fstype %s (param key %s, param type %d)\n",
+			ve->ve_name,
+			fc->fs_type->name,
+			param->key,
+			param->type);
+
+		return -EPERM;
+	}
+
+	/*
+	 * If we get here, we want to collect all mount
+	 * parameters and then check them all at once when
+	 * fc->source is known.
+	 */
+	return __vfs_add_monolithic_fs_param(fc, param);
+
+non_lazy_way:
+	return vfs_parse_fs_param(fc, param);
+}
+EXPORT_SYMBOL(vfs_parse_fs_param_lazy);
+#else
+static inline int fscontext_init_lazy_opts(struct fs_context *fc)
+{
+	fc->lazy_opts = NULL;
+	return 0;
+}
+#endif
+
 /**
  * vfs_parse_fs_string - Convenience function to just parse a string.
  */
@@ -202,6 +369,39 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
 {
 	char *options = data, *key;
 	int ret = 0;
+#ifdef CONFIG_VE
+	void *options_orig = options;
+	void *options_after;
+	struct ve_struct *ve = get_exec_env();
+
+	if (!ve_is_super(ve) && (fc->fs_type->fs_flags & FS_REQUIRES_DEV)) {
+		dev_t bd_dev;
+
+		ret = fscontext_lookup_bdev(fc, &bd_dev);
+		if (ret) {
+			errorf(fc, "%s: Can't lookup blockdev", fc->source);
+			return -ENODEV;
+		}
+
+		ret = ve_devmnt_process(ve, bd_dev, (void **) &options,
+				fc->purpose == FS_CONTEXT_FOR_RECONFIGURE);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Very dangerous place.
+	 * 1. ve_devmnt_process may alloc new page and write address to options
+	 *    variable.
+	 * 2. we have to detect such case and call free_page() if so, but
+	 * 3. we can free page after all options processing, but during that
+	 *    strsep() function modifies options variable too. Ugh.
+	 *
+	 * Let's save page address to options_after and use it to detect new
+	 * page allocation.
+	 */
+	options_after = options;
+#endif
 
 	if (!options)
 		return 0;
@@ -227,6 +427,11 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
 		}
 	}
 
+#ifdef CONFIG_VE
+	if (options_after != options_orig)
+		free_page((unsigned long)options_after);
+#endif
+
 	return ret;
 }
 EXPORT_SYMBOL(generic_parse_monolithic);
@@ -290,7 +495,13 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
 	ret = init_fs_context(fc);
 	if (ret < 0)
 		goto err_fc;
+
 	fc->need_free = true;
+
+	ret = fscontext_init_lazy_opts(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	return fc;
 
 err_fc:
@@ -366,6 +577,10 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	if (ret < 0)
 		goto err_fc;
 
+	ret = fscontext_init_lazy_opts(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	ret = security_fs_context_dup(fc, src_fc);
 	if (ret < 0)
 		goto err_fc;
@@ -474,6 +689,8 @@ void put_fs_context(struct fs_context *fc)
 	put_cred(fc->cred);
 	put_fc_log(fc);
 	put_filesystem(fc->fs_type);
+	if (fc->lazy_opts)
+		free_page((unsigned long)fc->lazy_opts);
 	kfree(fc->source);
 	kfree(fc);
 }
@@ -689,6 +906,10 @@ void vfs_clean_context(struct fs_context *fc)
 	fc->s_fs_info = NULL;
 	fc->sb_flags = 0;
 	security_free_mnt_opts(&fc->security);
+	if (fc->lazy_opts) {
+		free_page((unsigned long)fc->lazy_opts);
+		fc->lazy_opts = NULL;
+	}
 	kfree(fc->source);
 	fc->source = NULL;
 
@@ -711,7 +932,15 @@ int finish_clean_context(struct fs_context *fc)
 		fc->phase = FS_CONTEXT_FAILED;
 		return error;
 	}
+
 	fc->need_free = true;
+
+	error = fscontext_init_lazy_opts(fc);
+	if (unlikely(error)) {
+		fc->phase = FS_CONTEXT_FAILED;
+		return error;
+	}
+
 	fc->phase = FS_CONTEXT_RECONF_PARAMS;
 	return 0;
 }
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 27a890aa493a..a0a3b1eb6fcf 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -209,6 +209,26 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
 	return ret;
 }
 
+static int fscontext_finalize_lazy_opts(struct fs_context *fc)
+{
+#ifdef CONFIG_VE
+	if (fc->lazy_opts) {
+		int ret;
+
+		/*
+		 * Now we have fc->source set and can do all delayed
+		 * devmnt checks. We need just to call
+		 * generic_parse_monolithic on our fc->lazy_opts.
+		 */
+		ret = generic_parse_monolithic(fc, fc->lazy_opts);
+		if (ret)
+			return ret;
+	}
+#endif
+
+	return 0;
+}
+
 /*
  * Check the state and apply the configuration.  Note that this function is
  * allowed to 'steal' the value by setting param->xxx to NULL before returning.
@@ -228,6 +248,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 			return -EBUSY;
 		if (!mount_capable(fc))
 			return -EPERM;
+
+		ret = fscontext_finalize_lazy_opts(fc);
+		if (ret)
+			return ret;
+
 		fc->phase = FS_CONTEXT_CREATING;
 		ret = vfs_get_tree(fc);
 		if (ret)
@@ -250,6 +275,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 			ret = -EPERM;
 			break;
 		}
+
+		ret = fscontext_finalize_lazy_opts(fc);
+		if (ret)
+			return ret;
+
 		down_write(&sb->s_umount);
 		ret = reconfigure_super(fc);
 		up_write(&sb->s_umount);
@@ -262,7 +292,7 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
 			return -EBUSY;
 
-		return vfs_parse_fs_param(fc, param);
+		return vfs_parse_fs_param_lazy(fc, param);
 	}
 	fc->phase = FS_CONTEXT_FAILED;
 	return ret;
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 6b54982fc5f3..c937782b3004 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -92,6 +92,7 @@ struct fs_context {
 	struct mutex		uapi_mutex;	/* Userspace access mutex */
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
+	void			*lazy_opts;	/* mount options which can't be checked at fsconfig() time */
 	void			*sget_key;
 	struct dentry		*root;		/* The root and superblock */
 	struct user_namespace	*user_ns;	/* The user namespace for this mount */
@@ -134,6 +135,16 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
 
 extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
 extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
+#ifdef CONFIG_VE
+extern int vfs_parse_fs_param_lazy(struct fs_context *fc,
+				   struct fs_parameter *param);
+#else
+static inline int vfs_parse_fs_param_lazy(struct fs_context *fc,
+					  struct fs_parameter *param)
+{
+	return vfs_parse_fs_param(fc, param);
+}
+#endif
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
 extern int generic_parse_monolithic(struct fs_context *fc, void *data);