[Devel] [PATCH VZ9 v4] fsopen, devmnt, fs_context: add vz devmnt feature support to fs_context

Alexander Mikhalitsyn alexander.mikhalitsyn at virtuozzo.com
Wed Jun 8 19:24:56 MSK 2022


Let's start with some background. We have the devmnt feature, which allows
managing (by writing to /sys/fs/cgroup/ve/CTID/ve.mount_opts) which
mount options are allowed inside the CT and also allows setting, e.g.,
hidden options that are appended to the user-defined mount options.
It is used primarily to provide the balloon_ino option for ext4 and, since
recently, xfs filesystems. A user is allowed to mount these filesystems
from inside a VE but may, of course, omit the balloon_ino option;
in that case devmnt appends it automatically on the kernel side.
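
To make the "allowed" and "hidden" behaviour concrete, here is a minimal
userspace sketch. It is not the kernel's ve_devmnt_process(); the function
names, the comma-separated list format and the balloon_ino value are all
assumptions made for illustration only.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the key of "opt" (the text before '=' or ',') appears in
 * the comma-separated "allowed" list, 0 otherwise.
 */
static int opt_is_allowed(const char *opt, const char *allowed)
{
	size_t key_len = strcspn(opt, "=,");
	const char *p = allowed;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (len == key_len && strncmp(p, opt, len) == 0)
			return 1;
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}

/*
 * Hypothetical model: reject any user option whose key is not allowed,
 * then append the hidden options. Returns 0, -EPERM or -ENOMEM.
 */
static int devmnt_model_process(const char *user_opts, const char *allowed,
				const char *hidden, char *out, size_t out_len)
{
	const char *p = user_opts;
	int n;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (len && !opt_is_allowed(p, allowed))
			return -EPERM;
		p += len;
		if (*p == ',')
			p++;
	}

	n = snprintf(out, out_len, "%s%s%s", user_opts,
		     (*user_opts && *hidden) ? "," : "", hidden);
	if (n < 0 || (size_t)n >= out_len)
		return -ENOMEM;
	return 0;
}
```

With allowed = "noatime,errors" and hidden = "balloon_ino=12", the user
string "noatime" would come out as "noatime,balloon_ino=12", while
"nobarrier" would be refused with -EPERM.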

Some time ago a set of new syscalls was added: fsopen, fspick,
fsconfig, fsmount, move_mount and open_tree [1]. This new kernel API
lets userspace work with detached VFS trees and attach these
detached trees to a mount namespace (using the move_mount syscall).
Our devmnt implementation hooks into fs/super.c and fs/namespace.c
to catch the two paths a user may take: mounting and remounting.
But the new fsopen subsystem brought major changes to the in-kernel
filesystem API: a filesystem may now define its own
fc->ops->get_tree() callback, which has to set the root dentry
(fc->root) of the future mount. If a filesystem is not converted
to the new API, the default get_tree callback implementation
(legacy_get_tree) is used.

1. The ext4 filesystem uses legacy_get_tree.
2. The xfs filesystem uses its own ->get_tree implementation
(built on the generic get_tree_bdev helper).
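
The split between the two cases can be pictured with a small userspace
model of the callback dispatch. This is only a sketch: the struct and
function names below are invented for illustration and merely mimic the
idea of falling back to a legacy ops table when a filesystem type does
not supply its own.

```c
#include <stddef.h>

/* Which in-kernel helper the model pretends was reached. */
enum path_taken { VIA_MOUNT_BDEV, VIA_GET_TREE_BDEV };

struct ctx;

struct ctx_ops {
	enum path_taken (*get_tree)(struct ctx *c);
};

struct fstype {
	const char *name;
	const struct ctx_ops *ops;	/* NULL: not converted to the new API */
};

struct ctx {
	const struct fstype *type;
	const struct ctx_ops *ops;
};

/* Legacy path: ends up in the old mount_bdev()-style code (hooked). */
static enum path_taken legacy_get_tree_model(struct ctx *c)
{
	(void)c;
	return VIA_MOUNT_BDEV;
}

/* Converted filesystem: its own callback, built on a generic helper. */
static enum path_taken own_get_tree_model(struct ctx *c)
{
	(void)c;
	return VIA_GET_TREE_BDEV;
}

static const struct ctx_ops legacy_ops = { .get_tree = legacy_get_tree_model };
static const struct ctx_ops own_ops = { .get_tree = own_get_tree_model };

/* Pick the filesystem's own ops if present, else the legacy ones. */
static void ctx_init(struct ctx *c, const struct fstype *t)
{
	c->type = t;
	c->ops = t->ops ? t->ops : &legacy_ops;
}

static enum path_taken ctx_get_tree(struct ctx *c)
{
	return c->ops->get_tree(c);
}
```

In this model an "ext4" type with ops == NULL lands on VIA_MOUNT_BDEV,
while an "xfs" type with &own_ops lands on VIA_GET_TREE_BDEV, which is
the path the existing hook never sees.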

When legacy_get_tree is used, the good old mount_bdev() function
is called (and we have the corresponding devmnt hook inside it!).
But a filesystem with its own get_tree implementation uses, in most
cases, the generic get_tree_bdev helper, which does not contain our
hooks. You may ask: what's the problem, just add the hook to that
function and be happy? That doesn't work. By the time
get_tree_bdev() runs it is too late to check mount options: they
have all been parsed into fs-specific structures already, and we
have no generic way to extract and check them.

Another serious difficulty is that a user may mount a filesystem like this:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
    fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
    fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);

The problem is that the source block device arrives as an ordinary
option: the user may, for instance, specify all mount options first and
provide the source block device path only at the end of the configuration
stage. But the devmnt subsystem bases its checks on the block device
major & minor numbers. That means we can't just hook fsconfig() and check
each option as it arrives. We have to postpone the check until some
"final stage", when the source blkdev is known.
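
Why the device must be known first can be sketched as a lookup keyed by
major & minor numbers. This is a userspace illustration only: struct
devmnt_rule and devmnt_find_rule() are hypothetical names, and the real
devmnt data structures are not shown in this patch.

```c
#include <stddef.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical per-device rule: which hidden options belong to a device. */
struct devmnt_rule {
	dev_t dev;
	const char *hidden;
};

/*
 * Find the rule for "dev" by comparing major and minor numbers.
 * Returns NULL when no rule exists for that device.
 */
static const struct devmnt_rule *
devmnt_find_rule(const struct devmnt_rule *rules, size_t n, dev_t dev)
{
	for (size_t i = 0; i < n; i++) {
		if (major(rules[i].dev) == major(dev) &&
		    minor(rules[i].dev) == minor(dev))
			return &rules[i];
	}
	return NULL;
}
```

Without a dev_t there is nothing to pass to such a lookup, so an option
that arrives before "source" cannot be matched against any rule yet.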

Okay, so what do we do?
1. Modify the fsconfig() syscall implementation so that it catches
the filesystems we are interested in (backed by a block device
+ !ve_is_super) and, instead of passing parameters through to the
filesystem directly, saves all user-specified parameters in a temporary
buffer, fc->lazy_opts.
2. At the FSCONFIG_CMD_CREATE (or FSCONFIG_CMD_RECONFIGURE) stage, take
the mount options string built in fc->lazy_opts and check all parameters
using the ve_devmnt_process function.
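
The two steps above can be modelled in userspace roughly as follows.
This is a sketch under assumptions, not the patch itself: struct
lazy_ctx, the PARAM_* enum and the function names are invented stand-ins
for fc->lazy_opts, the fs_value_is_* types and the kernel helpers, and
the buffer is a fixed 4096 bytes in place of a page.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LAZY_BUF_SIZE 4096	/* stands in for PAGE_SIZE */

/* Hypothetical stand-ins for the kernel's fs_value_is_* types. */
enum param_type { PARAM_FLAG, PARAM_STRING, PARAM_BLOB, PARAM_FD };

struct lazy_ctx {
	char opts[LAZY_BUF_SIZE];	/* "key,key=value," collected so far */
	char source[256];		/* empty until "source" is configured */
};

static void lazy_init(struct lazy_ctx *ctx)
{
	ctx->opts[0] = '\0';
	ctx->source[0] = '\0';
}

/* Handle one fsconfig()-like parameter. */
static int lazy_add_param(struct lazy_ctx *ctx, enum param_type type,
			  const char *key, const char *value)
{
	size_t used, need;

	/* "source" is not deferred: it names the device the checks need. */
	if (strcmp(key, "source") == 0) {
		if (type != PARAM_STRING || strlen(value) >= sizeof(ctx->source))
			return -EINVAL;
		strcpy(ctx->source, value);
		return 0;
	}

	/* Only flags and strings can be serialized into an option string. */
	if (type != PARAM_FLAG && type != PARAM_STRING)
		return -EPERM;

	used = strlen(ctx->opts);
	need = strlen(key) + 1;			/* key + ',' */
	if (type == PARAM_STRING)
		need += strlen(value) + 1;	/* '=' + value */

	/* Check the whole size up front so a failure leaves opts untouched. */
	if (need > sizeof(ctx->opts) - used - 1)
		return -ENOMEM;

	strcat(ctx->opts, key);
	if (type == PARAM_STRING) {
		strcat(ctx->opts, "=");
		strcat(ctx->opts, value);
	}
	strcat(ctx->opts, ",");
	return 0;
}

/* "Create" step: the deferred check can run only once source is known. */
static int lazy_finalize(const struct lazy_ctx *ctx, const char **opts_out)
{
	if (ctx->source[0] == '\0')
		return -ENODEV;
	*opts_out = ctx->opts;	/* real code would validate the options here */
	return 0;
}
```

In this model the order of calls no longer matters: "errors=continue" and
"noatime" can be added before "source", and lazy_finalize() hands back
"errors=continue,noatime," only after the source has been set.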

Is this an ideal solution? No.
Here is why:
1. fsconfig gives filesystem developers a way to accept non-standard forms
of mount options. For instance, a filesystem may take a file descriptor,
a path or a binary blob. In such cases we obviously can't just concatenate
the data into fc->lazy_opts. The approach works only if the filesystem uses
classical parameters such as flags and strings. Fortunately, xfs and ext4
belong to this set of filesystems.
2. The changes to the fsopen code are not simple.

v2: improved init/free logic, fixed a small issue
v3: small fixes. Thanks to Pavel Tikhomirov for review
v4: added memcg accounting

https://jira.sw.ru/browse/PSBM-133521

[1] http://kernsec.org/pipermail/linux-security-module-archive/2019-February/011442.html

Cc: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Cc: Konstantin Khorenko <khorenko at virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
Feature: fs/mnt: hidden and allowed mount options in CT
---
 fs/fs_context.c            | 226 +++++++++++++++++++++++++++++++++++++
 fs/fsopen.c                |  32 +++++-
 include/linux/fs_context.h |   9 ++
 3 files changed, 266 insertions(+), 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 24ce12f0db32..09c374eea174 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
 #include <linux/fs.h>
+#include <linux/blkdev.h>
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
 #include <linux/slab.h>
@@ -19,6 +20,7 @@
 #include <linux/mnt_namespace.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/ve.h>
 #include <net/net_namespace.h>
 #include <asm/sections.h>
 #include "mount.h"
@@ -160,6 +162,171 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param);
 
+#ifdef CONFIG_VE
+int __vfs_add_monolithic_fs_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	char *opts = fc->lazy_opts;
+	size_t free_size;
+	size_t estimated_size = 0;
+
+	BUG_ON(!opts);
+
+	/* lazy_opts length + place for 0-byte */
+	free_size = PAGE_SIZE - (strlen(opts) + 1);
+
+	/* param->key length + sizeof(',') */
+	estimated_size += strlen(param->key) + 1;
+	if (free_size < estimated_size)
+		return -ENOMEM;
+
+	strcat(opts, param->key);
+
+	if (param->type == fs_value_is_string) {
+		/* param->string length + sizeof('=') */
+		estimated_size += strlen(param->string) + 1;
+		if (free_size < estimated_size)
+			return -ENOMEM;
+
+		strcat(opts, "=");
+		strcat(opts, param->string);
+	}
+
+	strcat(opts, ",");
+
+	return 0;
+}
+
+static inline int fscontext_lookup_bdev(struct fs_context *fc, dev_t *s_dev)
+{
+	dev_t bd_dev;
+	int ret;
+
+	/* can we reach fc->root->d_sb->s_dev ? */
+	if (fc->root && fc->root->d_sb) {
+		if (s_dev)
+			*s_dev = fc->root->d_sb->s_dev;
+
+		return 0;
+	}
+
+	if (fc->source) {
+		ret = lookup_bdev(fc->source, &bd_dev);
+		if (ret)
+			return ret;
+
+		if (s_dev)
+			*s_dev = bd_dev;
+
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
+static int fscontext_init_lazy_opts(struct fs_context *fc)
+{
+	struct ve_struct *ve = get_exec_env();
+	int fs_flags;
+	char *opts;
+
+	fc->lazy_opts = NULL;
+
+	if (!(fc->purpose == FS_CONTEXT_FOR_MOUNT ||
+	      fc->purpose == FS_CONTEXT_FOR_RECONFIGURE))
+		return 0;
+
+	if (ve_is_super(ve))
+		return 0;
+
+	/*
+	 * Currently, fc->fs_type is filled in fsopen(),
+	 * so this situation is impossible, but let's
+	 * be cautious as fs_context implementation may
+	 * be changed in the future.
+	 */
+	if (!fc->fs_type) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	fs_flags = fc->fs_type->fs_flags;
+
+	/*
+	 * If we already know block device then we can do devmnt checks,
+	 * if fs doesn't require block dev we have to skip it
+	 */
+	if (!(fs_flags & FS_REQUIRES_DEV) || !fscontext_lookup_bdev(fc, NULL))
+		return 0;
+
+	/* we are interested in filesystems which can be mounted from inside VE */
+	if (!(fs_flags & FS_VIRTUALIZED))
+		return 0;
+
+	fc->lazy_opts = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!fc->lazy_opts)
+		return -ENOMEM;
+
+	opts = fc->lazy_opts;
+
+	/* we will use fc->lazy_opts as a string */
+	*opts = 0;
+
+	return 0;
+}
+
+int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct ve_struct *ve = get_exec_env();
+
+	/* it means that we didn't turn on lazy mode */
+	if (!fc->lazy_opts)
+		goto non_lazy_way;
+
+	/*
+	 * The source parameter must bypass lazy opts, as it
+	 * describes the source bdev for the fs
+	 */
+	if (!strcmp(param->key, "source"))
+		goto non_lazy_way;
+
+	if (!(param->type == fs_value_is_flag ||
+	      param->type == fs_value_is_string)) {
+		/*
+		 * In case when we can't serialize fs parameters into
+		 * a string it means that this fs uses blob parameters,
+		 * or fd parameters, or something similar. We can't use
+		 * devmnt to control such mounts.
+		 */
+		ve_pr_warn_ratelimited(VE0_LOG, "VE%s: can't do devmnt checks "
+					  "for fstype %s (param key %s, param type %d)\n",
+					  ve->ve_name,
+					  fc->fs_type->name,
+					  param->key,
+					  param->type
+					);
+
+		return -EPERM;
+	}
+
+	/*
+	 * If we got here, we want to collect all mount
+	 * parameters and then check them all at once, when
+	 * fc->source is known.
+	 */
+	return __vfs_add_monolithic_fs_param(fc, param);
+
+non_lazy_way:
+	return vfs_parse_fs_param(fc, param);
+}
+EXPORT_SYMBOL(vfs_parse_fs_param_lazy);
+#else
+static inline int fscontext_init_lazy_opts(struct fs_context *fc)
+{
+	fc->lazy_opts = NULL;
+	return 0;
+}
+#endif
+
 /**
  * vfs_parse_fs_string - Convenience function to just parse a string.
  */
@@ -202,6 +369,36 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
 {
 	char *options = data, *key;
 	int ret = 0;
+#ifdef CONFIG_VE
+	void *options_orig = options;
+	void *options_after;
+	struct ve_struct *ve = get_exec_env();
+
+	if (!ve_is_super(ve) && (fc->fs_type->fs_flags & FS_REQUIRES_DEV)) {
+		dev_t bd_dev;
+
+		ret = fscontext_lookup_bdev(fc, &bd_dev);
+		if (ret) {
+			errorf(fc, "%s: Can't lookup blockdev", fc->source);
+			return -ENODEV;
+		}
+
+		ret = ve_devmnt_process(ve, bd_dev, (void **) &options,
+					fc->purpose == FS_CONTEXT_FOR_RECONFIGURE);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Very dangerous place.
+	 * 1. ve_devmnt_process may alloc new page and write address to options variable.
+	 * 2. we have to detect such case and call free_page() if so, but
+	 * 3. we can free page after all options processing, but during that
+	 * strsep() function modifies options variable too. Ugh.
+	 * Let's save page address to options_after and use it to detect new page allocation.
+	 */
+	options_after = options;
+#endif
 
 	if (!options)
 		return 0;
@@ -227,6 +424,11 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
 		}
 	}
 
+#ifdef CONFIG_VE
+	if (options_after != options_orig)
+		free_page((unsigned long)options_after);
+#endif
+
 	return ret;
 }
 EXPORT_SYMBOL(generic_parse_monolithic);
@@ -290,7 +492,13 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
 	ret = init_fs_context(fc);
 	if (ret < 0)
 		goto err_fc;
+
 	fc->need_free = true;
+
+	ret = fscontext_init_lazy_opts(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	return fc;
 
 err_fc:
@@ -366,6 +574,10 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	if (ret < 0)
 		goto err_fc;
 
+	ret = fscontext_init_lazy_opts(fc);
+	if (ret < 0)
+		goto err_fc;
+
 	ret = security_fs_context_dup(fc, src_fc);
 	if (ret < 0)
 		goto err_fc;
@@ -474,6 +686,8 @@ void put_fs_context(struct fs_context *fc)
 	put_cred(fc->cred);
 	put_fc_log(fc);
 	put_filesystem(fc->fs_type);
+	if (fc->lazy_opts)
+		free_page((unsigned long)fc->lazy_opts);
 	kfree(fc->source);
 	kfree(fc);
 }
@@ -689,6 +903,10 @@ void vfs_clean_context(struct fs_context *fc)
 	fc->s_fs_info = NULL;
 	fc->sb_flags = 0;
 	security_free_mnt_opts(&fc->security);
+	if (fc->lazy_opts) {
+		free_page((unsigned long)fc->lazy_opts);
+		fc->lazy_opts = NULL;
+	}
 	kfree(fc->source);
 	fc->source = NULL;
 
@@ -711,7 +929,15 @@ int finish_clean_context(struct fs_context *fc)
 		fc->phase = FS_CONTEXT_FAILED;
 		return error;
 	}
+
 	fc->need_free = true;
+
+	error = fscontext_init_lazy_opts(fc);
+	if (unlikely(error)) {
+		fc->phase = FS_CONTEXT_FAILED;
+		return error;
+	}
+
 	fc->phase = FS_CONTEXT_RECONF_PARAMS;
 	return 0;
 }
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 27a890aa493a..a0a3b1eb6fcf 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -209,6 +209,26 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
 	return ret;
 }
 
+static int fscontext_finalize_lazy_opts(struct fs_context *fc)
+{
+#ifdef CONFIG_VE
+	if (fc->lazy_opts) {
+		int ret;
+
+		/*
+		 * Now we have fc->source set and can do all delayed
+		 * devmnt checks. We need just to call
+		 * generic_parse_monolithic on our fc->lazy_opts.
+		 */
+		ret = generic_parse_monolithic(fc, fc->lazy_opts);
+		if (ret)
+			return ret;
+	}
+#endif
+
+	return 0;
+}
+
 /*
  * Check the state and apply the configuration.  Note that this function is
  * allowed to 'steal' the value by setting param->xxx to NULL before returning.
@@ -228,6 +248,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 			return -EBUSY;
 		if (!mount_capable(fc))
 			return -EPERM;
+
+		ret = fscontext_finalize_lazy_opts(fc);
+		if (ret)
+			return ret;
+
 		fc->phase = FS_CONTEXT_CREATING;
 		ret = vfs_get_tree(fc);
 		if (ret)
@@ -250,6 +275,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 			ret = -EPERM;
 			break;
 		}
+
+		ret = fscontext_finalize_lazy_opts(fc);
+		if (ret)
+			return ret;
+
 		down_write(&sb->s_umount);
 		ret = reconfigure_super(fc);
 		up_write(&sb->s_umount);
@@ -262,7 +292,7 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
 			return -EBUSY;
 
-		return vfs_parse_fs_param(fc, param);
+		return vfs_parse_fs_param_lazy(fc, param);
 	}
 	fc->phase = FS_CONTEXT_FAILED;
 	return ret;
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 6b54982fc5f3..9d966c85d0a1 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -92,6 +92,7 @@ struct fs_context {
 	struct mutex		uapi_mutex;	/* Userspace access mutex */
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
+	void			*lazy_opts;	/* mount options which can't be checked at fsconfig() time */
 	void			*sget_key;
 	struct dentry		*root;		/* The root and superblock */
 	struct user_namespace	*user_ns;	/* The user namespace for this mount */
@@ -134,6 +135,14 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
 
 extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
 extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
+#ifdef CONFIG_VE
+extern int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param);
+#else
+static inline int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
+{
+	return vfs_parse_fs_param(fc, param);
+}
+#endif
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
 extern int generic_parse_monolithic(struct fs_context *fc, void *data);
-- 
2.36.1
