[Devel] [PATCH VZ9 v4] fsopen, devmnt, fs_context: add vz devmnt feature support to fs_context

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Thu Jun 9 11:58:40 MSK 2022



On 08.06.2022 19:24, Alexander Mikhalitsyn wrote:
> Let's start with some background. We have the devmnt feature, which
> allows managing (by writing to /sys/fs/cgroup/ve/CTID/ve.mount_opts)
> which mount options are allowed inside the CT and also allows setting,
> e.g., hidden options which are appended to user-defined mount options.
> It's used primarily to provide the balloon_ino option for ext4 and, as
> of recently, xfs filesystems. The user is allowed to mount these
> filesystems from inside the VE but may, of course, omit the balloon_ino
> option; in that case devmnt appends it automatically on the kernel side.
> 
> Some time ago a bunch of new kernel syscalls was added:
> fsopen, fspick, fsconfig, fsmount, move_mount, open_tree [1]. This
> new kernel API allows userspace to work with detached
> VFS trees and also to attach these detached trees to a mount namespace
> (using the move_mount syscall). Our devmnt feature implementation hooks
> into fs/super.c and fs/namespace.c to catch the two paths
> a user may take: one is mounting and the other is
> remounting. But when the new fsopen subsystem appeared,
> the in-kernel filesystem API changed significantly: now
> filesystems may define their own fc->ops->get_tree() callback,
> which has to set the root dentry (fc->root) of the future mount.
> If a filesystem is not converted to the new API, then the default
> get_tree callback implementation is used (legacy_get_tree).
> 
> 1. The ext4 filesystem uses legacy_get_tree.
> 2. The xfs filesystem uses its own ->get_tree implementation
> (using the generic get_tree_bdev helper inside).
> 
> When legacy_get_tree is used, the well-known mount_bdev()
> function is called (and we have the corresponding devmnt hook
> inside it!). But if a filesystem uses its own get_tree implementation,
> then it uses the generic get_tree_bdev helper (in most cases).
> This helper does not contain our hooks. You may ask: what's the problem,
> just add our hook inside this function and we will be happy? But
> that doesn't work. Why? Because it's too late to check mount options
> in get_tree_bdev(): all options are already inside fs-specific
> structures and we have no generic way to extract and check them.
> 
> Another serious difficulty here is that the user may mount a fs like this:
> 
>      fd = fsopen("ext4", FSOPEN_CLOEXEC);
>      fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
>      fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
>      fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>      mfd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
> 
> The problem here is that we get the source block device as an ordinary
> option, and the user may, for instance, specify all mount options first
> and provide the source block device path only at the end of the
> configuration stage. But the devmnt subsystem does its checks based on
> the block device major & minor numbers. It means that we can't just hook
> fsconfig() and check each option as it arrives. We have to postpone this
> check to some "final stage" when we know the source blkdev.
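
The ordering problem described above can be modelled in userspace. The
sketch below is only an illustration, not code from the patch: struct
mount_ctx, ctx_set(), ctx_create() and demo_policy() are hypothetical
names, and the policy (only "noatime" allowed on /dev/sda1) is made up.
It shows options being queued until the source is known and checked
only at create time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for the state kept while fsconfig() calls arrive. */
struct mount_ctx {
	const char *source;               /* block device path, may arrive last */
	const char *pending[MAX_PENDING]; /* option keys queued until create */
	size_t npending;
};

/* Hypothetical policy callback: nonzero if `opt` is allowed on `source`. */
typedef int (*opt_policy_fn)(const char *source, const char *opt);

/* Record one parameter; only "source" is applied immediately. */
static int ctx_set(struct mount_ctx *ctx, const char *key, const char *value)
{
	assert(ctx && key);

	if (strcmp(key, "source") == 0) {
		ctx->source = value;
		return 0;
	}
	if (ctx->npending == MAX_PENDING)
		return -ENOMEM;
	ctx->pending[ctx->npending++] = key;
	return 0;
}

/* The "final stage": the device is known, so every queued option is checked. */
static int ctx_create(const struct mount_ctx *ctx, opt_policy_fn allowed)
{
	size_t i;

	if (!ctx->source)
		return -ENODEV;
	for (i = 0; i < ctx->npending; i++)
		if (!allowed(ctx->source, ctx->pending[i]))
			return -EPERM;
	return 0;
}

/* Made-up policy for the demo: on /dev/sda1 only "noatime" is allowed. */
static int demo_policy(const char *source, const char *opt)
{
	return strcmp(source, "/dev/sda1") == 0 && strcmp(opt, "noatime") == 0;
}
```

Calling ctx_set() with "noatime" before "source" succeeds, while
ctx_create() returns -ENODEV until the source has been set.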
> 
> Okay, what will we do?
> 1. Let's modify the fsconfig() syscall implementation so that we
> catch filesystems of interest (backed by a block device
> + !ve_is_super) and do not pass parameters through to the filesystem
> directly but just save all user-specified parameters to a temporary
> buffer, fc->lazy_opts.
> 2. At the FSCONFIG_CMD_CREATE stage we take the mount options string
> built in fc->lazy_opts and check all parameters using the
> ve_devmnt_process function.
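
Step 1 boils down to appending "key," or "key=value," to a fixed-size
string. A minimal userspace sketch of that accumulation follows, assuming
a 4096-byte buffer in place of the kernel page; lazy_opts_append() and
LAZY_OPTS_SIZE are hypothetical names, and unlike the patch this variant
computes the full size before touching the buffer, so a failed append
leaves it unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LAZY_OPTS_SIZE 4096 /* stands in for PAGE_SIZE */

/*
 * Append "key," (flag, value == NULL) or "key=value," to the NUL-terminated
 * option string in `opts`, a buffer of LAZY_OPTS_SIZE bytes.
 * Returns 0 on success, -ENOMEM if the result would not fit.
 */
static int lazy_opts_append(char *opts, const char *key, const char *value)
{
	size_t used, need;

	assert(opts && key);

	used = strlen(opts) + 1;           /* current text plus the NUL byte */
	need = strlen(key) + 1;            /* key plus trailing ',' */
	if (value)
		need += strlen(value) + 1; /* '=' plus value */

	if (need > LAZY_OPTS_SIZE - used)
		return -ENOMEM;

	strcat(opts, key);
	if (value) {
		strcat(opts, "=");
		strcat(opts, value);
	}
	strcat(opts, ",");
	return 0;
}
```

Appending ("errors", "continue") and then the flag "noatime" to an empty
buffer leaves "errors=continue,noatime," in it, which is the monolithic
form that can be handed to an option-string parser at create time.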
> 
> Is this an ideal solution? No.
> Here is why:
> 1. fsconfig gives filesystem developers a way to accept non-standard forms
> of filesystem mount options. For instance, a filesystem may take a file
> descriptor, a path, or a binary blob. Of course, in such cases we can't just
> concatenate such data into fc->lazy_opts. This approach works only if the
> filesystem uses classical parameters like flags and strings. Fortunately,
> xfs and ext4 are in this set of filesystems.
> 2. The changes to the fsopen code are not simple.
> 
> v2: improved init/free logic, fixed a small issue
> v3: small fixes. Thanks to Pavel Tikhomirov for review
> v4: added memcg accounting
> 
> https://jira.sw.ru/browse/PSBM-133521
> 
> [1] http://kernsec.org/pipermail/linux-security-module-archive/2019-February/011442.html
> 

Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

> Cc: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> Cc: Konstantin Khorenko <khorenko at virtuozzo.com>
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
> Feature: fs/mnt: hidden and allowed mount options in CT
> ---
>   fs/fs_context.c            | 226 +++++++++++++++++++++++++++++++++++++
>   fs/fsopen.c                |  32 +++++-
>   include/linux/fs_context.h |   9 ++
>   3 files changed, 266 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 24ce12f0db32..09c374eea174 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -11,6 +11,7 @@
>   #include <linux/fs_context.h>
>   #include <linux/fs_parser.h>
>   #include <linux/fs.h>
> +#include <linux/blkdev.h>
>   #include <linux/mount.h>
>   #include <linux/nsproxy.h>
>   #include <linux/slab.h>
> @@ -19,6 +20,7 @@
>   #include <linux/mnt_namespace.h>
>   #include <linux/pid_namespace.h>
>   #include <linux/user_namespace.h>
> +#include <linux/ve.h>
>   #include <net/net_namespace.h>
>   #include <asm/sections.h>
>   #include "mount.h"
> @@ -160,6 +162,171 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
>   }
>   EXPORT_SYMBOL(vfs_parse_fs_param);
>   
> +#ifdef CONFIG_VE
> +int __vfs_add_monolithic_fs_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> +	char *opts = fc->lazy_opts;
> +	size_t free_size;
> +	size_t estimated_size = 0;
> +
> +	BUG_ON(!opts);
> +
> +	/* lazy_opts length + place for 0-byte */
> +	free_size = PAGE_SIZE - (strlen(opts) + 1);
> +
> +	/* param->key length + sizeof(',') */
> +	estimated_size += strlen(param->key) + 1;
> +	if (free_size < estimated_size)
> +		return -ENOMEM;
> +
> +	strcat(opts, param->key);
> +
> +	if (param->type == fs_value_is_string) {
> +		/* param->string length + sizeof('=') */
> +		estimated_size += strlen(param->string) + 1;
> +		if (free_size < estimated_size)
> +			return -ENOMEM;
> +
> +		strcat(opts, "=");
> +		strcat(opts, param->string);
> +	}
> +
> +	strcat(opts, ",");
> +
> +	return 0;
> +}
> +
> +static inline int fscontext_lookup_bdev(struct fs_context *fc, dev_t *s_dev)
> +{
> +	dev_t bd_dev;
> +	int ret;
> +
> +	/* can we reach fc->root->d_sb->s_dev ? */
> +	if (fc->root && fc->root->d_sb) {
> +		if (s_dev)
> +			*s_dev = fc->root->d_sb->s_dev;
> +
> +		return 0;
> +	}
> +
> +	if (fc->source) {
> +		ret = lookup_bdev(fc->source, &bd_dev);
> +		if (ret)
> +			return ret;
> +
> +		if (s_dev)
> +			*s_dev = bd_dev;
> +
> +		return 0;
> +	}
> +
> +	return -ENODEV;
> +}
> +
> +static int fscontext_init_lazy_opts(struct fs_context *fc)
> +{
> +	struct ve_struct *ve = get_exec_env();
> +	int fs_flags;
> +	char *opts;
> +
> +	fc->lazy_opts = NULL;
> +
> +	if (!(fc->purpose == FS_CONTEXT_FOR_MOUNT ||
> +	      fc->purpose == FS_CONTEXT_FOR_RECONFIGURE))
> +		return 0;
> +
> +	if (ve_is_super(ve))
> +		return 0;
> +
> +	/*
> +	 * Currently, fc->fs_type is filled in fsopen(),
> +	 * so this situation is impossible, but let's
> +	 * be cautious as fs_context implementation may
> +	 * be changed in the future.
> +	 */
> +	if (!fc->fs_type) {
> +		WARN_ON(1);
> +		return 0;
> +	}
> +
> +	fs_flags = fc->fs_type->fs_flags;
> +
> +	/*
> +	 * If we already know block device then we can do devmnt checks,
> +	 * if fs doesn't require block dev we have to skip it
> +	 */
> +	if (!(fs_flags & FS_REQUIRES_DEV) || !fscontext_lookup_bdev(fc, NULL))
> +		return 0;
> +
> +	/* we are interested in filesystems which can be mounted from inside VE */
> +	if (!(fs_flags & FS_VIRTUALIZED))
> +		return 0;
> +
> +	fc->lazy_opts = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!fc->lazy_opts)
> +		return -ENOMEM;
> +
> +	opts = fc->lazy_opts;
> +
> +	/* we will use fc->lazy_opts as a string */
> +	*opts = 0;
> +
> +	return 0;
> +}
> +
> +int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
> +{
> +	struct ve_struct *ve = get_exec_env();
> +
> +	/* it means that we didn't turn on lazy mode */
> +	if (!fc->lazy_opts)
> +		goto non_lazy_way;
> +
> +	/*
> +	 * source parameter should bypass lazy opts, as it
> +	 * describes source bdev for fs
> +	 */
> +	if (!strcmp(param->key, "source"))
> +		goto non_lazy_way;
> +
> +	if (!(param->type == fs_value_is_flag ||
> +	      param->type == fs_value_is_string)) {
> +		/*
> +		 * In case when we can't serialize fs parameters into
> +		 * a string it means that this fs uses blob parameters,
> +		 * or fd parameters, or something similar. We can't use
> +		 * devmnt to control such mounts.
> +		 */
> +		ve_pr_warn_ratelimited(VE0_LOG, "VE%s: can't do devmnt checks "
> +					  "for fstype %s (param key %s, param type %d)\n",
> +					  ve->ve_name,
> +					  fc->fs_type->name,
> +					  param->key,
> +					  param->type
> +					);
> +
> +		return -EPERM;
> +	}
> +
> +	/*
> +	 * Okay, if we are here it means that we want to collect all
> +	 * mount parameters and then check them all at once, when
> +	 * fc->source is known.
> +	 */
> +	return __vfs_add_monolithic_fs_param(fc, param);
> +
> +non_lazy_way:
> +	return vfs_parse_fs_param(fc, param);
> +}
> +EXPORT_SYMBOL(vfs_parse_fs_param_lazy);
> +#else
> +static inline int fscontext_init_lazy_opts(struct fs_context *fc)
> +{
> +	fc->lazy_opts = NULL;
> +	return 0;
> +}
> +#endif
> +
>   /**
>    * vfs_parse_fs_string - Convenience function to just parse a string.
>    */
> @@ -202,6 +369,36 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
>   {
>   	char *options = data, *key;
>   	int ret = 0;
> +#ifdef CONFIG_VE
> +	void *options_orig = options;
> +	void *options_after;
> +	struct ve_struct *ve = get_exec_env();
> +
> +	if (!ve_is_super(ve) && (fc->fs_type->fs_flags & FS_REQUIRES_DEV)) {
> +		dev_t bd_dev;
> +
> +		ret = fscontext_lookup_bdev(fc, &bd_dev);
> +		if (ret) {
> +			errorf(fc, "%s: Can't lookup blockdev", fc->source);
> +			return -ENODEV;
> +		}
> +
> +		ret = ve_devmnt_process(ve, bd_dev, (void **) &options,
> +					fc->purpose == FS_CONTEXT_FOR_RECONFIGURE);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/*
> +	 * Very dangerous place.
> +	 * 1. ve_devmnt_process may alloc new page and write address to options variable.
> +	 * 2. we have to detect such case and call free_page() if so, but
> +	 * 3. we can free page after all options processing, but during that
> +	 * strsep() function modifies options variable too. Ugh.
> +	 * Let's save page address to options_after and use it to detect new page allocation.
> +	 */
> +	options_after = options;
> +#endif
>   
>   	if (!options)
>   		return 0;
> @@ -227,6 +424,11 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
>   		}
>   	}
>   
> +#ifdef CONFIG_VE
> +	if (options_after != options_orig)
> +		free_page((unsigned long)options_after);
> +#endif
> +
>   	return ret;
>   }
>   EXPORT_SYMBOL(generic_parse_monolithic);
> @@ -290,7 +492,13 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
>   	ret = init_fs_context(fc);
>   	if (ret < 0)
>   		goto err_fc;
> +
>   	fc->need_free = true;
> +
> +	ret = fscontext_init_lazy_opts(fc);
> +	if (ret < 0)
> +		goto err_fc;
> +
>   	return fc;
>   
>   err_fc:
> @@ -366,6 +574,10 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
>   	if (ret < 0)
>   		goto err_fc;
>   
> +	ret = fscontext_init_lazy_opts(fc);
> +	if (ret < 0)
> +		goto err_fc;
> +
>   	ret = security_fs_context_dup(fc, src_fc);
>   	if (ret < 0)
>   		goto err_fc;
> @@ -474,6 +686,8 @@ void put_fs_context(struct fs_context *fc)
>   	put_cred(fc->cred);
>   	put_fc_log(fc);
>   	put_filesystem(fc->fs_type);
> +	if (fc->lazy_opts)
> +		free_page((unsigned long)fc->lazy_opts);
>   	kfree(fc->source);
>   	kfree(fc);
>   }
> @@ -689,6 +903,10 @@ void vfs_clean_context(struct fs_context *fc)
>   	fc->s_fs_info = NULL;
>   	fc->sb_flags = 0;
>   	security_free_mnt_opts(&fc->security);
> +	if (fc->lazy_opts) {
> +		free_page((unsigned long)fc->lazy_opts);
> +		fc->lazy_opts = NULL;
> +	}
>   	kfree(fc->source);
>   	fc->source = NULL;
>   
> @@ -711,7 +929,15 @@ int finish_clean_context(struct fs_context *fc)
>   		fc->phase = FS_CONTEXT_FAILED;
>   		return error;
>   	}
> +
>   	fc->need_free = true;
> +
> +	error = fscontext_init_lazy_opts(fc);
> +	if (unlikely(error)) {
> +		fc->phase = FS_CONTEXT_FAILED;
> +		return error;
> +	}
> +
>   	fc->phase = FS_CONTEXT_RECONF_PARAMS;
>   	return 0;
>   }
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index 27a890aa493a..a0a3b1eb6fcf 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -209,6 +209,26 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>   	return ret;
>   }
>   
> +static int fscontext_finalize_lazy_opts(struct fs_context *fc)
> +{
> +#ifdef CONFIG_VE
> +	if (fc->lazy_opts) {
> +		int ret;
> +
> +		/*
> +		 * Now we have fc->source set and can do all delayed
> +		 * devmnt checks. We need just to call
> +		 * generic_parse_monolithic on our fc->lazy_opts.
> +		 */
> +		ret = generic_parse_monolithic(fc, fc->lazy_opts);
> +		if (ret)
> +			return ret;
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
>   /*
>    * Check the state and apply the configuration.  Note that this function is
>    * allowed to 'steal' the value by setting param->xxx to NULL before returning.
> @@ -228,6 +248,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
>   			return -EBUSY;
>   		if (!mount_capable(fc))
>   			return -EPERM;
> +
> +		ret = fscontext_finalize_lazy_opts(fc);
> +		if (ret)
> +			return ret;
> +
>   		fc->phase = FS_CONTEXT_CREATING;
>   		ret = vfs_get_tree(fc);
>   		if (ret)
> @@ -250,6 +275,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
>   			ret = -EPERM;
>   			break;
>   		}
> +
> +		ret = fscontext_finalize_lazy_opts(fc);
> +		if (ret)
> +			return ret;
> +
>   		down_write(&sb->s_umount);
>   		ret = reconfigure_super(fc);
>   		up_write(&sb->s_umount);
> @@ -262,7 +292,7 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
>   		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
>   			return -EBUSY;
>   
> -		return vfs_parse_fs_param(fc, param);
> +		return vfs_parse_fs_param_lazy(fc, param);
>   	}
>   	fc->phase = FS_CONTEXT_FAILED;
>   	return ret;
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> index 6b54982fc5f3..9d966c85d0a1 100644
> --- a/include/linux/fs_context.h
> +++ b/include/linux/fs_context.h
> @@ -92,6 +92,7 @@ struct fs_context {
>   	struct mutex		uapi_mutex;	/* Userspace access mutex */
>   	struct file_system_type	*fs_type;
>   	void			*fs_private;	/* The filesystem's context */
> +	void			*lazy_opts;	/* mount options which can't be checked at fsconfig() time */
>   	void			*sget_key;
>   	struct dentry		*root;		/* The root and superblock */
>   	struct user_namespace	*user_ns;	/* The user namespace for this mount */
> @@ -134,6 +135,14 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
>   
>   extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
>   extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
> +#ifdef CONFIG_VE
> +extern int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param);
> +#else
> +static inline int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
> +{
> +	return vfs_parse_fs_param(fc, param);
> +}
> +#endif
>   extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
>   			       const char *value, size_t v_size);
>   extern int generic_parse_monolithic(struct fs_context *fc, void *data);
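
The "very dangerous place" comment in generic_parse_monolithic() relies
on strsep() advancing the pointer it is given. A simplified userspace
sketch of why the page address has to be saved before parsing is below;
count_options() is a hypothetical helper, malloc()/free() stand in for
the page allocator, and the `replacement` argument plays the role of a
string that ve_devmnt_process may have allocated.

```c
#define _DEFAULT_SOURCE /* for strsep() on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the non-empty comma-separated options in `data`. If `replacement`
 * is non-NULL it supersedes `data` and is freed here after parsing.
 */
static int count_options(char *data, char *replacement)
{
	char *options = replacement ? replacement : data;
	char *to_free = replacement; /* saved before strsep() moves `options` */
	char *key;
	int n = 0;

	while ((key = strsep(&options, ",")) != NULL)
		if (*key)
			n++;

	/* `options` is NULL by now, so only the saved pointer can be freed. */
	assert(options == NULL);
	free(to_free);
	return n;
}
```

Freeing through `options` after the loop would pass NULL to free() and
leak the replacement buffer, which is the case the saved options_after
pointer guards against.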

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

