[Devel] [RFC PATCH v2 RH9] fsopen, devmnt, fs_context: add vz devmnt feature support to fs_context
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Thu Dec 9 12:10:38 MSK 2021
Looks good, please see minor nits in inline comments.
On 02.12.2021 23:38, Alexander Mikhalitsyn wrote:
> That's RFC patch.
>
> Let's start from some intro. We have devmnt feature which allows
> to manage (by writing to /sys/fs/cgroup/ve/CTID/ve.mount_opts) which
> mount options is allowed inside the CT and also allows to set e.g.
> hidden options which will be appended to user-defined mount options.
> It's used primarily to provide balloon_ino option for ext4 and, from
> near time, xfs filesystems. User is allowed to mount this filesystems
> from inside VE, but of course, user may omit balloon_ino options,
> in this case devmnt append it automatically from the kernel side.
>
> Some time ago the a bunch of a new kernel syscalls was added it's
> fsopen, fspick, fsconfig, fsmount, move_mount, open_tree [1]. All
> of this new kernel API allows userspace to work with detached
> VFS trees and also attach this detached trees to mount namespace
> (using move_mount syscall). Our devmnt feature implementation hooks
> into fs/super.c and fs/namespace.c to catch two possible places
> where user may walk. One case is mounting and the second case is
> remounting. But after this new fsopen subsystem was appeared
> we got major changes in the in-kernel filesystems API, now
> filesystems may define their-own fc->ops->get_tree() callback
> which have to set root-dentry (fc->root) of the future mount.
> If filesystem is not converted to the new API then default
> get_tree callback implementation is used (legacy_get_tree).
>
> 1. ext4 filesystem uses legacy_get_tree
> 2. xfs filesystem uses her-own ->get_tree implementation
> (using generic get_tree_bdev helper inside).
>
> When legacy_get_tree is used, then old-known mount_bdev()
> function is called (and we have corresponding devmnt hook
> inside it!). But if filesystem uses her-own get_tree implementation
> then it utilizes generic get_tree_bdev helper (in the most cases).
> This helper not contains our hooks. You may say, what's the problem
> just add our hook inside this function and we will be happy. But
> that's not work. Why? Because it's too late to check mount options
> in get_tree_bdev() because all options already inside fs-specific
> structures and we have no generic way to extract it and check.
>
> Another serious difficulty here, is that user may mount fs like this:
>
> fd = fsopen("ext4", FSOPEN_CLOEXEC);
> fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
> fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
> fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
> mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
>
> The problem here is that we get source block device as usual option,
> and user may, for instance, specify all mount options at first, and
> at the end of configuring stage provide source block device path.
> But devmnt subsystem do checks basing on block device major & minor
> numbers. It means that'we can't just hook fsconfig() and check all
> options on time. We have to postpone this check at some "final stage"
> when we know source blkdev.
>
> Okay, what we will do?
> 1. Let's modify fsconfig() syscall implementation, so we
> will catch filesystems of our interest (backed by block device
> + !ve_is_super) and do not paththrough parameters to the filesystem directly
> but just save all user-specified parameters to temporary
> buffer fc->lazy_opts.
> 2. On FSCONFIG_CMD_CREATE stage we will take our builded mount options
> string fc->lazy_opts and check all parameters using ve_devmnt_process
> function.
>
> Is this ideal solution? No.
> That's why:
> 1. fsconfig provides filesystem developers a way to get non-standard forms
> of filesystem mount options. For instance, filesystem may get file descriptor,
> path, binary blob. Of course in such cases we can't just concat such data in
> the fc->lazy_opts. This approach works only if filesystem uses classical parameters
> like flags and strings. Fortunately, xfs and ext4 in this set of fs'es.
> 2. Not simple changes in the fsopen code
>
> v2: improved init/free logic, fixed a small issue
>
> https://jira.sw.ru/browse/PSBM-133521
>
> [1] http://kernsec.org/pipermail/linux-security-module-archive/2019-February/011442.html
>
> Cc: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> Cc: Kirill Tkhai <ktkhai at virtuozzo.com>
> Cc: Cyrill Gorcunov <gorcunov at gmail.com>
> Cc: Konstantin Khorenko <khorenko at virtuozzo.com>
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
> ---
> fs/fs_context.c | 219 +++++++++++++++++++++++++++++++++++++
> fs/fsopen.c | 31 +++++-
> include/linux/fs_context.h | 9 ++
> 3 files changed, 258 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index de1985eae535..84aa7d8b932b 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -11,6 +11,7 @@
> #include <linux/fs_context.h>
> #include <linux/fs_parser.h>
> #include <linux/fs.h>
> +#include <linux/blkdev.h>
> #include <linux/mount.h>
> #include <linux/nsproxy.h>
> #include <linux/slab.h>
> @@ -19,6 +20,7 @@
> #include <linux/mnt_namespace.h>
> #include <linux/pid_namespace.h>
> #include <linux/user_namespace.h>
> +#include <linux/ve.h>
> #include <net/net_namespace.h>
> #include <asm/sections.h>
> #include "mount.h"
> @@ -160,6 +162,164 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
> }
> EXPORT_SYMBOL(vfs_parse_fs_param);
>
> +#ifdef CONFIG_VE
> +int __vfs_add_monolithic_fs_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> + char *opts = fc->lazy_opts;
> + size_t free_size;
> +
> + BUG_ON(!fc->lazy_opts);
BUG_ON(!opts);
> +
> + /* lazy_opts length + place for 0-byte */
> + free_size = PAGE_SIZE - (strlen(fc->lazy_opts) + 1);
Maybe we can use opts everywhere instead of fc->lazy_opts.
> +
> + /* param->key length + sizeof(',') */
> + if (free_size < (strlen(param->key) + 1))
> + return -ENOMEM;
> +
> + strcat(opts, param->key);
> +
> + if (param->type == fs_value_is_string) {
> + /* param->string length + sizeof('=') */
> + if (free_size < (strlen(param->string) + 1))
This if contition should likely be (feel free to simplify):
if (free_size < (strlen(param->key) + 1 + strlen(param->string) + 1))
> + return -ENOMEM;
> +
> + strcat(opts, "=");
> + strcat(opts, param->string);
> + }
> +
> + strcat(opts, ",");
> +
> + return 0;
> +}
> +
> +static inline int fscontext_lookup_bdev(struct fs_context *fc, dev_t *s_dev)
> +{
> + dev_t bd_dev;
> + int ret;
> +
> + /* can we reach fc->root->d_sb->s_dev ? */
> + if (fc->root && fc->root->d_sb && fc->root->d_sb) {
> + if (s_dev)
> + *s_dev = fc->root->d_sb->s_dev;
> +
> + return 0;
> + }
> +
> + if (fc->source) {
> + ret = lookup_bdev(fc->source, &bd_dev);
> + if (ret)
> + return 1;
return ret;
> +
> + if (s_dev)
> + *s_dev = bd_dev;
> +
> + return 0;
> + }
> +
> + return 1;
return -ENODEV;
> +}
> +
> +static int fscontext_init_lazy_opts(struct fs_context *fc)
> +{
> + struct ve_struct *ve = get_exec_env();
> + int fs_flags;
> + char *opts;
> +
> + fc->lazy_opts = NULL;
> +
> + if (!(fc->purpose == FS_CONTEXT_FOR_MOUNT ||
> + fc->purpose == FS_CONTEXT_FOR_RECONFIGURE))
> + return 0;
> +
> + if (ve_is_super(ve))
> + return 0; > +
> + /* Currently, fc->fs_type is filled in fsopen(),
> + * so this situation is impossible, but let's
> + * be cautious as fs_context implementation may
> + * be changed in the future.
> + */
Please use proper multi-line comment style everywhere
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#commenting
> + if (!fc->fs_type) {
> + WARN_ON(1);
> + return 0;
> + }
> +
> + fs_flags = fc->fs_type->fs_flags;
> +
> + /* If we already know block device then we can do devmnt checks,
> + * if fs doesn't require block dev we have to skip it
> + */
> + if (!(fs_flags & FS_REQUIRES_DEV) || !fscontext_lookup_bdev(fc, NULL))
> + return 0;
> +
> + /* we interested in filesystems which can be mounted from inside VE
> + */
> + if (!(fs_flags & FS_VIRTUALIZED || fs_flags & FS_VE_MOUNT))
> + return 0;
FS_VIRTUALIZED should be enough. FS_VE_MOUNT is an alternative to
FS_USERNS_MOUNT to determine if mount is allowed in each userns or only
ve init one. Still all fs-es allowed in ve should have FS_VIRTUALIZED
AFAICR.
> +
> + fc->lazy_opts = (void *)__get_free_page(GFP_KERNEL);
> + if (!fc->lazy_opts)
> + return -ENOMEM;
> +
> + opts = fc->lazy_opts;
> +
> + /* we will use fc->lazy_opts as a string */
> + *opts = 0;
> +
> + return 0;
> +}
> +
> +int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
> +{
> + struct ve_struct *ve = get_exec_env();
> +
> + /* it means that we didn't turn on lazy mode */
> + if (!fc->lazy_opts)
> + goto non_lazy_way;
Probably we need some WARN_ONCE(!ve_is_super(ve) && !fc->lazy_opts);
here to detect filesystems which evade devmnt checks.
> +
> + /* source parameter should be passed over lazy opts, as it
> + * describes source bdev for fs
> + */
> + if (!strcmp(param->key, "source"))
> + goto non_lazy_way;
> +
> + if (!(param->type == fs_value_is_flag ||
> + param->type == fs_value_is_string)) {
> + /* In case when we can't serialize fs parameters into
> + * a string it means that this fs uses blob parameters,
> + * or fd parameters, or something similar. We can't use
> + * devmnt to control such mounts.
> + */
> + ve_pr_warn_ratelimited(VE0_LOG, "VE%s: can't do devmnt checks "
> + "for fstype %s (param key %s, param type %d)\n",
> + ve->ve_name,
> + fc->fs_type->name,
> + param->key,
> + param->type
> + );
> +
> + return -EPERM;
> + }
> +
> + /* Okay, we here it means that we want to collect all mount
> + * parameters, and then check it all at one time when
> + * fc->source will be known.
> + */
> + return __vfs_add_monolithic_fs_param(fc, param);
> +
> +non_lazy_way:
> + return vfs_parse_fs_param(fc, param);
> +}
> +EXPORT_SYMBOL(vfs_parse_fs_param_lazy);
> +#else
> +static inline int fscontext_init_lazy_opts(struct fs_context *fc)
> +{
> + fc->lazy_opts = NULL;
> + return 0;
> +}
> +#endif
> +
> /**
> * vfs_parse_fs_string - Convenience function to just parse a string.
> */
> @@ -202,6 +362,36 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
> {
> char *options = data, *key;
> int ret = 0;
> +#ifdef CONFIG_VE
> + void *options_orig = options;
> + void *options_after;
> + struct ve_struct *ve = get_exec_env();
> +
> + if (!ve_is_super(ve) && (fc->fs_type->fs_flags & FS_REQUIRES_DEV)) {
> + dev_t bd_dev;
> +
> + ret = fscontext_lookup_bdev(fc, &bd_dev);
> + if (ret) {
> + errorf(fc, "%s: Can't lookup blockdev", fc->source);
> + return -ENODEV;
> + }
> +
> + ret = ve_devmnt_process(ve, bd_dev, (void **) &options,
> + fc->purpose == FS_CONTEXT_FOR_RECONFIGURE);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Very dangerous place.
> + * 1. ve_devmnt_process may alloc new page and write address to options variable.
> + * 2. we have to detect such case and call free_page() if so, but
> + * 3. we can free page after all options processing, but during that
> + * strsep() function modifies options variable too. Ugh.
> + * Let's save page address to options_after and use it to detect new page allocation.
> + */
> + options_after = options;
> +#endif
>
> if (!options)
> return 0;
> @@ -227,6 +417,11 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
> }
> }
>
> +#ifdef CONFIG_VE
> + if (options_after != options_orig)
> + free_page((unsigned long)options_after);
> +#endif
> +
> return ret;
> }
> EXPORT_SYMBOL(generic_parse_monolithic);
> @@ -290,7 +485,13 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
> ret = init_fs_context(fc);
> if (ret < 0)
> goto err_fc;
> +
> fc->need_free = true;
> +
> + ret = fscontext_init_lazy_opts(fc);
> + if (ret < 0)
> + goto err_fc;
> +
> return fc;
>
> err_fc:
> @@ -366,6 +567,10 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
> if (ret < 0)
> goto err_fc;
>
> + ret = fscontext_init_lazy_opts(fc);
> + if (ret < 0)
> + goto err_fc;
> +
> ret = security_fs_context_dup(fc, src_fc);
> if (ret < 0)
> goto err_fc;
> @@ -474,6 +679,8 @@ void put_fs_context(struct fs_context *fc)
> put_cred(fc->cred);
> put_fc_log(fc);
> put_filesystem(fc->fs_type);
> + if (fc->lazy_opts)
> + free_page((unsigned long)fc->lazy_opts);
> kfree(fc->source);
> kfree(fc);
> }
> @@ -689,6 +896,10 @@ void vfs_clean_context(struct fs_context *fc)
> fc->s_fs_info = NULL;
> fc->sb_flags = 0;
> security_free_mnt_opts(&fc->security);
> + if (fc->lazy_opts) {
> + free_page((unsigned long)fc->lazy_opts);
> + fc->lazy_opts = NULL;
> + }
> kfree(fc->source);
> fc->source = NULL;
>
> @@ -711,7 +922,15 @@ int finish_clean_context(struct fs_context *fc)
> fc->phase = FS_CONTEXT_FAILED;
> return error;
> }
> +
> fc->need_free = true;
> +
> + error = fscontext_init_lazy_opts(fc);
> + if (unlikely(error)) {
> + fc->phase = FS_CONTEXT_FAILED;
> + return error;
> + }
> +
> fc->phase = FS_CONTEXT_RECONF_PARAMS;
> return 0;
> }
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index 27a890aa493a..7c872191fa45 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -209,6 +209,25 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
> return ret;
> }
>
> +static int fscontext_finalize_lazy_opts(struct fs_context *fc)
> +{
> +#ifdef CONFIG_VE
> + if (fc->lazy_opts) {
> + int ret;
> +
> + /* Now we have fc->source set and can do all delayed
> + * devmnt checks. We need just to call
> + * generic_parse_monolithic on our fc->lazy_opts.
> + */
> + ret = generic_parse_monolithic(fc, fc->lazy_opts);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +#endif
> +}
> +
> /*
> * Check the state and apply the configuration. Note that this function is
> * allowed to 'steal' the value by setting param->xxx to NULL before returning.
> @@ -228,6 +247,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
> return -EBUSY;
> if (!mount_capable(fc))
> return -EPERM;
> +
> + ret = fscontext_finalize_lazy_opts(fc);
> + if (ret)
> + return ret;
> +
> fc->phase = FS_CONTEXT_CREATING;
> ret = vfs_get_tree(fc);
> if (ret)
> @@ -250,6 +274,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
> ret = -EPERM;
> break;
> }
> +
> + ret = fscontext_finalize_lazy_opts(fc);
> + if (ret)
> + return ret;
> +
> down_write(&sb->s_umount);
> ret = reconfigure_super(fc);
> up_write(&sb->s_umount);
> @@ -262,7 +291,7 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
> fc->phase != FS_CONTEXT_RECONF_PARAMS)
> return -EBUSY;
>
> - return vfs_parse_fs_param(fc, param);
> + return vfs_parse_fs_param_lazy(fc, param);
> }
> fc->phase = FS_CONTEXT_FAILED;
> return ret;
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> index 6b54982fc5f3..9d966c85d0a1 100644
> --- a/include/linux/fs_context.h
> +++ b/include/linux/fs_context.h
> @@ -92,6 +92,7 @@ struct fs_context {
> struct mutex uapi_mutex; /* Userspace access mutex */
> struct file_system_type *fs_type;
> void *fs_private; /* The filesystem's context */
> + void *lazy_opts; /* mount options which can't be checked at fsconfig() time */
> void *sget_key;
> struct dentry *root; /* The root and superblock */
> struct user_namespace *user_ns; /* The user namespace for this mount */
> @@ -134,6 +135,14 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
>
> extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
> extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
> +#ifdef CONFIG_VE
> +extern int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param);
> +#else
> +static inline int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
> +{
> + return vfs_parse_fs_param(fc, param);
> +}
> +#endif
> extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
> const char *value, size_t v_size);
> extern int generic_parse_monolithic(struct fs_context *fc, void *data);
>
--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
More information about the Devel
mailing list