[Devel] [RFC PATCH RH9] fsopen, devmnt, fs_context: add vz devmnt feature support to fs_context
Alexander Mikhalitsyn
alexander.mikhalitsyn at virtuozzo.com
Thu Dec 2 19:09:56 MSK 2021
That's RFC patch.
Let's start from some intro. We have devmnt feature which allows
to manage (by writing to /sys/fs/cgroup/ve/CTID/ve.mount_opts) which
mount options is allowed inside the CT and also allows to set e.g.
hidden options which will be appended to user-defined mount options.
It's used primarily to provide balloon_ino option for ext4 and, from
near time, xfs filesystems. User is allowed to mount this filesystems
from inside VE, but of course, user may omit balloon_ino options,
in this case devmnt append it automatically from the kernel side.
Some time ago the a bunch of a new kernel syscalls was added it's
fsopen, fspick, fsconfig, fsmount, move_mount, open_tree [1]. All
of this new kernel API allows userspace to work with detached
VFS trees and also attach this detached trees to mount namespace
(using move_mount syscall). Our devmnt feature implementation hooks
into fs/super.c and fs/namespace.c to catch two possible places
where user may walk. One case is mounting and the second case is
remounting. But after this new fsopen subsystem was appeared
we got major changes in the in-kernel filesystems API, now
filesystems may define their-own fc->ops->get_tree() callback
which have to set root-dentry (fc->root) of the future mount.
If filesystem is not converted to the new API then default
get_tree callback implementation is used (legacy_get_tree).
1. ext4 filesystem uses legacy_get_tree
2. xfs filesystem uses her-own ->get_tree implementation
(using generic get_tree_bdev helper inside).
When legacy_get_tree is used, then old-known mount_bdev()
function is called (and we have corresponding devmnt hook
inside it!). But if filesystem uses her-own get_tree implementation
then it utilizes generic get_tree_bdev helper (in the most cases).
This helper not contains our hooks. You may say, what's the problem
just add our hook inside this function and we will be happy. But
that's not work. Why? Because it's too late to check mount options
in get_tree_bdev() because all options already inside fs-specific
structures and we have no generic way to extract it and check.
Another serious difficulty here, is that user may mount fs like this:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
The problem here is that we get source block device as usual option,
and user may, for instance, specify all mount options at first, and
at the end of configuring stage provide source block device path.
But devmnt subsystem do checks basing on block device major & minor
numbers. It means that'we can't just hook fsconfig() and check all
options on time. We have to postpone this check at some "final stage"
when we know source blkdev.
Okay, what we will do?
1. Let's modify fsconfig() syscall implementation, so we
will catch filesystems of our interest (backed by block device
+ !ve_is_super) and do not paththrough parameters to the filesystem directly
but just save all user-specified parameters to temporary
buffer fc->lazy_opts.
2. On FSCONFIG_CMD_CREATE stage we will take our builded mount options
string fc->lazy_opts and check all parameters using ve_devmnt_process
function.
Is this ideal solution? No.
That's why:
1. fsconfig provides filesystem developers a way to get non-standard forms
of filesystem mount options. For instance, filesystem may get file descriptor,
path, binary blob. Of course in such cases we can't just concat such data in
the fc->lazy_opts. This approach works only if filesystem uses classical parameters
like flags and strings. Fortunately, xfs and ext4 in this set of fs'es.
2. Not simple changes in the fsopen code
https://jira.sw.ru/browse/PSBM-133521
[1] http://kernsec.org/pipermail/linux-security-module-archive/2019-February/011442.html
Cc: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Cc: Kirill Tkhai <ktkhai at virtuozzo.com>
Cc: Cyrill Gorcunov <gorcunov at gmail.com>
Cc: Konstantin Khorenko <khorenko at virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
---
fs/fs_context.c | 198 +++++++++++++++++++++++++++++++++++++
fs/fsopen.c | 31 +++++-
include/linux/fs_context.h | 9 ++
3 files changed, 237 insertions(+), 1 deletion(-)
diff --git a/fs/fs_context.c b/fs/fs_context.c
index de1985eae535..06fcee02a977 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
#include <linux/fs.h>
+#include <linux/blkdev.h>
#include <linux/mount.h>
#include <linux/nsproxy.h>
#include <linux/slab.h>
@@ -19,6 +20,7 @@
#include <linux/mnt_namespace.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/ve.h>
#include <net/net_namespace.h>
#include <asm/sections.h>
#include "mount.h"
@@ -160,6 +162,148 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
}
EXPORT_SYMBOL(vfs_parse_fs_param);
+#ifdef CONFIG_VE
+int __vfs_add_monolithic_fs_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ char *opts = fc->lazy_opts;
+ size_t free_size;
+
+ BUG_ON(!fc->lazy_opts);
+
+ /* lazy_opts length + place for 0-byte */
+ free_size = PAGE_SIZE - (strlen(fc->lazy_opts) + 1);
+
+ /* param->key length + sizeof(',') */
+ if (free_size < (strlen(param->key) + 1))
+ return -ENOMEM;
+
+ strcat(opts, param->key);
+
+ if (param->type == fs_value_is_string) {
+ /* param->string length + sizeof('=') */
+ if (free_size < (strlen(param->string) + 1))
+ return -ENOMEM;
+
+ strcat(opts, "=");
+ strcat(opts, param->string);
+ }
+
+ strcat(opts, ",");
+
+ return 0;
+}
+
+static inline int fscontext_lookup_bdev(struct fs_context *fc, dev_t *s_dev)
+{
+ dev_t bd_dev;
+ int ret;
+
+ /* can we reach fc->root->d_sb->s_dev ? */
+ if (fc->root && fc->root->d_sb && fc->root->d_sb) {
+ if (s_dev)
+ *s_dev = fc->root->d_sb->s_dev;
+
+ return 0;
+ }
+
+ if (fc->source) {
+ ret = lookup_bdev(fc->source, &bd_dev);
+ if (ret)
+ return 1;
+
+ if (s_dev)
+ *s_dev = bd_dev;
+
+ return 0;
+ }
+
+ return 1;
+}
+
+static int fscontext_init_lazy_opts(struct fs_context *fc)
+{
+ struct ve_struct *ve = get_exec_env();
+ int fs_flags;
+ char *opts;
+
+ if (ve_is_super(ve))
+ return 0;
+
+ /* Currently, fc->fs_type is filled in fsopen(),
+ * so this situation is impossible, but let's
+ * be cautious as fs_context implementation may
+ * be changed in the future.
+ */
+ if (!fc->fs_type) {
+ WARN_ON(1);
+ return 0;
+ }
+
+ fs_flags = fc->fs_type->fs_flags;
+
+ /* If we already know block device then we can do devmnt checks,
+ * if fs doesn't require block dev we have to skip it
+ */
+ if (!(fs_flags & FS_REQUIRES_DEV) || !fscontext_lookup_bdev(fc, NULL))
+ return 0;
+
+ /* we interested in filesystems which can be mounted from inside VE
+ */
+ if (!(fs_flags & FS_VIRTUALIZED || fs_flags & FS_VE_MOUNT))
+ return 0;
+
+ fc->lazy_opts = (void *)__get_free_page(GFP_KERNEL);
+ if (!fc->lazy_opts)
+ return -ENOMEM;
+
+ opts = fc->lazy_opts;
+
+ /* we will use fc->lazy_opts as a string */
+ *opts = 0;
+
+ return 0;
+}
+
+int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct ve_struct *ve = get_exec_env();
+
+ /* source parameter should be passed over lazy opts, as it
+ * describes source bdev for fs
+ */
+ if (!strcmp(param->key, "source"))
+ goto non_lazy_way;
+
+ if (!(param->type == fs_value_is_flag ||
+ param->type == fs_value_is_string)) {
+ /* In case when we can't serialize fs parameters into
+ * a string it means that this fs uses blob parameters,
+ * or fd parameters, or something similar. We can't use
+ * devmnt to control such mounts.
+ */
+ ve_pr_warn_ratelimited(VE0_LOG, "VE%s: can't do devmnt checks "
+ "for fstype %s (param key %s, param type %d)\n",
+ ve->ve_name,
+ fc->fs_type->name,
+ param->key,
+ param->type
+ );
+
+ return -EPERM;
+ }
+
+ /* Okay, we here it means that we want to collect all mount
+ * parameters, and then check it all at one time when
+ * fc->source will be known.
+ */
+ return __vfs_add_monolithic_fs_param(fc, param);
+
+non_lazy_way:
+ return vfs_parse_fs_param(fc, param);
+}
+EXPORT_SYMBOL(vfs_parse_fs_param_lazy);
+#endif
+
/**
* vfs_parse_fs_string - Convenience function to just parse a string.
*/
@@ -202,6 +346,36 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
{
char *options = data, *key;
int ret = 0;
+#ifdef CONFIG_VE
+ void *options_orig = options;
+ void *options_after;
+ struct ve_struct *ve = get_exec_env();
+
+ if (!ve_is_super(ve) && (fc->fs_type->fs_flags & FS_REQUIRES_DEV)) {
+ dev_t bd_dev;
+
+ ret = fscontext_lookup_bdev(fc, &bd_dev);
+ if (ret) {
+ errorf(fc, "%s: Can't lookup blockdev", fc->source);
+ return -ENODEV;
+ }
+
+ ret = ve_devmnt_process(ve, bd_dev, (void **) &options,
+ fc->phase == FS_CONTEXT_RECONFIGURING);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Very dangerous place.
+ * 1. ve_devmnt_process may alloc new page and write address to options variable.
+ * 2. we have to detect such case and call free_page() if so, but
+ * 3. we can free page after all options processing, but during that
+ * strsep() function modifies options variable too. Ugh.
+ * Let's save page address to options_after and use it to detect new page allocation.
+ */
+ options_after = options;
+#endif
if (!options)
return 0;
@@ -227,6 +401,11 @@ int generic_parse_monolithic(struct fs_context *fc, void *data)
}
}
+#ifdef CONFIG_VE
+ if (options_after != options_orig)
+ free_page((unsigned long)options_after);
+#endif
+
return ret;
}
EXPORT_SYMBOL(generic_parse_monolithic);
@@ -290,6 +469,18 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
ret = init_fs_context(fc);
if (ret < 0)
goto err_fc;
+
+#ifdef CONFIG_VE
+ if (purpose == FS_CONTEXT_FOR_MOUNT ||
+ purpose == FS_CONTEXT_FOR_RECONFIGURE) {
+ ret = fscontext_init_lazy_opts(fc);
+ if (ret < 0)
+ goto err_fc;
+ }
+#else
+ fc->lazy_opts = NULL;
+#endif
+
fc->need_free = true;
return fc;
@@ -351,6 +542,7 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
mutex_init(&fc->uapi_mutex);
fc->fs_private = NULL;
+ fc->lazy_opts = NULL;
fc->s_fs_info = NULL;
fc->source = NULL;
fc->security = NULL;
@@ -474,6 +666,8 @@ void put_fs_context(struct fs_context *fc)
put_cred(fc->cred);
put_fc_log(fc);
put_filesystem(fc->fs_type);
+ if (fc->lazy_opts)
+ free_page((unsigned long)fc->lazy_opts);
kfree(fc->source);
kfree(fc);
}
@@ -689,6 +883,10 @@ void vfs_clean_context(struct fs_context *fc)
fc->s_fs_info = NULL;
fc->sb_flags = 0;
security_free_mnt_opts(&fc->security);
+ if (fc->lazy_opts) {
+ free_page((unsigned long)fc->lazy_opts);
+ fc->lazy_opts = NULL;
+ }
kfree(fc->source);
fc->source = NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 27a890aa493a..7c872191fa45 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -209,6 +209,25 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
return ret;
}
+static int fscontext_finalize_lazy_opts(struct fs_context *fc)
+{
+#ifdef CONFIG_VE
+ if (fc->lazy_opts) {
+ int ret;
+
+ /* Now we have fc->source set and can do all delayed
+ * devmnt checks. We need just to call
+ * generic_parse_monolithic on our fc->lazy_opts.
+ */
+ ret = generic_parse_monolithic(fc, fc->lazy_opts);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+#endif
+}
+
/*
* Check the state and apply the configuration. Note that this function is
* allowed to 'steal' the value by setting param->xxx to NULL before returning.
@@ -228,6 +247,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
return -EBUSY;
if (!mount_capable(fc))
return -EPERM;
+
+ ret = fscontext_finalize_lazy_opts(fc);
+ if (ret)
+ return ret;
+
fc->phase = FS_CONTEXT_CREATING;
ret = vfs_get_tree(fc);
if (ret)
@@ -250,6 +274,11 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
ret = -EPERM;
break;
}
+
+ ret = fscontext_finalize_lazy_opts(fc);
+ if (ret)
+ return ret;
+
down_write(&sb->s_umount);
ret = reconfigure_super(fc);
up_write(&sb->s_umount);
@@ -262,7 +291,7 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
fc->phase != FS_CONTEXT_RECONF_PARAMS)
return -EBUSY;
- return vfs_parse_fs_param(fc, param);
+ return vfs_parse_fs_param_lazy(fc, param);
}
fc->phase = FS_CONTEXT_FAILED;
return ret;
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 6b54982fc5f3..9d966c85d0a1 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -92,6 +92,7 @@ struct fs_context {
struct mutex uapi_mutex; /* Userspace access mutex */
struct file_system_type *fs_type;
void *fs_private; /* The filesystem's context */
+ void *lazy_opts; /* mount options which can't be checked at fsconfig() time */
void *sget_key;
struct dentry *root; /* The root and superblock */
struct user_namespace *user_ns; /* The user namespace for this mount */
@@ -134,6 +135,14 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
+#ifdef CONFIG_VE
+extern int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param);
+#else
+static inline int vfs_parse_fs_param_lazy(struct fs_context *fc, struct fs_parameter *param)
+{
+ return vfs_parse_fs_param(fc, param);
+}
+#endif
extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
const char *value, size_t v_size);
extern int generic_parse_monolithic(struct fs_context *fc, void *data);
--
2.27.0
More information about the Devel
mailing list