[Devel] [PATCH RHEL8 COMMIT] ve/fs/binfmt: virtualization
Konstantin Khorenko
khorenko at virtuozzo.com
Thu May 28 12:24:18 MSK 2020
The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.9
------>
commit 0bf21f8b74cac77825663cfb6502707199448409
Author: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
Date: Thu May 28 12:24:17 2020 +0300
ve/fs/binfmt: virtualization
* keep reference from binfmt_misc sb to ve
* store pointer to binfmt_misc data in ve->binfmt_misc
Here bm_put_super() can race with load_misc_binary() caller, which is working
with get_exec_env()->binfmt_misc.
Will be fixed separately.
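
For illustration only, the guard that load_misc_binary() applies to the
per-VE pointer can be modeled in userspace as below. This is a sketch, not
kernel code: the structures are reduced stand-ins, and the helper name
misc_binary_precheck() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced userspace stand-ins for the kernel structures. */
struct binfmt_misc {
	int enabled;
	int entry_count;
};

struct ve_struct {
	/* NULL until binfmt_misc has been mounted in this VE at least once */
	struct binfmt_misc *binfmt_misc;
};

/*
 * Hypothetical helper mirroring the check at the top of
 * load_misc_binary(): no per-VE data, or a disabled instance,
 * means the binary is not handled by binfmt_misc.
 */
static int misc_binary_precheck(const struct ve_struct *ve)
{
	const struct binfmt_misc *bm_data = ve->binfmt_misc;

	if (!bm_data || !bm_data->enabled)
		return -ENOEXEC;
	return 0;
}
```

The model does not cover the race itself: nothing here prevents "enabled"
from changing between the check and later use of bm_data.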
Signed-off-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
+++
VE/BINFTM: fix destruction ordering
kill binfmt_data together with ve_struct
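
A minimal userspace sketch of the intended lifetime, assuming the data is
allocated lazily on first mount, only disabled on unmount, and freed when
the VE itself is destroyed. The model_*() names are hypothetical, and
calloc()/free() stand in for kzalloc()/kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct binfmt_misc {
	int enabled;
	int entry_count;
};

struct ve_struct {
	struct binfmt_misc *binfmt_misc;
};

/* First mount allocates the per-VE data; later mounts reuse it. */
static int model_mount(struct ve_struct *ve)
{
	if (!ve->binfmt_misc) {
		ve->binfmt_misc = calloc(1, sizeof(*ve->binfmt_misc));
		if (!ve->binfmt_misc)
			return -ENOMEM;
	}
	ve->binfmt_misc->enabled = 1;
	return 0;
}

/* Unmount only disables the instance; the data stays with the VE. */
static void model_umount(struct ve_struct *ve)
{
	ve->binfmt_misc->enabled = 0;
}

/*
 * VE destruction frees the data. Resetting the pointer is only for
 * the model, so that callers can observe the release.
 */
static void model_ve_destroy(struct ve_struct *ve)
{
	free(ve->binfmt_misc);
	ve->binfmt_misc = NULL;
}
```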
Signed-off-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
+++
ve/binfmt_misc: fix compilation outside CONFIG_BINFMT_MISC
Fix the following compile error:
kernel/ve/ve.c: In function ‘ve_destroy’:
kernel/ve/ve.c:709:10: error: ‘struct ve_struct’ has no member named ‘binfmt_misc’
kfree(ve->binfmt_misc);
Fixes 4b7d610d45498ac733e92024097dc99402476b27
("VE/BINFTM: fix destruction ordering").
Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
+++
ve/binfmt_misc: do not use sb->s_fs_info
Patchset description:
zap sb->s_ns + fix memleak in binfmt_misc
Vladimir Davydov (6):
binfmt_misc: do not use sb->s_fs_info
Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls"
Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super"
binfmt_misc: do not use s_ns
binfmt_misc: destroy all nodes on ve stop
https://jira.sw.ru/browse/PSBM-39154
Reviewed-by: Cyrill Gorcunov <gorcunov at virtuozzo.com>
======================
This patch description:
When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
to binfmt_misc struct. At the same time, we store a pointer to the owner
ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
ve_struct->binfmt_misc. That said, we don't actually need to use
s_fs_info, because we can get the binfmt_misc by dereferencing
sb->s_ns->binfmt_misc.
Using sb->s_fs_info instead of sb->s_ns will allow us to revert our
patches introducing sb->s_ns.
This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
+++
ve/binfmt_misc: do not use s_ns
Patchset description:
zap sb->s_ns + fix memleak in binfmt_misc
Vladimir Davydov (6):
binfmt_misc: do not use sb->s_fs_info
Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls"
Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super"
binfmt_misc: do not use s_ns
binfmt_misc: destroy all nodes on ve stop
https://jira.sw.ru/browse/PSBM-39154
Reviewed-by: Cyrill Gorcunov <gorcunov at virtuozzo.com>
======================
This patch description:
Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
pointer to the namespace.
This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
+++
ve/fs: Use ve_printk in fs/binfmt_aout.c
This is a part of 74-diff-ve-mix-combined.
https://jira.sw.ru/browse/PSBM-17903
Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
======================
ve/fs: Allow to mount binfmt_misc under non-root ns
https://jira.sw.ru/browse/PSBM-40100
v2: Check that user_ns is initial for the ve.
v3: Be sure ve->init_cred is set.
Signed-off-by: Kirill Tkhai <ktkhai at odin.com>
Acked-by: Vladimir Davydov <vdavydov at virtuozzo.com>
khorenko@: in fact, we allowed those mounts only in the top CT user ns.
======================
ve/binfmt_misc: Allow mount if capable(CAP_SYS_ADMIN)
This patch allows binfmt_misc to be mounted in a CT by a task that holds
ve0's admin capabilities, which is needed for CRIU dump: at that point an
unmounted binfmt_misc may have to be force-mounted back, and we don't want
to change CRIU's user_ns to do that.
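
As a sketch of the resulting policy, the check at the top of bm_mount() can
be modeled as a pure function of two inputs. The function name and its
boolean parameters are hypothetical; in the kernel the inputs come from
current_user_ns_initial() and capable(CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the permission check in bm_mount():
 * the mount is refused only when the caller is neither in the VE's
 * initial user namespace nor holding CAP_SYS_ADMIN.
 */
static int bm_mount_permitted(bool user_ns_is_ve_initial,
			      bool has_cap_sys_admin)
{
	if (!user_ns_is_ve_initial && !has_cap_sys_admin)
		return -EPERM;
	return 0;
}
```

Under this model, a CRIU process with ve0's admin capabilities passes the
check even though its user_ns is not the CT's initial one.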
https://jira.sw.ru/browse/PSBM-47737
Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
+++
ve/fs/binfmt_misc: store link to ve in sb->s_fs_info
Fixes: 70a53f72f929 ("ve/fs/binfmt: store link to ve in sb->s_fs_info")
https://jira.sw.ru/browse/PSBM-85685
Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
+++
Overrides:
ve/fs/binfmt: store link to ve in sb->s_fs_info
After rebase to RHEL7.5 sb->s_fs_info by default contains a link to "ns"
provided to mount_ns(), but in binfmt_misc code we need a link to ve
there, so adjust bm_fill_super() accordingly.
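
To show how the lookup works once s_fs_info holds the ve, here is a
userspace sketch built around the BINFMT_MISC() macro from the diff below.
The structures are reduced stand-ins, and bm_status_string() is a
hypothetical helper modeled on what bm_status_read() reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced userspace stand-ins for the kernel structures. */
struct binfmt_misc {
	int enabled;
};

struct ve_struct {
	struct binfmt_misc *binfmt_misc;
};

struct super_block {
	void *s_fs_info;	/* holds a pointer to the owning ve_struct */
};

/* Same definition as in the patch: sb -> ve -> binfmt_misc. */
#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)

/* Hypothetical helper: the status text for a given superblock. */
static const char *bm_status_string(struct super_block *sb)
{
	struct binfmt_misc *bm_data = BINFMT_MISC(sb);

	return bm_data->enabled ? "enabled\n" : "disabled\n";
}
```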
https://jira.sw.ru/browse/PSBM-85052
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
(cherry picked from commit bd9b1e8d6f856300df13e955340ed5a2e89d1b56 due
to bug https://jira.sw.ru/browse/PSBM-103973)
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
---
fs/binfmt_aout.c | 6 +++---
fs/binfmt_misc.c | 63 +++++++++++++++++++++++++++++++++++++++++-------------
include/linux/ve.h | 4 ++++
kernel/ve/ve.c | 3 +++
4 files changed, 58 insertions(+), 18 deletions(-)
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index c3deb2e35f20..c052299136ed 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -283,12 +283,12 @@ static int load_aout_binary(struct linux_binprm * bprm)
if ((ex.a_text & 0xfff || ex.a_data & 0xfff) &&
(N_MAGIC(ex) != NMAGIC) && printk_ratelimit())
{
- printk(KERN_NOTICE "executable not page aligned\n");
+ ve_printk(VE_LOG, KERN_NOTICE "executable not page aligned\n");
}
if ((fd_offset & ~PAGE_MASK) != 0 && printk_ratelimit())
{
- printk(KERN_WARNING
+ ve_printk(VE_LOG, KERN_WARNING
"fd_offset is not page aligned. Please convert program: %pD\n",
bprm->file);
}
@@ -376,7 +376,7 @@ static int load_aout_library(struct file *file)
if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) {
if (printk_ratelimit())
{
- printk(KERN_WARNING
+ ve_printk(VE_LOG, KERN_WARNING
"N_TXTOFF is not page aligned. Please convert library: %pD\n",
file);
}
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 97bd44f1b297..0288a0ee04ac 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -24,6 +24,8 @@
#include <linux/mount.h>
#include <linux/syscalls.h>
#include <linux/fs.h>
+#include <linux/ve.h>
+
#include <linux/uaccess.h>
#include "internal.h"
@@ -68,11 +70,7 @@ struct binfmt_misc {
int entry_count;
};
-struct binfmt_misc binfmt_data = {
- .entries = LIST_HEAD_INIT(binfmt_data.entries),
- .enabled = 1,
- .entries_lock = __RW_LOCK_UNLOCKED(binfmt_data.entries_lock),
-};
+#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
/*
* Max length of the register string. Determined by:
@@ -142,7 +140,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
struct file *interp_file = NULL;
int retval;
int fd_binary = -1;
- struct binfmt_misc *bm_data = &binfmt_data;
+ struct binfmt_misc *bm_data = get_exec_env()->binfmt_misc;
retval = -ENOEXEC;
if (!bm_data || !bm_data->enabled)
@@ -662,7 +660,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
Node *e = file_inode(file)->i_private;
int res = parse_command(buffer, count);
struct super_block *sb = file->f_path.dentry->d_sb;
- struct binfmt_misc *bm_data = sb->s_fs_info;
+ struct binfmt_misc *bm_data = BINFMT_MISC(sb);
switch (res) {
case 1:
@@ -704,8 +702,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
Node *e;
struct inode *inode;
struct super_block *sb = file_inode(file)->i_sb;
- struct binfmt_misc *bm_data = sb->s_fs_info;
struct dentry *root = sb->s_root, *dentry;
+ struct binfmt_misc *bm_data = BINFMT_MISC(sb);
int err = 0;
e = create_entry(buffer, count);
@@ -783,7 +781,7 @@ static const struct file_operations bm_register_operations = {
static ssize_t
bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
{
- struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+ struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -792,7 +790,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
static ssize_t bm_status_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
- struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+ struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
int res = parse_command(buffer, count);
struct dentry *root;
@@ -831,9 +829,19 @@ static const struct file_operations bm_status_operations = {
/* Superblock handling */
+static void bm_put_super(struct super_block *sb)
+{
+ struct binfmt_misc *bm_data = BINFMT_MISC(sb);
+ struct ve_struct *ve = sb->s_fs_info;
+
+ bm_data->enabled = 0;
+ put_ve(ve);
+}
+
static const struct super_operations s_ops = {
.statfs = simple_statfs,
.evict_inode = bm_evict_inode,
+ .put_super = bm_put_super,
};
static int bm_fill_super(struct super_block *sb, void *data, int silent)
@@ -845,18 +853,42 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
/* last one */ {""}
};
+ struct ve_struct *ve = data;
+ struct binfmt_misc *bm_data = ve->binfmt_misc;
+
+ if (!bm_data) {
+ bm_data = kzalloc(sizeof(struct binfmt_misc), GFP_KERNEL);
+ if (!bm_data)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&bm_data->entries);
+ rwlock_init(&bm_data->entries_lock);
+
+ ve->binfmt_misc = bm_data;
+ }
+
err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
- if (!err) {
- sb->s_op = &s_ops;
- sb->s_fs_info = &binfmt_data;
+ if (err) {
+ kfree(bm_data);
+ return err;
}
- return err;
+
+ sb->s_op = &s_ops;
+
+ bm_data->enabled = 1;
+ get_ve(ve);
+
+ return 0;
}
static struct dentry *bm_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
- return mount_single(fs_type, flags, data, bm_fill_super);
+ if (!current_user_ns_initial() && !capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+
+ return mount_ns(fs_type, flags, get_exec_env(), get_exec_env(),
+ current_user_ns(), bm_fill_super);
}
static struct linux_binfmt misc_format = {
@@ -869,6 +901,7 @@ static struct file_system_type bm_fs_type = {
.name = "binfmt_misc",
.mount = bm_mount,
.kill_sb = kill_litter_super,
+ .fs_flags = FS_VIRTUALIZED | FS_USERNS_MOUNT,
};
MODULE_ALIAS_FS("binfmt_misc");
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 7eaa0421d689..ba84d3058ad2 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -74,6 +74,10 @@ struct ve_struct {
struct super_block *dev_sb;
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+ struct binfmt_misc *binfmt_misc;
+#endif
+
struct kmapset_key sysfs_perms_key;
atomic_t netns_avail_nr;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 8b83058b1423..bf9f06db7cff 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -624,6 +624,9 @@ static void ve_destroy(struct cgroup_subsys_state *css)
kmapset_unlink(&ve->sysfs_perms_key, &sysfs_ve_perms_set);
ve_log_destroy(ve);
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+ kfree(ve->binfmt_misc);
+#endif
free_percpu(ve->sched_lat_ve.cur);
kmem_cache_free(ve_cachep, ve);
}