[Devel] [PATCH RH9 14/14] ve/vfs: introduce "fs.odirect_enable" sysctl and disable it by default

Andrey Zhadchenko andrey.zhadchenko at virtuozzo.com
Mon Oct 4 12:17:29 MSK 2021


From: Konstantin Khorenko <khorenko at virtuozzo.com>

We've observed that on a node with many Containers, even a small amount
of direct disk I/O in each CT brings the whole node to its knees
(100 CTs, 5 lines of logs written every 20-30 seconds).
Admittedly, the node had slow HDDs.

Note that disabling O_DIRECT significantly slows down async reads: they
work only with direct I/O; if issued in cached mode, they effectively
become synchronous when there is more than one writer.

Example:
 # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
   --name=test --filename=test --bs=4k --iodepth=64 --size=1G \
   --readwrite=randrw --rwmixread=75

The VPS here reached 20MB/s read and 6.8MB/s write, while another VPS
(with O_DIRECT enabled) reached 230MB/s read and 76MB/s write.

The root cause is known: libaio becomes synchronous with cached I/O.

So userspace had better check whether the underlying disk is fast enough
and enable O_DIRECT in those cases.

https://jira.sw.ru/browse/PSBM-53458
https://jira.sw.ru/browse/PSBM-68005
https://jira.sw.ru/browse/PSBM-68656
https://jira.sw.ru/browse/PSBM-100671
https://jira.sw.ru/browse/PSBM-104338

Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

===============================================================
===============================================================
Original commit message:

 commit f5829bccbd390437013bd914d68caabf79d09b3e
 Author: Konstantin Khorenko <khorenko at virtuozzo.com>
 Date:   Mon Dec 11 23:00:45 2017 +0300

    ve/fs: introduce "fs.fsync-enable" and "fs.odirect_enable" sysctls

    ve/vfs: introduce "odirect_enable" sysctl and disable it by default

    khorenko@: we want to disable direct access from inside a Container
            because only a limited number of direct requests is available
            on the system (128), and when they are all busy the next request
            is served only after some request completes.
            There is no scheduler at this level => DDoS is possible
            from inside a CT: just run _many_ processes writing with O_DIRECT.

    diff-vfs-odirect-enable && diff-vfs-odirect-enable-location-fix

    Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>

    +++
    ve/fs: Port fs.fsync-enable and fs.odirect_enable sysctls

    This is a part of 74-diff-ve-mix-combined.

    https://jira.sw.ru/browse/PSBM-17903

    Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>

    =====================================================

    ve/fs: check container odirect and fsync settings in __dentry_open

    sys_open for conventional filesystems doesn't call dentry_open,
    it calls __dentry_open (in nameidata_to_filp), so we have to move
    checks for odirect and fsync behaviour to __dentry_open
    to make them work on ploop containers.

    https://jira.sw.ru/browse/PSBM-17157

    Signed-off-by: Dmitry Guryanov <dguryanov at parallels.com>

    Acked-by: Dmitry Monakhov <dmonakhov at openvz.org>
    Signed-off-by: Dmitry Monakhov <dmonakhov at openvz.org>

    ================================================

    ve: initialize fsync_enable also for non ve0 environment

    Patchset description:

    ve: fix initialization and remove sysctl_fsync_enable

    v2:
    - initialize only on ve cgroup creation, remove get_ve_features
    - rename setup_iptables_mask into ve_setup_iptables_mask

    https://jira.sw.ru/browse/PSBM-34286
    https://jira.sw.ru/browse/PSBM-34285

    Pavel Tikhomirov (4):
      ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead
      ve: initialize fsync_enable also for non ve0 environment
      ve: iptables: fix mask initialization and changing
      ve: cgroup: initialize odirect_enable, features and _randomize_va_space

    =====================================================================
    This patch description:

    v2: only on ve cgroup creation

    https://jira.sw.ru/browse/PSBM-34286
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    Acked-by: Dmitry Monakhov <dmonakhov at openvz.org>

(cherry picked from vz8 commit 166db3147c1b29b4247e50eeae6c18f4ca88c162)
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
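
Note (not part of the commit): the decision order implemented by
may_use_odirect() below can be modelled in plain userspace C. This is
only a sketch for reasoning about the three sysctl values. The struct
ve_model type, its fields and model_may_use_odirect() are hypothetical
stand-ins for ve_is_super(), capable(CAP_SYS_RAWIO) and the
odirect_enable fields; they are not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state may_use_odirect() consults. */
struct ve_model {
	bool is_host;        /* models ve_is_super(get_exec_env()) */
	bool has_rawio;      /* models capable(CAP_SYS_RAWIO) */
	int  odirect_enable; /* models get_exec_env()->odirect_enable */
};

/*
 * Same order of checks as the patch: the host always wins, then
 * CAP_SYS_RAWIO, then the Container's own value, where 2 defers to
 * the host's value (get_ve0()->odirect_enable in the patch).
 * A non-zero result means O_DIRECT is honored.
 */
static int model_may_use_odirect(const struct ve_model *ve,
				 int host_odirect_enable)
{
	int may;

	if (ve->is_host)
		return 1;

	may = ve->has_rawio;
	if (!may) {
		may = ve->odirect_enable;
		if (may == 2)
			may = host_odirect_enable;
	}

	return may;
}
```

With the defaults in this patch (Container value 2, and a host value
that is presumably 0 because ve0 is never assigned one here), the model
returns 0, so the flag is dropped. Setting the host value to 1 flips
every Container that is still at 2.
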
 fs/fcntl.c         | 30 ++++++++++++++++++++++++++++++
 fs/open.c          |  3 +++
 include/linux/fs.h |  2 ++
 include/linux/ve.h |  1 +
 kernel/sysctl.c    |  7 +++++++
 kernel/ve/ve.c     |  2 ++
 6 files changed, 45 insertions(+)
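
Note (not part of the commit): because the flag is cleared silently,
open() still succeeds inside a Container, and an application cannot
tell from the return value alone whether its I/O is direct. A minimal
sketch of a userspace probe follows, assuming that F_GETFL reports the
f_flags value that do_dentry_open() has already filtered.
odirect_honored() is a hypothetical helper name, not an existing API.

```c
#define _GNU_SOURCE /* glibc exposes O_DIRECT only with this defined */
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns  1 if the descriptor kept O_DIRECT,
 *          0 if open() succeeded but the flag is absent,
 *         -1 if open() or fcntl() failed (errno is left set).
 */
static int odirect_honored(const char *path)
{
	int flags;
	int fd = open(path, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;

	flags = fcntl(fd, F_GETFL);
	close(fd);
	if (flags < 0)
		return -1;

	return (flags & O_DIRECT) ? 1 : 0;
}
```

A result of 0 would tell an libaio user that its requests go through
the page cache and may therefore complete synchronously, as described
in the commit message. A result of -1 can also mean that the
filesystem itself rejects O_DIRECT, which is unrelated to this sysctl.
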

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 714e7c9..2e0c851 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -26,6 +26,7 @@
 #include <linux/memfd.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
+#include <linux/ve.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -33,11 +34,40 @@
 
 #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
 
+/*
+ * Host is always allowed to use O_DIRECT.
+ * Host's value of sysctl "fs.odirect_enable" might affect Containers only.
+ *
+ * Container's "fs.odirect_enable" sysctl value means:
+ *  0: Container ignores O_DIRECT flag
+ *  1: Container honors  O_DIRECT flag (in fact, any X>0 && X != 2)
+ *  2: Container checks the host's sysctl value and work according it
+ */
+int may_use_odirect(void)
+{
+	int may;
+
+	if (ve_is_super(get_exec_env()))
+		return 1;
+
+	may = capable(CAP_SYS_RAWIO);
+	if (!may) {
+		may = get_exec_env()->odirect_enable;
+		if (may == 2)
+			may = get_ve0()->odirect_enable;
+	}
+
+	return may;
+}
+
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
 	struct inode * inode = file_inode(filp);
 	int error = 0;
 
+	if (!may_use_odirect())
+		arg &= ~O_DIRECT;
+
 	/*
 	 * O_APPEND cannot be cleared if the file is marked as append-only
 	 * and the file is open for write.
diff --git a/fs/open.c b/fs/open.c
index 21c9411..040df8b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -782,6 +782,9 @@ static int do_dentry_open(struct file *f,
 	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
 	f->f_sb_err = file_sample_sb_err(f);
 
+	if (!may_use_odirect())
+		f->f_flags &= ~O_DIRECT;
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH | FMODE_OPENED;
 		f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8d86a85..7b34c55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -186,6 +186,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+extern int may_use_odirect(void);
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 2cf4a01..829a21d 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -54,6 +54,7 @@ struct ve_struct {
 #define VE_LOG_BUF_LEN		4096
 
 	struct kstat_lat_pcpu_struct    sched_lat_ve;
+	int			odirect_enable;
 
 #if IS_ENABLED(CONFIG_BINFMT_MISC)
 	struct binfmt_misc	*binfmt_misc;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9321aa7..55054f1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3444,6 +3444,13 @@ int proc_do_static_key(struct ctl_table *table, int write,
 	},
 #endif
 	{
+		.procname	= "odirect_enable",
+		.data		= &ve0.odirect_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644 | S_ISVTX,
+		.proc_handler	= proc_dointvec_virtual,
+	},
+	{
 		.procname	= "pipe-max-size",
 		.data		= &pipe_max_size,
 		.maxlen		= sizeof(pipe_max_size),
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aae2e51..8d860ab 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -588,6 +588,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_
 
 	ve->meminfo_val = VE_MEMINFO_DEFAULT;
 
+	ve->odirect_enable = 2;
+
 	atomic_set(&ve->netns_avail_nr, NETNS_MAX_NR_DEFAULT);
 	ve->netns_max_nr = NETNS_MAX_NR_DEFAULT;
 
-- 
1.8.3.1


