[Devel] [PATCH VZ9 10/11] drivers/vhost: add ioctl to increase the number of workers

Andrey Zhadchenko andrey.zhadchenko at virtuozzo.com
Thu Jan 4 20:02:19 MSK 2024


Finally, add an ioctl to allow userspace to create additional workers.
For now, only allow increasing the number of workers.
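
For illustration, userspace could call the new ioctl like this (a
sketch; the fd setup and the worker count are assumptions, not part of
this patch):

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	int nworkers = 4;	/* silently capped at the number of vqs */

	/* vhost_fd: an open vhost device fd, owner already set */
	if (ioctl(vhost_fd, VHOST_SET_NWORKERS, &nworkers) < 0)
		perror("VHOST_SET_NWORKERS");	/* EINVAL once already set */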

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>

======
Patchset description:
vhost-blk: in-kernel accelerator for virtio-blk guests

Although QEMU virtio-blk is quite fast, there is still some room for
improvement. Disk latency can be reduced if we handle virtio-blk
requests in the host kernel, saving a lot of syscalls and context
switches.
The idea is quite simple - QEMU gives us a block device and we translate
any incoming virtio requests into bios and push them into the bdev.
The biggest disadvantage of this vhost-blk flavor is that it only
supports the raw format.
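
That per-request path can be pictured with a minimal sketch (assumed
names and a recent bio_alloc() signature; this is not the actual module
code):

	/* Submit one guest data page to the backing bdev. */
	static void vhost_blk_sketch_submit(struct block_device *bdev,
					    struct page *page,
					    unsigned int len, unsigned int off,
					    sector_t sector,
					    bio_end_io_t *end_io)
	{
		struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);

		bio->bi_iter.bi_sector = sector;
		bio->bi_end_io = end_io;	/* completes the virtio request */
		bio_add_page(bio, page, len, off);
		submit_bio(bio);
	}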

Luckily, Kirill Thai proposed a device mapper driver for the QCOW2
format that attaches files as block devices:
https://www.spinics.net/lists/kernel/msg4292965.html

Also, by using kernel modules we can bypass the iothread limitation and
finally scale block requests with CPUs for high-performance devices.

There have already been several attempts to write vhost-blk:

Asias'   version:	https://lkml.org/lkml/2012/12/1/174
Badari's version:	https://lwn.net/Articles/379864/
Vitaly's version:	https://lwn.net/Articles/770965/

The main difference between them is the API used to access the backend
file. The fastest one is Asias's version, which uses the bio flavor. It
is also the most reviewed and has the most features, so the vhost_blk
module is partially based on it. Multiple virtqueue support was added,
some places were reworked, and support for several vhost workers was
added.

Test setup and results:
  fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs
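
For instance, a full invocation against the guest disk could look like
this (the filename, job name, and runtime are assumptions, not part of
the original runs):

  fio --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=128 \
      --filename=/dev/vda --name=test --runtime=60 --time_based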

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           |          95.8k |           85.3k |
QEMU virtio    |          57.5k |           79.4k |
QEMU vhost-blk |          95.6k |           84.3k |

RAMDISK (vq == vcpu):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    |           123k |            129k |
virtio, 2vcpu    |      253k (??) |       250k (??) |
virtio, 4vcpu    |           158k |            154k |
vhost-blk, 1vcpu |           110k |            113k |
vhost-blk, 2vcpu |           247k |            252k |
vhost-blk, 8vcpu |           497k |            469k | *single kernel thread
vhost-blk, 8vcpu |           730k |            701k | *two kernel threads

v2:

patch 1/10
 - removed unused VHOST_BLK_VQ
 - reworked bio handling a bit: now add all pages from a single iov into
   a single bio instead of allocating one bio per page
 - changed how the sector increment is calculated
 - check the result of move_iovec() in vhost_blk_req_handle()
 - removed the snprintf check and improved checking of the value
   returned by copy_to_iter() for VIRTIO_BLK_ID_BYTES requests
 - discard the vq request if vhost_blk_req_handle() returned a negative
   code
 - forbid changing a nonzero backend in vhost_blk_set_backend() (see
   the sketch after this list). First of all, QEMU sets the backend
   only once. Also, if we wanted to change the backend while requests
   are already running, we would need to be much more careful in
   vhost_blk_handle_guest_kick(), as it does not take any references.
   If userspace really wants to change the backend, it can always
   reset the device.
 - removed EXPERIMENTAL from Kconfig
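
The backend guard might look roughly like this (a sketch; the -EBUSY
code, the label, and the surrounding context are assumptions, not the
exact module code):

	/* In vhost_blk_set_backend(), with the vq mutex held. */
	if (vhost_vq_get_backend(vq)) {
		/* Backend already set: userspace must reset the
		 * device to change it. */
		r = -EBUSY;
		goto out_unlock;
	}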

patch 3/10
 - don't bother with checking dev->workers[0].worker since dev->nworkers
   will always contain 0 in this case

patch 6/10
 - Make the code do what the docs suggest. Previously, the
   ioctl-supplied new number of workers was treated as an amount to be
   added. Use the new number as a ceiling instead and add workers up
   to that number (shown below).
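
In code terms this matches the vhost_set_workers() hunk below, where
the ceiling is applied first:

	if (n > dev->nvqs)
		n = dev->nvqs;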

v3:
patch 1/10
 - reworked bio handling a bit - now create a new bio only if the
   previous one is full

patch 2/10
 - set vq->worker = NULL in vhost_vq_reset()

v4:
patch 1/10
 - vhost_blk_req_done() now won't hide errors for multi-bio requests
 - vhost_blk_prepare_req() now better estimates bio_len
 - alloc bio for max pages_nr_total pages instead of nr_pages
 - added new ioctl VHOST_BLK_SET_SERIAL to set serial
 - reworked the flush algorithm a bit - now use two bins, "new req"
   and "for flush", and swap them at the start of the flush (see the
   sketch after this list)
 - moved the backing file dereference to vhost_blk_req_submit() and
   after the request has been added to the flush bin, to avoid a race
   in vhost_blk_release(). Now even if we drop the backend and start
   a flush, the request will either be tracked by the flush or rolled
   back
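
The two-bin flush mentioned above can be pictured like this (a minimal
sketch with assumed names, not the module's actual fields):

	struct flush_bins {
		spinlock_t lock;
		struct list_head new_reqs;	/* requests still arriving */
		struct list_head for_flush;	/* snapshot being flushed */
	};

	static void flush_swap_bins(struct flush_bins *b)
	{
		spin_lock(&b->lock);
		list_splice_init(&b->new_reqs, &b->for_flush);
		spin_unlock(&b->lock);
		/* drain only b->for_flush; new requests keep landing
		 * in b->new_reqs and are picked up by the next flush */
	}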

patch 2/10
 - moved vq->worker = NULL to patch #7 where this field is introduced.

patch 7/10
 - Set vq->worker = NULL in vhost_vq_reset. This will fix both
   https://jira.sw.ru/browse/PSBM-142058
   https://jira.sw.ru/browse/PSBM-142852

v5:
patch 1/10
 - several codestyle/spacing fixes
 - added WARN_ON() for vhost_blk_flush

https://jira.sw.ru/browse/PSBM-139414
Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

Andrey Zhadchenko (10):
  drivers/vhost: vhost-blk accelerator for virtio-blk guests
  drivers/vhost: use array to store workers
  drivers/vhost: adjust vhost to flush all workers
  drivers/vhost: rework attaching cgroups to be worker aware
  drivers/vhost: rework worker creation
  drivers/vhost: add ioctl to increase the number of workers
  drivers/vhost: assign workers to virtqueues
  drivers/vhost: add API to queue work at virtqueue worker
  drivers/vhost: allow polls to be bound to workers via vqs
  drivers/vhost: queue vhost_blk works at vq workers

Feature: vhost-blk: in-kernel accelerator for virtio-blk guests

--------
Rework the patch to allow setting the number of workers only once.
Merge it with the worker assignment patch, as we now assign workers
only from here (see the example below).
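
For example, with nvqs = 4 and VHOST_SET_NWORKERS = 2, the
vhost_propagate_workers() helper in the hunk below leaves the first two
virtqueues on their own workers and spreads the remaining ones
round-robin:

	vq0 -> worker0  (default worker, created when the owner is set)
	vq1 -> worker1  (created by vhost_set_workers())
	vq2 -> worker0  (propagated, j cycles over workers 0..n-1)
	vq3 -> worker1  (propagated)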

(cherry picked from vz9 commit 3ec868361615ff605778fef583afcaf4e7bc473a)
https://virtuozzo.atlassian.net/browse/PSBM-152375
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
 drivers/vhost/vhost.c      | 52 ++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h      |  1 +
 include/uapi/linux/vhost.h |  8 ++++++
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 719b1784a32b..c32557e279df 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -522,7 +522,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	INIT_LIST_HEAD(&dev->pending_list);
 	spin_lock_init(&dev->iotlb_lock);
 	xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);
-
+	dev->workers_set = false;
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
@@ -704,6 +704,42 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
 	return 0;
 }
 
+static void vhost_propagate_workers(struct vhost_dev *dev, int nworkers)
+{
+	int i, j = 0;
+
+	for (i = nworkers; i < dev->nvqs; i++) {
+		dev->vqs[i]->worker = dev->vqs[j]->worker;
+		if (++j >= nworkers)
+			j = 0;
+	}
+}
+
+
+static int vhost_set_workers(struct vhost_dev *dev, int n)
+{
+	struct vhost_worker *worker;
+	int i, ret = 0;
+
+	if (n > dev->nvqs)
+		n = dev->nvqs;
+
+	dev->workers_set = true;
+
+	for (i = 1; i < n; i++) {
+		worker = vhost_worker_create(dev);
+		if (!worker) {
+			ret = -ENOMEM;
+			break;
+		}
+		dev->vqs[i]->worker = worker;
+	}
+
+	vhost_propagate_workers(dev, i);
+
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -843,6 +879,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
 	vhost_workers_free(dev);
 	vhost_detach_mm(dev);
+	dev->workers_set = false;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
@@ -1896,7 +1933,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 	struct eventfd_ctx *ctx;
 	u64 p;
 	long r;
-	int i, fd;
+	int i, fd, n;
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
@@ -1953,6 +1990,17 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 		if (ctx)
 			eventfd_ctx_put(ctx);
 		break;
+	case VHOST_SET_NWORKERS:
+		r = get_user(n, (int __user *)argp);
+		if (r < 0)
+			break;
+		if (d->workers_set) {
+			r = -EINVAL;
+			break;
+		}
+
+		r = vhost_set_workers(d, n);
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 		break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b40c041f2edb..b32d0ebf16f5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -172,6 +172,7 @@ struct vhost_dev {
 	int byte_weight;
 	struct xarray worker_xa;
 	bool use_worker;
+	bool workers_set;
 	int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 			   struct vhost_iotlb_msg *msg);
 };
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 92e1b700b51c..913b0d752735 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -71,6 +71,14 @@
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
 
+/* Set number of vhost workers.
+ * This can be done only once until reset.
+ * All workers are freed upon reset.
+ * If the value is too big it is silently truncated to the maximum number of
+ * supported vhost workers
+ */
+#define VHOST_SET_NWORKERS _IOW(VHOST_VIRTIO, 0x1F, int)
+
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
 
-- 
2.39.3


