[Devel] [PATCH RH9 v5 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests

Mon Nov 14 10:41:34 MSK 2022

Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

On 11.11.2022 12:55, Andrey Zhadchenko wrote:
> Although QEMU virtio-blk is quite fast, there is still some room for
> improvement. Disk latency can be reduced if we handle virtio-blk requests
> in the host kernel, which avoids a lot of syscalls and context switches.
> The idea is quite simple: QEMU gives us a block device and we translate
> any incoming virtio requests into bios and push them into the bdev.
> The biggest disadvantage of this vhost-blk flavor is that it handles only
> the raw format.
> Luckily Kirill Thai proposed a device mapper driver for the QCOW2 format to
> attach files as block devices:
> https://www.spinics.net/lists/kernel/msg4292965.html
> 
> Also, by using kernel modules we can bypass the iothread limitation and
> finally scale block requests with CPUs for high-performance devices.
> 
> 
> There have already been several attempts to write vhost-blk:
> 
> Asias' version: https://lkml.org/lkml/2012/12/1/174
> Badari's version: https://lwn.net/Articles/379864/
> Vitaly's version: https://lwn.net/Articles/770965/
> 
> The main difference between them is the API used to access the backend
> file. The fastest one is Asias' version with the bio flavor. It is also the
> most reviewed and has the most features, so the vhost_blk module is
> partially based on it. Multiple virtqueue support was added, some places
> were reworked, and support for several vhost workers was added.
> 
> Test setup and results:
> fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
> QEMU drive options: cache=none
> filesystem: xfs
> 
> SSD:
>                | randread, IOPS | randwrite, IOPS |
> Host           |     95.8k      |      85.3k      |
> QEMU virtio    |     57.5k      |      79.4k      |
> QEMU vhost-blk |     95.6k      |      84.3k      |
> 
> RAMDISK (vq == vcpu == numjobs):
>                  | randread, IOPS | randwrite, IOPS |
> virtio, 1vcpu    |      133k      |      133k       |
> virtio, 2vcpu    |      305k      |      306k       |
> virtio, 4vcpu    |      310k      |      298k       |
> virtio, 8vcpu    |      271k      |      252k       |
> vhost-blk, 1vcpu |      110k      |      113k       |
> vhost-blk, 2vcpu |      247k      |      252k       |
> vhost-blk, 4vcpu |      558k      |      556k       |
> vhost-blk, 8vcpu |      576k      |      575k       | *single kernel thread
> vhost-blk, 8vcpu |      803k      |      779k       | *two kernel threads
> 
> v2:
> patch 1/10
>   - removed unused VHOST_BLK_VQ
>   - reworked bio handling a bit: now add all pages from a single iov into
> a single bio instead of allocating one bio per page
>   - changed how the sector increment is calculated
>   - check move_iovec() in vhost_blk_req_handle()
>   - removed the snprintf check and improved the check of the copy_to_iter
> return value for VIRTIO_BLK_ID_BYTES requests
>   - discard the vq request if vhost_blk_req_handle() returned a negative code
>   - forbid changing a nonzero backend in vhost_blk_set_backend(). First of
> all, QEMU sets the backend only once. Also, if we want to change the backend
> while requests are already running, we need to be much more careful in
> vhost_blk_handle_guest_kick() as it is not taking any references. If
> userspace wants to change the backend that badly, it can always reset the
> device.
>   - removed EXPERIMENTAL from Kconfig
> 
> patch 3/10
>   - don't bother with checking dev->workers[0].worker since dev->nworkers
> will always contain 0 in this case
> 
> patch 6/10
>   - Make the code do what the docs suggest. Previously the ioctl-supplied
> number of workers was treated as an amount to be added. Use the new
> number as a ceiling instead and add workers up to that number.
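
The difference between the two semantics described for patch 6/10 can be modeled in a few lines of userspace C. This is only an illustrative sketch, not code from the series: the function names are hypothetical, the worker count is a plain integer, and any upper limit the real ioctl enforces is left out.

```c
#include <assert.h>

/* Hypothetical model of the old semantics: the ioctl value is added
 * to the number of workers that already exist. */
int workers_after_add(int current, int requested)
{
    assert(current >= 0 && requested >= 0);
    return current + requested;
}

/* Hypothetical model of the new semantics: the ioctl value is a
 * ceiling. Workers are only added until the total reaches it, and a
 * smaller value never removes existing workers. */
int workers_after_ceiling(int current, int requested)
{
    assert(current >= 0 && requested >= 0);
    return requested > current ? requested : current;
}
```

With one existing worker and a request for 4, the additive model ends with 5 workers while the ceiling model ends with 4. Repeating the same call is also idempotent under the ceiling model.
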
> 
> 
> v3:
> patch 1/10
>   - reworked bio handling a bit: now create a new bio only if the
> previous one is full
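
To illustrate why the bio packing change matters, here is a small userspace model that counts how many bios a request needs under the original one-bio-per-page scheme versus the "new bio only when the previous one is full" scheme. It is a sketch under stated assumptions, not the patch's code: the function names and the bio_capacity parameter are hypothetical, and it assumes all pages of a request are contiguous on disk so they may share a bio.

```c
#include <assert.h>
#include <stddef.h>

/* One bio per page: the bio count equals the total page count. */
unsigned int bios_one_per_page(const unsigned int *iov_pages, size_t n_iov)
{
    unsigned int bios = 0;

    for (size_t i = 0; i < n_iov; i++)
        bios += iov_pages[i];
    return bios;
}

/* Keep filling the current bio across iovs and start a new one only
 * when the current bio has no room left. bio_capacity is the assumed
 * maximum number of pages per bio. */
unsigned int bios_fill_first(const unsigned int *iov_pages, size_t n_iov,
                             unsigned int bio_capacity)
{
    unsigned int bios = 0;
    unsigned int room = 0; /* free page slots in the current bio */

    assert(bio_capacity > 0);
    for (size_t i = 0; i < n_iov; i++) {
        for (unsigned int p = 0; p < iov_pages[i]; p++) {
            if (room == 0) {
                bios++;
                room = bio_capacity;
            }
            room--;
        }
    }
    return bios;
}
```

For a request with iovs of 3, 300 and 1 pages and an assumed capacity of 256 pages per bio, the first scheme needs 304 bios and the second needs 2.
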
> 
> patch 2/10
>   - set vq->worker = NULL in vhost_vq_reset()
> 
> 
> v4:
> patch 1/10
>   - vhost_blk_req_done() now won't hide errors for multi-bio requests
>   - vhost_blk_prepare_req() now better estimates bio_len
>   - alloc bio for max pages_nr_total pages instead of nr_pages
>   - added new ioctl VHOST_BLK_SET_SERIAL to set serial
>   - reworked the flush algorithm a bit: now use two bins, "new req" and
> "for flush", and swap them at the start of the flush
>   - moved the backing file dereference to vhost_blk_req_submit(), after
> the request has been added to the flush bin, to avoid a race in
> vhost_blk_release(). Now even if we dropped the backend and started a
> flush, the request will either be tracked by the flush or be rolled back
> patch 2/10
>   - moved vq->worker = NULL to patch #7 where this field is
> introduced.
> 
> patch 7/10
>   - Set vq->worker = NULL in vhost_vq_reset. This will fix both
> https://jira.sw.ru/browse/PSBM-142058
> https://jira.sw.ru/browse/PSBM-142852
> 
> v5:
> patch 1/10
>   - several code style/spacing fixes
>   - added WARN_ON() for vhost_blk_flush
> 
> Andrey Zhadchenko (10):
>    drivers/vhost: vhost-blk accelerator for virtio-blk guests
>    drivers/vhost: use array to store workers
>    drivers/vhost: adjust vhost to flush all workers
>    drivers/vhost: rework attaching cgroups to be worker aware
>    drivers/vhost: rework worker creation
>    drivers/vhost: add ioctl to increase the number of workers
>    drivers/vhost: assign workers to virtqueues
>    drivers/vhost: add API to queue work at virtqueue worker
>    drivers/vhost: allow polls to be bound to workers via vqs
>    drivers/vhost: queue vhost_blk works at vq workers
> 
>   drivers/vhost/Kconfig      |  12 +
>   drivers/vhost/Makefile     |   3 +
>   drivers/vhost/blk.c        | 860 +++++++++++++++++++++++++++++++++++++
>   drivers/vhost/vhost.c      | 253 ++++++++---
>   drivers/vhost/vhost.h      |  21 +-
>   include/uapi/linux/vhost.h |  17 +
>   6 files changed, 1104 insertions(+), 62 deletions(-)
>   create mode 100644 drivers/vhost/blk.c
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.