[Devel] [PATCH vz10 v2] vhost-blk: re-validate vq setup against userspace double-fetch

Thu Jun 25 17:58:34 MSK 2026

On 6/19/26 20:57, Konstantin Khorenko wrote:
> vhost_blk_setup() uses the virtqueue index supplied by userspace as an
> array index into blk->vqs[] without bounds checking:
> 
>         if (copy_from_user(&s, argp, sizeof(s)))
>                 return -EFAULT;
>         if (blk->vqs[s.index].req)              /* s.index unchecked */
>                 return 0;
>         blk->vqs[s.index].req = kvmalloc(...);
> 
> blk->vqs[] is a fixed array of VHOST_BLK_VQ_MAX (32) inline
> struct vhost_blk_vq elements embedded in struct vhost_blk.  sizeof()
> of each element is large (it contains iov[UIO_MAXIOV]), so an
> out-of-range s.index points many megabytes outside the allocation:
> reading blk->vqs[s.index].req typically oopses, and if that address
> happens to be mapped and reads as NULL, the following assignment
> performs a wild kernel write -> memory corruption.
> 
> At first glance s.index looks safe because vhost_blk_setup() is only
> reached from the VHOST_SET_VRING_NUM path:
> 
>         ret = vhost_vring_ioctl(&blk->dev, ioctl, argp);
>         if (!ret && ioctl == VHOST_SET_VRING_NUM)
>                 ret = vhost_blk_setup(blk, argp);
> 
> and vhost_vring_ioctl() -> vhost_get_vq_from_user() already validates
> the index (idx >= dev->nvqs -> -ENOBUFS) and even applies
> array_index_nospec().
> 
> The catch is that this is a classic double-fetch (TOCTOU, CWE-367).
> argp is a pointer into the *caller's* address space.  The struct is
> read from it twice:
> 
>   1. vhost_get_vq_from_user() does get_user(idx, idxp) and validates it,
>      and vhost_vring_set_num() copies and validates the same struct to
>      set vq->num;
>   2. vhost_blk_setup() does a *second* copy_from_user() of the same
>      struct and uses s.index and s.num directly.
> 
> blk->dev.mutex is held across both reads, but that mutex only serializes
> other in-kernel ioctls on this fd.  It does not stop a *different thread
> of the same process* (which shares the address space) from overwriting
> the struct between the two reads with a plain userspace store - that
> thread never enters the kernel at all:
> 
>         Thread T1:                          Thread T2 (same process):
>           state->index = 0;
>           ioctl(fd, VHOST_SET_VRING_NUM, state)
>             get_user(index) == 0 -> OK
>                                               state->index = 0x41414141;
>             copy_from_user(&s, argp)
>             s.index == 0x41414141
>             blk->vqs[0x41414141]      <- out of bounds
> 
> So the values validated in step 1 are not necessarily the values used in
> step 2.  Both userspace-controlled fields of the second fetch need care:
> 
>   - s.index: re-validate it against VHOST_BLK_VQ_MAX right where it is
>     used.  The bounds check is a predictable branch, so also clamp the
>     index with array_index_nospec() so a misprediction cannot
>     speculatively index blk->vqs[] out of range - matching what
>     vhost_get_vq_from_user() already does for the first read.
> 
>   - s.num: it is the element count of the per-request array
>     blk->vqs[s.index].req, but that array is later indexed by the
>     descriptor head, which is bounded by vq->num (0 .. vq->num - 1), not
>     by s.num:
> 
>         req = &blk_vq->req[head];       /* in vhost_blk_req_handle() */
> 
>     vq->num was set from vhost_vring_set_num()'s own validated copy.  A
>     raced s.num smaller than vq->num would leave req[] too small and the
>     head index would run off the end; a raced s.num == 0 makes kvmalloc()
>     return ZERO_SIZE_PTR, which passes the !req check and faults on first
>     use.  Size req[] from the validated vq->num instead of from the racy
>     s.num; that is also the correct count, since the array is indexed by
>     head in the ring range.
> 
> Practical exposure is limited: the trigger requires a process that
> already owns the vhost-blk fd (a host-side VMM, not the guest) and that
> wins a narrow race window, so this is local hardening on the
> kernel/userspace boundary rather than a guest-triggerable escape.  It is
> still memory-unsafe and must be fixed.

But, if we still read index twice it means that index in vhost_get_vq_from_user
and vhost_blk_setup can still technically be different and this inconsistency
can still lead to problems in ioctl which does not expect this change and maybe
even exploit and break somewhere else.

p.s.: AI review based on this prompt:

You were right to push on this. The patch fixes the out-of-bounds symptom of 
the double-fetch but not the inconsistency the double-fetch creates:

- vhost_blk_setup() re-reads the index from a fresh copy_from_user(), so it can 
  still land on a different vq (B) than the one vhost_vring_ioctl() validated
  and set vq->num for (A).
- The result is an incoherent state: vqs[A] has num set but req == NULL; req[]
  got allocated for the unrelated vqs[B].
- vhost_blk_req_handle() does req = &blk_vq->req[head] with no NULL guard, and 
  neither the guest-kick handler nor set_backend checks req, so the next guest
  request on vq A dereferences NULL.

So the proper fix is to eliminate the second fetch — pass the index 
vhost_get_vq_from_user() already validated into vhost_blk_setup() — rather than
independently re-validating a second read.

Two honest qualifiers I kept in the writeup: (1) this consistency gap is 
pre-existing, not introduced by this commit — the patch is still a strict
improvement (it removes the far-OOB wild write); and (2) it shares the same
threat model the commit acknowledges (requires the host VMM to race itself),
and after the patch it degrades to a NULL-area deref rather than a controllable
wild write.

> 
> Fixes: 40a5928ec730 ("drivers/vhost: vhost-blk accelerator for virtio-blk guests")
> 
> Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
> ---
> Changes since v1 ("vhost-blk: re-validate vq index in vhost_blk_setup()
> (double-fetch)"):
> 
> v2:
>  - Also harden s.index against a speculative out-of-bounds access: after
>    the bounds check, clamp the index with array_index_nospec().  The
>    bounds branch is almost always not-taken, so the CPU may speculatively
>    index blk->vqs[] with an out-of-range s.index before the check
>    resolves; this matches what vhost_get_vq_from_user() already does for
>    the first read.  (Suggested on review.)
> 
>  - Close the s.num half of the same double-fetch, which v1 missed.  v1's
>    commit message claimed s.num "needs no such treatment ... just fails
>    with -ENOMEM"; that is wrong.  s.num is read by the same second
>    copy_from_user() and sizes blk->vqs[].req, which is later indexed by
>    the descriptor head in 0 .. vq->num - 1 (vq->num comes from the first,
>    validated fetch in vhost_vring_set_num()).  A raced s.num < vq->num
>    leaves req[] too small -> out-of-bounds req[head] in
>    vhost_blk_req_handle(); a raced s.num == 0 makes kvmalloc() return
>    ZERO_SIZE_PTR, which passes the !req check and faults on first use.
>    v2 sizes req[] from the validated vq->num and ignores the racy s.num.
> 
>  - Add #include <linux/nospec.h> for array_index_nospec().
> 
>  - Reword the subject ("vq index" -> "vq setup") and the commit message
>    to cover both s.index and s.num.
> 
> No functional change on the legitimate, non-racing path: a correct VMM
> passes s.num == vq->num so req[] keeps the same size, and a valid s.index
> is left unchanged by array_index_nospec().
> 
>  drivers/vhost/blk.c | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
> index c03945ac04234..db10434f3ae18 100644
> --- a/drivers/vhost/blk.c
> +++ b/drivers/vhost/blk.c
> @@ -21,6 +21,7 @@
>  #include <linux/kthread.h>
>  #include <linux/blkdev.h>
>  #include <linux/llist.h>
> +#include <linux/nospec.h>
>  
>  #include "vhost.h"
>  
> @@ -860,14 +861,42 @@ static long vhost_blk_reset_owner(struct vhost_blk *blk)
>  static int vhost_blk_setup(struct vhost_blk *blk, void __user *argp)
>  {
>  	struct vhost_vring_state s;
> +	struct vhost_virtqueue *vq;
>  
>  	if (copy_from_user(&s, argp, sizeof(s)))
>  		return -EFAULT;
>  
> +	/*
> +	 * s.index was validated by vhost_vring_ioctl() before we got here,
> +	 * but that was a separate read of the same userspace memory.  This is
> +	 * a fresh copy_from_user(), so re-validate it before using it as an
> +	 * array index (see the commit message for the double-fetch race).
> +	 */
> +	if (s.index >= VHOST_BLK_VQ_MAX)
> +		return -ENOBUFS;
> +	/*
> +	 * The bounds check above is a predictable branch, so the CPU may
> +	 * speculatively index blk->vqs[] with an out-of-range s.index before
> +	 * it resolves.  Clamp the index for speculation, as
> +	 * vhost_get_vq_from_user() already does for the first read.
> +	 */
> +	s.index = array_index_nospec(s.index, VHOST_BLK_VQ_MAX);
> +	vq = &blk->vqs[s.index].vq;
> +
>  	if (blk->vqs[s.index].req)
>  		return 0;
>  
> -	blk->vqs[s.index].req = kvmalloc(sizeof(struct vhost_blk_req) * s.num, GFP_KERNEL);
> +	/*
> +	 * Size the request array from vq->num, the ring size the kernel
> +	 * validated and stored in vhost_vring_set_num().  s.num comes from the
> +	 * same second copy_from_user() as s.index and is equally racy: a
> +	 * concurrent userspace store could shrink it below vq->num, leaving
> +	 * req[] too small for the head indices (0 .. vq->num - 1) later used
> +	 * in vhost_blk_req_handle().  vq->num is always non-zero and a power
> +	 * of two here, so it is the safe and correct count.
> +	 */
> +	blk->vqs[s.index].req = kvmalloc(sizeof(struct vhost_blk_req) * vq->num,
> +					GFP_KERNEL);
>  	if (!blk->vqs[s.index].req)
>  		return -ENOMEM;
>  

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.