[Devel] [PATCH vz10 v2] vhost-blk: re-validate vq setup against userspace double-fetch

Mon Jun 22 10:16:03 MSK 2026


On 6/19/26 20:57, Konstantin Khorenko wrote:
> vhost_blk_setup() uses the virtqueue index supplied by userspace as an
> array index into blk->vqs[] without bounds checking:
> 
>          if (copy_from_user(&s, argp, sizeof(s)))
>                  return -EFAULT;
>          if (blk->vqs[s.index].req)              /* s.index unchecked */
>                  return 0;
>          blk->vqs[s.index].req = kvmalloc(...);
> 
> blk->vqs[] is a fixed array of VHOST_BLK_VQ_MAX (32) inline
> struct vhost_blk_vq elements embedded in struct vhost_blk.  sizeof()
> of each element is large (it contains iov[UIO_MAXIOV]), so an
> out-of-range s.index points many megabytes outside the allocation:
> reading blk->vqs[s.index].req typically oopses, and if that address
> happens to be mapped and reads as NULL, the following assignment
> performs a wild kernel write -> memory corruption.
> 
> At first glance s.index looks safe because vhost_blk_setup() is only
> reached from the VHOST_SET_VRING_NUM path:
> 
>          ret = vhost_vring_ioctl(&blk->dev, ioctl, argp);
>          if (!ret && ioctl == VHOST_SET_VRING_NUM)
>                  ret = vhost_blk_setup(blk, argp);
> 
> and vhost_vring_ioctl() -> vhost_get_vq_from_user() already validates
> the index (idx >= dev->nvqs -> -ENOBUFS) and even applies
> array_index_nospec().
> 
> The catch is that this is a classic double-fetch (TOCTOU, CWE-367).
> argp is a pointer into the *caller's* address space.  The struct is
> read from it twice:
> 
>    1. vhost_get_vq_from_user() does get_user(idx, idxp) and validates it,
>       and vhost_vring_set_num() copies and validates the same struct to
>       set vq->num;
>    2. vhost_blk_setup() does a *second* copy_from_user() of the same
>       struct and uses s.index and s.num directly.
> 
> blk->dev.mutex is held across both reads, but that mutex only serializes
> other in-kernel ioctls on this fd.  It does not stop a *different thread
> of the same process* (which shares the address space) from overwriting
> the struct between the two reads with a plain userspace store - that
> thread never enters the kernel at all:
> 
>          Thread T1:                          Thread T2 (same process):
>            state->index = 0;
>            ioctl(fd, VHOST_SET_VRING_NUM, state)
>              get_user(index) == 0 -> OK
>                                                state->index = 0x41414141;
>              copy_from_user(&s, argp)
>              s.index == 0x41414141
>              blk->vqs[0x41414141]      <- out of bounds
> 
> So the values validated in step 1 are not necessarily the values used in
> step 2.  Both userspace-controlled fields of the second fetch need care:
> 
>    - s.index: re-validate it against VHOST_BLK_VQ_MAX right where it is
>      used.  The bounds check is a predictable branch, so also clamp the
>      index with array_index_nospec() so a misprediction cannot
>      speculatively index blk->vqs[] out of range - matching what
>      vhost_get_vq_from_user() already does for the first read.
> 
>    - s.num: it is the element count of the per-request array
>      blk->vqs[s.index].req, but that array is later indexed by the
>      descriptor head, which is bounded by vq->num (0 .. vq->num - 1), not
>      by s.num:
> 
>          req = &blk_vq->req[head];       /* in vhost_blk_req_handle() */
> 
>      vq->num was set from vhost_vring_set_num()'s own validated copy.  A
>      raced s.num smaller than vq->num would leave req[] too small and the
>      head index would run off the end; a raced s.num == 0 makes kvmalloc()
>      return ZERO_SIZE_PTR, which passes the !req check and faults on first
>      use.  Size req[] from the validated vq->num instead of from the racy
>      s.num; that is also the correct count, since the array is indexed by
>      head in the ring range.
> 
> Practical exposure is limited: the trigger requires a process that
> already owns the vhost-blk fd (a host-side VMM, not the guest) and that
> wins a narrow race window, so this is local hardening on the
> kernel/userspace boundary rather than a guest-triggerable escape.  It is
> still memory-unsafe and must be fixed.
> 
> Fixes: 40a5928ec730 ("drivers/vhost: vhost-blk accelerator for virtio-blk guests")
> 
> Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

Reviewed-by: Vladimir Riabchun <vladimir.riabchun at virtuozzo.com>

> ---
> Changes since v1 ("vhost-blk: re-validate vq index in vhost_blk_setup()
> (double-fetch)"):
> 
> v2:
>   - Also harden s.index against a speculative out-of-bounds access: after
>     the bounds check, clamp the index with array_index_nospec().  The
>     bounds branch is almost always not-taken, so the CPU may speculatively
>     index blk->vqs[] with an out-of-range s.index before the check
>     resolves; this matches what vhost_get_vq_from_user() already does for
>     the first read.  (Suggested on review.)
> 
>   - Close the s.num half of the same double-fetch, which v1 missed.  v1's
>     commit message claimed s.num "needs no such treatment ... just fails
>     with -ENOMEM"; that is wrong.  s.num is read by the same second
>     copy_from_user() and sizes blk->vqs[].req, which is later indexed by
>     the descriptor head in 0 .. vq->num - 1 (vq->num comes from the first,
>     validated fetch in vhost_vring_set_num()).  A raced s.num < vq->num
>     leaves req[] too small -> out-of-bounds req[head] in
>     vhost_blk_req_handle(); a raced s.num == 0 makes kvmalloc() return
>     ZERO_SIZE_PTR, which passes the !req check and faults on first use.
>     v2 sizes req[] from the validated vq->num and ignores the racy s.num.
> 
>   - Add #include <linux/nospec.h> for array_index_nospec().
> 
>   - Reword the subject ("vq index" -> "vq setup") and the commit message
>     to cover both s.index and s.num.
> 
> No functional change on the legitimate, non-racing path: a correct VMM
> passes s.num == vq->num so req[] keeps the same size, and a valid s.index
> is left unchanged by array_index_nospec().
> 
>   drivers/vhost/blk.c | 31 ++++++++++++++++++++++++++++++-
>   1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
> index c03945ac04234..db10434f3ae18 100644
> --- a/drivers/vhost/blk.c
> +++ b/drivers/vhost/blk.c
> @@ -21,6 +21,7 @@
>   #include <linux/kthread.h>
>   #include <linux/blkdev.h>
>   #include <linux/llist.h>
> +#include <linux/nospec.h>
>   
>   #include "vhost.h"
>   
> @@ -860,14 +861,42 @@ static long vhost_blk_reset_owner(struct vhost_blk *blk)
>   static int vhost_blk_setup(struct vhost_blk *blk, void __user *argp)
>   {
>   	struct vhost_vring_state s;
> +	struct vhost_virtqueue *vq;
>   
>   	if (copy_from_user(&s, argp, sizeof(s)))
>   		return -EFAULT;
>   
> +	/*
> +	 * s.index was validated by vhost_vring_ioctl() before we got here,
> +	 * but that was a separate read of the same userspace memory.  This is
> +	 * a fresh copy_from_user(), so re-validate it before using it as an
> +	 * array index (see the commit message for the double-fetch race).
> +	 */
> +	if (s.index >= VHOST_BLK_VQ_MAX)
> +		return -ENOBUFS;
> +	/*
> +	 * The bounds check above is a predictable branch, so the CPU may
> +	 * speculatively index blk->vqs[] with an out-of-range s.index before
> +	 * it resolves.  Clamp the index for speculation, as
> +	 * vhost_get_vq_from_user() already does for the first read.
> +	 */
> +	s.index = array_index_nospec(s.index, VHOST_BLK_VQ_MAX);
> +	vq = &blk->vqs[s.index].vq;
> +
>   	if (blk->vqs[s.index].req)
>   		return 0;
>   
> -	blk->vqs[s.index].req = kvmalloc(sizeof(struct vhost_blk_req) * s.num, GFP_KERNEL);
> +	/*
> +	 * Size the request array from vq->num, the ring size the kernel
> +	 * validated and stored in vhost_vring_set_num().  s.num comes from the
> +	 * same second copy_from_user() as s.index and is equally racy: a
> +	 * concurrent userspace store could shrink it below vq->num, leaving
> +	 * req[] too small for the head indices (0 .. vq->num - 1) later used
> +	 * in vhost_blk_req_handle().  vq->num is always non-zero and a power
> +	 * of two here, so it is the safe and correct count.
> +	 */
> +	blk->vqs[s.index].req = kvmalloc(sizeof(struct vhost_blk_req) * vq->num,
> +					GFP_KERNEL);
>   	if (!blk->vqs[s.index].req)
>   		return -ENOMEM;
>   

-- 
Best regards, Riabchun Vladimir
Linux Kernel Developer, Virtuozzo