[Devel] [PATCH vz10 0/7] per-VE ve.proc_permissions and sysfs permission fixes

Mirian Shilakadze mirian.shilakadze at virtuozzo.com
Sun Jun 28 12:25:58 MSK 2026


This series adds ve.proc_permissions, the procfs counterpart of
ve.sysfs_permissions. It is a per-VE allowlist of /proc paths that the host
exposes to a container, keyed per VE so the single shared proc tree gives
per-VE answers.

The motivation is GPU support in containers. A containerized GPU workload
needs a few host /proc files visible (the nvidia entries it probes), and
ve.proc_permissions exposes them through a generic per-VE allowlist rather
than an nvidia specific passthrough.

While implementing it I found several pre-existing defects in the shared
sysfs and kernfs per-VE permission path, so the series is fix-first. Reading
ve.sysfs_permissions under load already panicked the host on a stock kernel
(NULL deref in kmapset_lookup), which the early patches fix before the procfs
work builds on the same code. Of these defects, only the ve_perms_map
use-after-free in patch 5 (the __rcu annotation) was found by code analysis.
The rest surfaced through testing: the NULL deref and the wrong rwsem from the
runtime crash and lockdep, and the rcu-list walks from PROVE_RCU_LIST.

Layout:
  1: lib/kmapset annotates the kmapset_lookup rcu-list walk with the held
     lock, so CONFIG_PROVE_RCU_LIST can verify it.
  2 to 5: fix the kernfs seq read and the VFS readers. In patch order, they
     skip a NULL map, lock the tree that is actually walked, take
     rcu_read_lock around the kmapset lookup, and mark ve_perms_map __rcu to
     close a use-after-free.
  6: factors the filesystem agnostic core into fs/ve_perms.c with no
     functional change beyond an rcu_assign_pointer publish.
  7: adds the procfs feature on top.
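
For readers less familiar with the pattern behind patches 2, 5 and 6, here is
a rough userspace analogy, not code from the series. It uses C11 atomics to
stand in for rcu_assign_pointer (store-release) and rcu_dereference (modelled
here with a stronger load-acquire), and shows a reader that tolerates a NULL
map. The names struct perms_map, struct node, node_publish and node_perms are
hypothetical. The sketch does not model rcu_read_lock or grace periods, so it
says nothing about when an old map may be freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-node permission map. */
struct perms_map {
	int mask;
};

/* Hypothetical node; the atomic pointer plays the role of an __rcu pointer. */
struct node {
	_Atomic(struct perms_map *) map;
};

void node_init(struct node *n)
{
	atomic_init(&n->map, NULL);
}

/*
 * Writer side: initialise the map completely, then publish it with a
 * store-release so a reader that sees the pointer also sees the contents.
 */
void node_publish(struct node *n, struct perms_map *m, int mask)
{
	m->mask = mask;
	atomic_store_explicit(&n->map, m, memory_order_release);
}

/*
 * Reader side: load the pointer exactly once and skip a NULL map instead
 * of dereferencing it. Returns 0 (no permissions) when nothing is published.
 */
int node_perms(struct node *n)
{
	struct perms_map *m =
		atomic_load_explicit(&n->map, memory_order_acquire);

	if (!m)
		return 0;
	return m->mask;
}

/* Small self-check exercising both the NULL and the published case. */
void node_selftest(void)
{
	struct node n;
	struct perms_map m;

	node_init(&n);
	assert(node_perms(&n) == 0);	/* nothing published yet */
	node_publish(&n, &m, 4);
	assert(node_perms(&n) == 4);	/* reader sees the published mask */
}
```

Calling node_selftest() from a trivial main() is enough to exercise it. In the
kernel the reader would additionally sit inside rcu_read_lock and the writer
would defer freeing the old map until readers are done.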

Testing: Built and booted a debug kernel with KASAN, kmemleak, lockdep and
PROVE_RCU_LIST. Ran concurrent reader, writer and teardown stress on both
ve.sysfs_permissions and ve.proc_permissions, including in-container /proc
and sysfs access and container start and stop. The original NULL deref
reproduces on a stock kernel and no longer crashes with this series. No KASAN
use-after-free, no kmemleak leak, and no rcu-list or lockdep splat in the
ve_perms paths. gcov line coverage of the four touched files reached 93 to
99 percent (fs/kernfs/ve.c 99, fs/proc/ve.c 97, fs/ve_perms.c 95,
lib/kmapset.c 93), the remainder being inlined fortify checks, error paths
and boot-only init paths. Per-VE correctness was checked separately on both
filesystems: a path becomes visible and readable inside a container only
after it is added to that VE's allowlist, access is revoked when it is
removed, the host is unaffected, and the entry never leaks to another VE.
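
The per-VE properties listed above can be written down as a small userspace
model. This is only an illustration of what was checked, not the test code
that was run and not the kernel data structure (the series uses kmapset, not
a flat array). struct ve_model, ve_allow, ve_revoke, path_visible and the
path "driver/example" are all hypothetical, and treating a NULL VE as the
host is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATHS 8

/* Hypothetical model of one VE's allowlist. */
struct ve_model {
	const char *paths[MAX_PATHS];
	size_t count;
};

/* Add a path to this VE's allowlist; fails only when the table is full. */
bool ve_allow(struct ve_model *ve, const char *path)
{
	if (ve->count == MAX_PATHS)
		return false;
	ve->paths[ve->count++] = path;
	return true;
}

/* Remove a path by swapping the last entry into its slot. */
void ve_revoke(struct ve_model *ve, const char *path)
{
	for (size_t i = 0; i < ve->count; i++) {
		if (strcmp(ve->paths[i], path) == 0) {
			ve->paths[i] = ve->paths[--ve->count];
			return;
		}
	}
}

/* Model assumption: ve == NULL is the host, which is never filtered. */
bool path_visible(const struct ve_model *ve, const char *path)
{
	if (!ve)
		return true;
	for (size_t i = 0; i < ve->count; i++) {
		if (strcmp(ve->paths[i], path) == 0)
			return true;
	}
	return false;
}

/* Walks through the four properties in the order they are listed. */
void ve_model_selftest(void)
{
	struct ve_model a = {0}, b = {0};
	const char *p = "driver/example";	/* hypothetical path */

	assert(!path_visible(&a, p));	/* hidden before it is allowed */
	assert(ve_allow(&a, p));
	assert(path_visible(&a, p));	/* visible once allowed */
	assert(!path_visible(&b, p));	/* never leaks to another VE */
	assert(path_visible(NULL, p));	/* host unaffected */
	ve_revoke(&a, p);
	assert(!path_visible(&a, p));	/* revoked on removal */
	assert(path_visible(NULL, p));	/* host still unaffected */
}
```

Calling ve_model_selftest() from a trivial main() runs all of the checks.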

Mirian Shilakadze (7):
  lib/kmapset: annotate the kmapset_lookup rcu-list walk with the held
    lock
  fs/kernfs, ve: skip NULL ve_perms_map in kernfs_perms_shown
  fs/kernfs, ve: lock the walked tree rwsem in kernfs_perms_start
  fs/kernfs, ve: take rcu_read_lock around the ve_perms kmapset lookup
  fs/kernfs, ve: fix ve_perms_map use-after-free, annotate it __rcu
  fs: factor per-VE permission core into ve_perms helpers
  fs/proc, ve: add per-VE ve.proc_permissions

 fs/Makefile               |   1 +
 fs/kernfs/ve.c            | 167 ++++++++----------
 fs/proc/Makefile          |   1 +
 fs/proc/generic.c         |  48 +++++-
 fs/proc/inode.c           |   2 +
 fs/proc/internal.h        |  25 +++
 fs/proc/root.c            |   1 +
 fs/proc/ve.c              | 345 ++++++++++++++++++++++++++++++++++++++
 fs/sysfs/ve.c             |   2 +-
 fs/ve_perms.c             | 136 +++++++++++++++
 include/linux/kernfs-ve.h |   2 +-
 include/linux/kernfs.h    |   2 +-
 include/linux/ve-perms.h  |  28 ++++
 include/linux/ve.h        |   1 +
 kernel/ve/ve.c            |   7 +
 lib/kmapset.c             |   3 +-
 16 files changed, 665 insertions(+), 106 deletions(-)
 create mode 100644 fs/proc/ve.c
 create mode 100644 fs/ve_perms.c
 create mode 100644 include/linux/ve-perms.h

--
2.43.0


