[Devel] [PATCH vz10 0/7] per-VE ve.proc_permissions and sysfs permission fixes

Vladimir Riabchun vladimir.riabchun at virtuozzo.com
Mon Jun 29 17:21:57 MSK 2026



On 6/28/26 11:25, Mirian Shilakadze wrote:
> This series adds ve.proc_permissions, the procfs counterpart of
> ve.sysfs_permissions. It is a per-VE allowlist of /proc paths that the host
> exposes to a container, keyed per VE so the single shared proc tree gives
> per-VE answers.
> 
> The motivation is GPU support in containers. A containerized GPU workload
> needs a few host /proc files visible (the nvidia entries it probes), and
> ve.proc_permissions exposes them through a generic per-VE allowlist rather
> than an nvidia-specific passthrough.
> 
> While implementing it I found several pre-existing defects in the shared
> sysfs and kernfs per-VE permission path, so the series is fix-first. Reading
> ve.sysfs_permissions under load already panicked the host on a stock kernel
> (NULL deref in kmapset_lookup), which the early patches fix before the procfs
> work builds on the same code. Of these defects, the ve_perms_map
> use-after-free in patch 5 (the __rcu annotation) was found by code analysis.
> The rest surfaced through testing: the NULL deref and the wrong rwsem from the
> runtime crash and lockdep, and the rcu-list walks from PROVE_RCU_LIST.
> 
> Layout:
>    1: lib/kmapset annotates the kmapset_lookup rcu-list walk so it is honest
>       under CONFIG_PROVE_RCU_LIST.
>    2 to 5: fix the kernfs seq read and the VFS readers, skip a NULL map, lock
>       the tree that is actually walked, take rcu_read_lock around the kmapset
>       lookup, and mark ve_perms_map __rcu to close a use-after-free.
>    6: factors the filesystem-agnostic core into fs/ve_perms.c with no
>       functional change beyond an rcu_assign_pointer publish.
>    7: adds the procfs feature on top.
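
A note on the "rcu_assign_pointer publish" in patch 6, for readers following
the layout: the kernel's rcu_assign_pointer() stores the pointer with release
ordering, so a reader that observes the new pointer also observes the fully
initialised object behind it. Below is a minimal userspace sketch of that idea
with C11 atomics. The names struct model_map, model_published, model_publish()
and model_read_nr_keys() are hypothetical and are not taken from the series,
and the acquire load is only a stand-in for rcu_dereference(). Grace periods
and freeing are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a permission map; not the kernel's type. */
struct model_map {
	int nr_keys;
};

/* Shared pointer that readers pick up; starts out NULL (nothing published). */
static _Atomic(struct model_map *) model_published;

/*
 * Writer side: initialise the object first, then publish the pointer with
 * release ordering so the initialisation cannot be reordered after the store.
 */
static void model_publish(struct model_map *map, int nr_keys)
{
	map->nr_keys = nr_keys;
	atomic_store_explicit(&model_published, map, memory_order_release);
}

/*
 * Reader side: load the pointer with acquire ordering and skip a NULL map.
 * Returns -1 when nothing has been published yet.
 */
static int model_read_nr_keys(void)
{
	struct model_map *map =
		atomic_load_explicit(&model_published, memory_order_acquire);

	return map ? map->nr_keys : -1;
}
```

Before any publish the reader sees NULL and returns -1. After
model_publish(&m, 3) it returns 3.
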
> 
> Testing: Built and booted a debug kernel with KASAN, kmemleak, lockdep and
> PROVE_RCU_LIST. Ran concurrent reader, writer and teardown stress on both
> ve.sysfs_permissions and ve.proc_permissions, including in-container /proc
> and sysfs access and container start and stop. The original NULL deref
> reproduces on a stock kernel and no longer crashes with this series. No KASAN
> use-after-free, no kmemleak leak, and no rcu-list or lockdep splat in the
> ve_perms paths. gcov line coverage of the four touched files reached 93 to
> 99 percent (fs/kernfs/ve.c 99, fs/proc/ve.c 97, fs/ve_perms.c 95,
> lib/kmapset.c 93), the remainder being inlined fortify checks, error paths
> and boot-only init paths. Per-VE correctness was checked separately on both
> filesystems: a path becomes visible and readable inside a container only
> after it is added to that VE's allowlist, access is revoked when it is
> removed, the host is unaffected, and the entry never leaks to another VE.
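
The per-VE behaviour listed in the testing notes can be restated as a small
userspace model. This is only a sketch under simplifying assumptions: it
covers just the paths gated by the allowlist, it uses a fixed-size array
instead of a kmapset, and every name in it (struct model_ve, model_allow(),
model_revoke(), model_visible()) is hypothetical rather than taken from the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_PATHS 8

/* Hypothetical per-VE state; the host is modelled as seeing everything. */
struct model_ve {
	bool is_host;
	const char *allowed[MODEL_MAX_PATHS];
	size_t nr_allowed;
};

/* Add a path to this VE's allowlist; returns false when the list is full. */
static bool model_allow(struct model_ve *ve, const char *path)
{
	if (ve->nr_allowed == MODEL_MAX_PATHS)
		return false;
	ve->allowed[ve->nr_allowed++] = path;
	return true;
}

/* Remove a path from this VE's allowlist, if present. */
static void model_revoke(struct model_ve *ve, const char *path)
{
	for (size_t i = 0; i < ve->nr_allowed; i++) {
		if (strcmp(ve->allowed[i], path) == 0) {
			/* Fill the hole with the last element. */
			ve->allowed[i] = ve->allowed[--ve->nr_allowed];
			return;
		}
	}
}

/* A gated path is visible on the host, or in a VE that allowlists it. */
static bool model_visible(const struct model_ve *ve, const char *path)
{
	if (ve->is_host)
		return true;
	for (size_t i = 0; i < ve->nr_allowed; i++) {
		if (strcmp(ve->allowed[i], path) == 0)
			return true;
	}
	return false;
}
```

In this model, allowing a path in one container makes it visible there only.
A second container with an empty allowlist still does not see it, the host
always does, and revoking the path hides it again.
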

Just one brief question: it seems to me that /proc can have a very large
number of entries.

In proc_perms_start we take proc_subdir_lock, which puts us in atomic context.
For each entry we then call kmapset_lookup, which iterates over all added keys
(if I understood it correctly). So in general we do a large number of kmapset
operations for each entry.

Won't this cause performance problems if we have, for example, a huge number
of processes and several CTs? Such long atomic sections could lead to lockups.
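
To make the concern concrete, here is a small userspace model of the cost. It
assumes, as in the reading above, that each lookup scans the added keys
linearly, so a walk over E entries with K keys does up to E * K comparisons
while the lock is held. The types and functions (struct model_key, struct
model_map, model_lookup(), model_walk()) are hypothetical stand-ins, not the
kernel's kmapset or procfs code, and no real locking is done.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; not the kernel's kmapset types. */
struct model_key {
	const char *path;
};

struct model_map {
	const struct model_key *keys;
	size_t nr_keys;
};

/*
 * Linear lookup over all keys. Returns 1 on a hit, 0 on a miss, and adds
 * the number of comparisons performed to *steps.
 */
static int model_lookup(const struct model_map *map, const char *path,
			size_t *steps)
{
	for (size_t i = 0; i < map->nr_keys; i++) {
		(*steps)++;
		if (strcmp(map->keys[i].path, path) == 0)
			return 1;
	}
	return 0;
}

/*
 * Walk every entry as one uninterrupted section, the way a single
 * lock-held iteration would. Returns the total number of comparisons and
 * stores the number of matching entries in *shown.
 */
static size_t model_walk(const struct model_map *map,
			 const char *const *entries, size_t nr_entries,
			 size_t *shown)
{
	size_t steps = 0;

	*shown = 0;
	for (size_t i = 0; i < nr_entries; i++)
		*shown += (size_t)model_lookup(map, entries[i], &steps);
	return steps;
}
```

With 2 keys and 3 entries of which one matches the second key, the walk does
6 comparisons and shows 1 entry. The same arithmetic with thousands of entries
and a long key list is what the question is about.
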

Other than that the code looks good; I didn't find any obvious bugs.

> 
> Mirian Shilakadze (7):
>    lib/kmapset: annotate the kmapset_lookup rcu-list walk with the held
>      lock
>    fs/kernfs, ve: skip NULL ve_perms_map in kernfs_perms_shown
>    fs/kernfs, ve: lock the walked tree rwsem in kernfs_perms_start
>    fs/kernfs, ve: take rcu_read_lock around the ve_perms kmapset lookup
>    fs/kernfs, ve: fix ve_perms_map use-after-free, annotate it __rcu
>    fs: factor per-VE permission core into ve_perms helpers
>    fs/proc, ve: add per-VE ve.proc_permissions
> 
>   fs/Makefile               |   1 +
>   fs/kernfs/ve.c            | 167 ++++++++----------
>   fs/proc/Makefile          |   1 +
>   fs/proc/generic.c         |  48 +++++-
>   fs/proc/inode.c           |   2 +
>   fs/proc/internal.h        |  25 +++
>   fs/proc/root.c            |   1 +
>   fs/proc/ve.c              | 345 ++++++++++++++++++++++++++++++++++++++
>   fs/sysfs/ve.c             |   2 +-
>   fs/ve_perms.c             | 136 +++++++++++++++
>   include/linux/kernfs-ve.h |   2 +-
>   include/linux/kernfs.h    |   2 +-
>   include/linux/ve-perms.h  |  28 ++++
>   include/linux/ve.h        |   1 +
>   kernel/ve/ve.c            |   7 +
>   lib/kmapset.c             |   3 +-
>   16 files changed, 665 insertions(+), 106 deletions(-)
>   create mode 100644 fs/proc/ve.c
>   create mode 100644 fs/ve_perms.c
>   create mode 100644 include/linux/ve-perms.h
> 
> --
> 2.43.0
> 
> _______________________________________________
> Devel mailing list
> Devel at openvz.org
> https://lists.openvz.org/mailman/listinfo/devel

-- 
Best regards, Riabchun Vladimir
Linux Kernel Developer, Virtuozzo


