[Devel] [PATCH RHEL10 COMMIT] selftests/ve: regression test for CLONE_NEWVE owner correctness
Konstantin Khorenko
khorenko at virtuozzo.com
Thu May 14 18:53:28 MSK 2026
The commit is pushed to "branch-rh10-6.12.0-55.52.1.5.x.vz10-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh10-6.12.0-55.52.1.5.24.vz10
------>
commit 4fe045fa7c00aa399c3e93745a416caef58fb971
Author: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Date: Wed Apr 29 15:41:42 2026 +0200
selftests/ve: regression test for CLONE_NEWVE owner correctness
Add a small kselftest that exercises the case fixed by the preceding
patches: combining CLONE_NEWVE with CLONE_NEWNET and CLONE_NEWNS in
a single clone3() or unshare(), and verifying that the resulting net
and mount namespaces are owned by the *new* ve, not the parent ve.
Two test cases share the same shape:
  clone_newve_newnet_newns
    Do clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) in a fresh
    ve cgroup. Once the child signals readiness via a sync pipe,
    the parent inspects the new ve's counters.
  unshare_newve_newnet_newns
    A fork()ed child does unshare(CLONE_NEWVE | CLONE_NEWNET |
    CLONE_NEWNS) in a fresh ve cgroup. Once the child signals readiness
    via a sync pipe, the parent inspects the new ve's counters.
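The readiness handshake that both test cases rely on can be sketched in
isolation. This is a minimal illustration, not the selftest's code:
run_handshake() is a hypothetical name, the child performs no clone3() or
unshare(), and the point where the parent would read the ve counters is
only marked by a comment.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that reports 'R' and then parks
 * until the parent sends a byte back. Returns the child's exit code,
 * or -1 if the setup or the handshake itself failed.
 */
static int run_handshake(void)
{
	int c2p[2], p2c[2], status, ok;
	char byte;
	pid_t pid;

	if (pipe(c2p) < 0)
		return -1;
	if (pipe(p2c) < 0) {
		close(c2p[0]);
		close(c2p[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(c2p[0]);
		close(c2p[1]);
		close(p2c[0]);
		close(p2c[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: drop unused ends, report ready, wait for release. */
		close(c2p[0]);
		close(p2c[1]);
		if (write(c2p[1], "R", 1) != 1)
			_exit(11);
		if (read(p2c[0], &byte, 1) != 1)
			_exit(12);
		_exit(0);
	}
	assert(pid > 0);	/* only the parent continues past this point */

	close(c2p[1]);
	close(p2c[0]);

	ok = read(c2p[0], &byte, 1) == 1 && byte == 'R';
	/* The child is parked now; a test would inspect counters here. */
	if (ok)
		ok = write(p2c[1], "G", 1) == 1;

	close(c2p[0]);
	close(p2c[1]);
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return ok ? WEXITSTATUS(status) : -1;
}
```

Using two pipes rather than one keeps the two directions independent: the
child cannot consume its own readiness byte, and closing the parent's write
end releases a parked child with EOF even when the parent bails out early.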
Both tests call a single check_new_ve_owner() helper that asserts:
  ve.netns_avail_nr == VE_NETNS_MAX - 1
    FIXTURE_SETUP caps the new ve's ve.netns_max_nr to a small
    value (3) so the single netns charged to it is unambiguously
    detectable. Pre-fix this counter would have stayed at the cap
    because copy_net_ns() charged the parent ve via get_exec_env().
  ve.mnt_nr > 0
    copy_mnt_ns() / copy_tree() populates the new mntns by cloning
    the parent's mounts; each clone is charged to the new ve via
    ve_mount_nr_inc(). FIXTURE_SETUP additionally asserts that
    ve.mnt_nr starts at 0 on the freshly created ve cgroup, so the
    post-clone '> 0' assertion has a well-defined meaning. Pre-fix
    the counter would have stayed at 0 (mounts charged to parent).
Note: The mount-side check relies on the ve.mnt_nr cgroup file added by
the preceding patch.
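The two assertions can be sketched as a standalone check against a cgroup
directory fd. This is a rough sketch under stated assumptions, not the
helper from the patch: read_counter_at() and new_ve_was_charged() are
hypothetical names, and ve_dirfd is assumed to be an open fd for the new
ve's cgroup directory that contains the two counter files.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a decimal counter from a file relative to dirfd; 0 on success. */
static int read_counter_at(int dirfd, const char *name,
			   unsigned long long *out)
{
	char buf[32] = {0};
	ssize_t n;
	int fd;

	assert(out != NULL);
	fd = openat(dirfd, name, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	*out = strtoull(buf, NULL, 10);
	return 0;
}

/*
 * Returns 1 when exactly one netns and at least one mount are charged
 * to the ve behind ve_dirfd, 0 when they are not, and -1 when a
 * counter file cannot be read.
 */
static int new_ve_was_charged(int ve_dirfd, unsigned long long netns_max)
{
	unsigned long long avail, mnt;

	if (read_counter_at(ve_dirfd, "ve.netns_avail_nr", &avail) < 0 ||
	    read_counter_at(ve_dirfd, "ve.mnt_nr", &mnt) < 0)
		return -1;
	return avail == netns_max - 1 && mnt > 0;
}
```

With a cap of 3, a post-fix ve would be expected to report 2 available
namespaces and a non-zero mount count, while the pre-fix values (3 and 0)
make the check return 0.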
Wire the new directory into tools/testing/selftests/Makefile.
https://virtuozzo.atlassian.net/browse/VSTOR-129744
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Reviewed-by: Vasileios Almpanis <vasileios.almpanis at virtuozzo.com>
Feature: ve: ve generic structures
======
Patchset description:
ve: fix owner_ve of net/mnt namespaces created together with CLONE_NEWVE
When CLONE_NEWVE is combined with CLONE_NEWNET and/or CLONE_NEWNS in a
single clone3() or unshare(), copy_net_ns() and copy_mnt_ns() resolve
the owning ve via get_exec_env(), which still points at the parent ve
at that point. The freshly created net/mnt namespaces end up wired to
the wrong ve, and unshare(CLONE_NEWVE | CLONE_NEW{NS,NET}) is rejected
outright by check_unshare_flags().
Fix it by threading the new ve from copy_namespaces() and
unshare_nsproxy_namespaces() down into copy_net_ns() and copy_mnt_ns(),
so the correct ve is charged for the new netns and for every mount in
the new mntns.
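The difference between the two lookup styles can be modelled in plain
userspace C. This is a toy model only: every toy_* name below is a
hypothetical stand-in invented for illustration, and none of it is the
kernel's actual types, signatures or accounting code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are not the kernel's real structures. */
struct toy_ve {
	int netns_nr;	/* namespaces charged to this ve */
};

struct toy_task {
	struct toy_ve *ve;
};

/* The task executing the syscall, which is what the lookup consults. */
static struct toy_task *toy_current;

static struct toy_ve *toy_get_exec_env(void)
{
	return toy_current->ve;
}

/* Pre-fix shape: the owner is derived from the running task. */
static struct toy_ve *toy_copy_net_ns_old(void)
{
	struct toy_ve *owner = toy_get_exec_env();

	owner->netns_nr++;
	return owner;
}

/* Post-fix shape: the caller hands over the ve it has just created. */
static struct toy_ve *toy_copy_net_ns_new(struct toy_ve *owner)
{
	assert(owner != NULL);
	owner->netns_nr++;
	return owner;
}
```

In this model, while toy_current still refers to the parent task, the old
variant always charges the parent's ve no matter which ve was created for
the child; the new variant charges whatever owner the caller passes in.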
Patches 1-4 are pure plumbing (signature changes, no behaviour change).
Patch 5 is the actual fix that forwards the new ve. Patch 6 drops the
now-redundant CLONE_NEWVE-alone restriction in check_unshare_flags().
Patch 7 exposes ve.mnt_nr via cgroupfs to make per-ve mount accounting
observable from userspace. Patch 8 adds a selftest covering both the
clone3() and unshare() paths.
Verified with crash on a vzctl-started container: task_ve,
nsproxy->net_ns->owner_ve, nsproxy->mnt_ns->ve_owner and
nsproxy->mnt_ns->root.ve_owner all resolve to the new ve.
The new selftest passes both cases.
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/ve/.gitignore | 1 +
tools/testing/selftests/ve/Makefile | 7 +
tools/testing/selftests/ve/ve_ns_owner_test.c | 425 ++++++++++++++++++++++++++
4 files changed, 434 insertions(+)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b30e459572022..700ee6bd916fd 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -113,6 +113,7 @@ TARGETS += tty
TARGETS += uevent
TARGETS += user_events
TARGETS += vDSO
+TARGETS += ve
TARGETS += ve_printk
TARGETS += mm
TARGETS += x86
diff --git a/tools/testing/selftests/ve/.gitignore b/tools/testing/selftests/ve/.gitignore
new file mode 100644
index 0000000000000..7bfff054f9e86
--- /dev/null
+++ b/tools/testing/selftests/ve/.gitignore
@@ -0,0 +1 @@
+ve_ns_owner_test
diff --git a/tools/testing/selftests/ve/Makefile b/tools/testing/selftests/ve/Makefile
new file mode 100644
index 0000000000000..aa03ab02dda9d
--- /dev/null
+++ b/tools/testing/selftests/ve/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for ve selftests.
+CFLAGS += -g -Wall -O2
+
+TEST_GEN_PROGS += ve_ns_owner_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/ve/ve_ns_owner_test.c b/tools/testing/selftests/ve/ve_ns_owner_test.c
new file mode 100644
index 0000000000000..1f82955eb4c34
--- /dev/null
+++ b/tools/testing/selftests/ve/ve_ns_owner_test.c
@@ -0,0 +1,425 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ve_ns_owner selftests
+ *
+ * Regression tests for the case where CLONE_NEWVE is combined with
+ * CLONE_NEWNET and CLONE_NEWNS in a single clone3() or unshare()
+ * syscall.
+ *
+ * Historically copy_net_ns() and copy_mnt_ns() resolved the owning ve
+ * via get_exec_env() at the time of the call. Since get_exec_env()
+ * switched from a cgroup-based to a namespace-based lookup, it still
+ * pointed at the parent's ve right after copy_ve_ns() had installed a
+ * new ve on the child (clone path), or before the result of
+ * unshare_ve_namespace() was committed (unshare path). That left the
+ * freshly created network and mount namespaces wired to the wrong ve.
+ *
+ * Both tests follow the same shape: a child blocks inside a fresh ve
+ * with a new netns and mntns, the parent reads the new ve's counters
+ * via cgroupfs and asserts they reflect the just-created namespaces:
+ * - ve.netns_avail_nr drops by exactly one (the new netns);
+ * - ve.mnt_nr is strictly greater than zero (mounts copied into the
+ * new mntns are accounted to the new ve).
+ *
+ * We never assert against the parent ve's counters: those are shared
+ * with everything else on the host (systemd, container managers, ...)
+ * and observing them is racy.
+ */
+#define _GNU_SOURCE
+#include <linux/sched.h>
+#include <linux/mount.h>
+#include <sched.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <errno.h>
+
+#include "../kselftest_harness.h"
+
+#define CTID_MIN 108
+#define CTID_MAX 200
+
+#ifndef CLONE_NEWVE
+#define CLONE_NEWVE 0x00000040
+#endif
+
+/*
+ * Make ve.netns_avail_nr movements easy to detect: with a small cap on
+ * the new ve, the single netns charged to it is unambiguously visible.
+ */
+#define VE_NETNS_MAX 3
+
+static int write_file_at(int dirfd, const char *path, const char *val)
+{
+ int fd, ret;
+ size_t len = strlen(val);
+
+ fd = openat(dirfd, path, O_WRONLY);
+ if (fd < 0)
+ return -1;
+
+ ret = write(fd, val, len);
+ close(fd);
+ return (ret == (int)len) ? 0 : -1;
+}
+
+static int read_u64_at(int dirfd, const char *path, unsigned long long *out)
+{
+ char buf[32] = {0};
+ int fd, ret;
+
+ fd = openat(dirfd, path, O_RDONLY);
+ if (fd < 0)
+ return -1;
+
+ ret = read(fd, buf, sizeof(buf) - 1);
+ close(fd);
+ if (ret <= 0)
+ return -1;
+
+ *out = strtoull(buf, NULL, 10);
+ return 0;
+}
+
+static int mount_cg2_fd(void)
+{
+ int fs_fd, mnt_fd;
+
+ fs_fd = syscall(__NR_fsopen, "cgroup2", 0);
+ if (fs_fd < 0)
+ return -1;
+
+ if (syscall(__NR_fsconfig, fs_fd, FSCONFIG_CMD_CREATE,
+ NULL, NULL, 0) < 0) {
+ close(fs_fd);
+ return -1;
+ }
+
+ mnt_fd = syscall(__NR_fsmount, fs_fd, 0, 0);
+ close(fs_fd);
+ return mnt_fd;
+}
+
+static int enter_cgroup(int cgv2_fd, int ctid)
+{
+ char cg_path[64];
+ char pid_str[64];
+ int fd;
+ int ret;
+
+ if (ctid)
+ snprintf(cg_path, sizeof(cg_path), "%d/cgroup.procs", ctid);
+ else
+ snprintf(cg_path, sizeof(cg_path), "cgroup.procs");
+ fd = openat(cgv2_fd, cg_path, O_WRONLY);
+ if (fd < 0)
+ return -1;
+
+ snprintf(pid_str, sizeof(pid_str), "%d", getpid());
+ ret = write(fd, pid_str, strlen(pid_str));
+ if (ret < 0 || ret != (int)strlen(pid_str))
+ ret = -1;
+
+ close(fd);
+ return ret;
+}
+
+/*
+ * Synchronisation across the clone() boundary: child does its setup,
+ * tells parent it is ready, then blocks until parent acknowledges.
+ */
+struct sync_pipes {
+ int child_to_parent[2];
+ int parent_to_child[2];
+};
+
+static int sync_pipes_init(struct sync_pipes *s)
+{
+ if (pipe(s->child_to_parent) < 0)
+ return -1;
+ if (pipe(s->parent_to_child) < 0) {
+ close(s->child_to_parent[0]);
+ close(s->child_to_parent[1]);
+ return -1;
+ }
+ return 0;
+}
+
+static void sync_pipes_close(struct sync_pipes *s)
+{
+ close(s->child_to_parent[0]);
+ close(s->child_to_parent[1]);
+ close(s->parent_to_child[0]);
+ close(s->parent_to_child[1]);
+}
+
+/*
+ * Per-test context shared between parent and child.
+ */
+struct clone_args_ctx {
+ int cgv2_fd;
+ int ctid;
+ struct sync_pipes sync;
+};
+
+/*
+ * Child of the clone3 test. The interesting work (creating the new
+ * ve / netns / mntns and accounting them to the new ve) was done by
+ * the kernel during clone3 itself, so the child only needs to keep
+ * those namespaces alive while the parent inspects ve.* counters.
+ */
+static int clone_child_func(void *arg)
+{
+ struct clone_args_ctx *ctx = arg;
+ char ack;
+
+ close(ctx->sync.child_to_parent[0]);
+ close(ctx->sync.parent_to_child[1]);
+
+ if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
+ _exit(11);
+ if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
+ _exit(12);
+
+ _exit(0);
+}
+
+/*
+ * Before fix:
+ * - clone path: ve.netns_avail_nr stays at VE_NETNS_MAX and
+ * ve.mnt_nr stays at 0 because copy_net_ns()/copy_mnt_ns()
+ * charged the parent ve via get_exec_env().
+ * - unshare path: the syscall itself returned -EINVAL, so this
+ * check was unreachable.
+ *
+ * After fix: the new ve is charged for the netns and for every mount
+ * copy_tree() puts into the new mntns.
+ */
+static void check_new_ve_owner(struct __test_metadata *_metadata,
+ int cgv2_fd, int ctid)
+{
+ unsigned long long avail, mnt;
+ char path[64];
+
+ snprintf(path, sizeof(path), "%d/ve.netns_avail_nr", ctid);
+ ASSERT_EQ(read_u64_at(cgv2_fd, path, &avail), 0);
+ EXPECT_EQ(avail, VE_NETNS_MAX - 1);
+
+ snprintf(path, sizeof(path), "%d/ve.mnt_nr", ctid);
+ ASSERT_EQ(read_u64_at(cgv2_fd, path, &mnt), 0);
+ EXPECT_GT(mnt, 0);
+}
+
+FIXTURE(ve_ns_owner)
+{
+ int cgv2_fd;
+ int ctid;
+};
+
+FIXTURE_SETUP(ve_ns_owner)
+{
+ unsigned long long initial_mnt_nr;
+ char ctid_str[16];
+ char val[16];
+ char path[64];
+
+ self->cgv2_fd = mount_cg2_fd();
+ ASSERT_GE(self->cgv2_fd, 0);
+
+ ASSERT_EQ(write_file_at(self->cgv2_fd, "cgroup.subtree_control",
+ "+cpuset +cpu +cpuacct +io +memory +hugetlb +pids +rdma +misc +ve"), 0);
+
+ ASSERT_EQ(write_file_at(self->cgv2_fd,
+ "ve.default_sysfs_permissions", "/ rx"), 0);
+ ASSERT_EQ(write_file_at(self->cgv2_fd,
+ "ve.default_sysfs_permissions", "fs rx"), 0);
+ ASSERT_EQ(write_file_at(self->cgv2_fd,
+ "ve.default_sysfs_permissions", "fs/cgroup rw"), 0);
+
+ self->ctid = CTID_MIN;
+ while (self->ctid < CTID_MAX) {
+ snprintf(ctid_str, sizeof(ctid_str), "%d", self->ctid);
+ if (faccessat(self->cgv2_fd, ctid_str, F_OK, 0) != 0 &&
+ errno == ENOENT)
+ break;
+ self->ctid++;
+ }
+ ASSERT_LT(self->ctid, CTID_MAX);
+
+ ASSERT_EQ(mkdirat(self->cgv2_fd, ctid_str, 0755), 0);
+
+ snprintf(path, sizeof(path), "%d/cgroup.controllers_hidden", self->ctid);
+ ASSERT_EQ(write_file_at(self->cgv2_fd, path, "-ve"), 0);
+
+ /*
+ * ve.veid and ve.features are deliberately not configured: the
+ * tests do not call ve.state=START, so a real veid identity and
+ * feature mask are not needed. The owner_ve accounting we are
+ * checking happens during clone3()/unshare() regardless.
+ *
+ * Cap the new ve's netns count so we can detect a single new
+ * netns being accounted to it. We only assert against the new
+ * ve's counter; the parent ve's counter is shared with the rest
+ * of the host (systemd, container managers, ...) and is racy.
+ */
+ snprintf(path, sizeof(path), "%d/ve.netns_max_nr", self->ctid);
+ snprintf(val, sizeof(val), "%d", VE_NETNS_MAX);
+ ASSERT_EQ(write_file_at(self->cgv2_fd, path, val), 0);
+
+ /*
+ * The new ve cgroup has not been entered by anything yet, so its
+ * mnt_nr counter must start at 0. Each test below verifies that
+ * the clone/unshare populates the new mntns under this ve, i.e.
+ * mnt_nr rises strictly above zero.
+ */
+ snprintf(path, sizeof(path), "%d/ve.mnt_nr", self->ctid);
+ ASSERT_EQ(read_u64_at(self->cgv2_fd, path, &initial_mnt_nr), 0);
+ ASSERT_EQ(initial_mnt_nr, 0);
+}
+
+FIXTURE_TEARDOWN(ve_ns_owner)
+{
+ char path[64];
+
+ enter_cgroup(self->cgv2_fd, 0);
+ snprintf(path, sizeof(path), "%d", self->ctid);
+ unlinkat(self->cgv2_fd, path, AT_REMOVEDIR);
+ close(self->cgv2_fd);
+}
+
+/*
+ * clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS): the new netns and
+ * mntns must be accounted to the *new* ve, not to the parent ve.
+ *
+ * Before the fix, copy_net_ns()/copy_mnt_ns() looked up the owning ve
+ * via get_exec_env(), which inside copy_namespaces() still resolved to
+ * the parent ve - copy_ve_ns() had stored the new ve on the child but
+ * the running task was still the parent. After the fix copy_namespaces()
+ * forwards tsk->task_ve to both helpers and the right ve is charged.
+ */
+TEST_F(ve_ns_owner, clone_newve_newnet_newns)
+{
+ struct clone_args_ctx ctx = {
+ .cgv2_fd = self->cgv2_fd,
+ .ctid = self->ctid,
+ };
+ struct clone_args cargs = {
+ .flags = CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS,
+ .exit_signal = SIGCHLD,
+ };
+ char ready;
+ int status;
+ pid_t pid;
+
+ ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
+ ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
+
+ pid = syscall(__NR_clone3, &cargs, sizeof(cargs));
+ ASSERT_GE(pid, 0);
+ if (pid == 0)
+ clone_child_func(&ctx);
+
+ close(ctx.sync.child_to_parent[1]);
+ close(ctx.sync.parent_to_child[0]);
+
+ ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
+
+ /* Wait for the child to be settled before measuring. */
+ ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
+ ASSERT_EQ(ready, 'R');
+
+ check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
+
+ /* Release the child. */
+ ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
+ ASSERT_GE(waitpid(pid, &status, 0), 0);
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ sync_pipes_close(&ctx.sync);
+}
+
+/*
+ * unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) used to be
+ * rejected with -EINVAL by check_unshare_flags() because of the same
+ * "wrong owner_ve" problem. After the fix the syscall succeeds and
+ * the resulting net/mnt namespaces are owned by the new ve.
+ *
+ * The accounting check is identical to the clone3 case and uses the
+ * same check_new_ve_owner() helper.
+ */
+static int unshare_child_func(struct clone_args_ctx *ctx)
+{
+ char ack;
+
+ close(ctx->sync.child_to_parent[0]);
+ close(ctx->sync.parent_to_child[1]);
+
+ if (unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) < 0)
+ return 21;
+
+ if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
+ return 22;
+ if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
+ return 23;
+
+ return 0;
+}
+
+TEST_F(ve_ns_owner, unshare_newve_newnet_newns)
+{
+ struct clone_args_ctx ctx = {
+ .cgv2_fd = self->cgv2_fd,
+ .ctid = self->ctid,
+ };
+ char ready;
+ int status;
+ pid_t pid;
+
+ ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
+ ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
+
+ /*
+ * Use a plain fork: the unshare(CLONE_NEWVE) is performed by
+ * the child itself, since unshare requires single-threaded
+ * context and we don't want to permanently move the test
+ * process into a new ve.
+ */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+ if (pid == 0)
+ _exit(unshare_child_func(&ctx));
+
+ close(ctx.sync.child_to_parent[1]);
+ close(ctx.sync.parent_to_child[0]);
+
+ ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
+
+ /*
+ * If the child failed before writing 'R' (for example because
+ * unshare() was rejected), the pipe read returns 0 (EOF) and the
+ * assertion below fails with a generic "expected 1" message.
+ * The child is not reaped first, so its exit code is not shown.
+ */
+ ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
+ ASSERT_EQ(ready, 'R');
+
+ check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
+
+ ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
+ ASSERT_GE(waitpid(pid, &status, 0), 0);
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 0);
+
+ sync_pipes_close(&ctx.sync);
+}
+
+TEST_HARNESS_MAIN