[Devel] [PATCH RHEL10 COMMIT] selftests/ve: regression test for CLONE_NEWVE owner correctness

Thu May 14 18:53:28 MSK 2026

The commit is pushed to "branch-rh10-6.12.0-55.52.1.5.x.vz10-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh10-6.12.0-55.52.1.5.24.vz10
------>
commit 4fe045fa7c00aa399c3e93745a416caef58fb971
Author: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Date:   Wed Apr 29 15:41:42 2026 +0200

    selftests/ve: regression test for CLONE_NEWVE owner correctness
    
    Add a small kselftest that exercises the case fixed by the preceding
    patches: combining CLONE_NEWVE with CLONE_NEWNET and CLONE_NEWNS in
    a single clone3() or unshare(), and verifying that the resulting net
    and mount namespaces are owned by the *new* ve, not the parent ve.
    
    Two test cases share the same shape:
    
      clone_newve_newnet_newns
        Do clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) in a fresh
        ve cgroup. Once the child signals readiness via a sync pipe,
        the parent inspects the new ve's counters.
    
      unshare_newve_newnet_newns
        A fork()ed child does unshare(CLONE_NEWVE | CLONE_NEWNET |
        CLONE_NEWNS) in a fresh ve cgroup. Once the child signals readiness
        via a sync pipe, the parent inspects the new ve's counters.
    
    Both tests call a single check_new_ve_owner() helper that asserts:
    
      ve.netns_avail_nr == VE_NETNS_MAX - 1
        FIXTURE_SETUP caps the new ve's ve.netns_max_nr to a small
        value (3) so the single netns charged to it is unambiguously
        detectable. Pre-fix this counter would have stayed at the cap
        because copy_net_ns() charged the parent ve via get_exec_env().
    
      ve.mnt_nr > 0
        copy_mnt_ns() / copy_tree() populates the new mntns by cloning
        the parent's mounts; each clone is charged to the new ve via
        ve_mount_nr_inc(). FIXTURE_SETUP additionally asserts that
        ve.mnt_nr starts at 0 on the freshly created ve cgroup, so the
        post-clone '> 0' assertion has well-defined meaning. Pre-fix
        the counter would have stayed at 0 (mounts charged to parent).
    
    Note: The mount-side check relies on the ve.mnt_nr cgroup file added by
    the preceding patch.
    
    Wire the new directory into tools/testing/selftests/Makefile.
    
    https://virtuozzo.atlassian.net/browse/VSTOR-129744
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    Reviewed-by: Vasileios Almpanis <vasileios.almpanis at virtuozzo.com>
    
    Feature: ve: ve generic structures
    ======
    Patchset description:
    ve: fix owner_ve of net/mnt namespaces created together with CLONE_NEWVE
    
    When CLONE_NEWVE is combined with CLONE_NEWNET and/or CLONE_NEWNS in a
    single clone3() or unshare(), copy_net_ns() and copy_mnt_ns() resolve
    the owning ve via get_exec_env(), which still points at the parent ve
    at that point. The freshly created net/mnt namespaces end up wired to
    the wrong ve, and unshare(CLONE_NEWVE | CLONE_NEW{NS,NET}) is rejected
    outright by check_unshare_flags().
    
    Fix it by threading the new ve from copy_namespaces() and
    unshare_nsproxy_namespaces() down into copy_net_ns() and copy_mnt_ns(),
    so the correct ve is charged for the new netns and for every mount in
    the new mntns.
    
    Patches 1-4 are pure plumbing (signature changes, no behaviour change).
    Patch 5 is the actual fix that forwards the new ve. Patch 6 drops the
    now-redundant CLONE_NEWVE-alone restriction in check_unshare_flags().
    Patch 7 exposes ve.mnt_nr via cgroupfs to make per-ve mount accounting
    observable from userspace. Patch 8 adds a selftest covering both the
    clone3() and unshare() paths.
    
    Verified with crash on a vzctl-started container: task_ve,
    nsproxy->net_ns->owner_ve, nsproxy->mnt_ns->ve_owner and
    nsproxy->mnt_ns->root.ve_owner all resolve to the new ve.
    The new selftest passes both cases.
---
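Reviewer note (illustration only, not part of the commit): the two-pipe
handshake that both test cases rely on can be sketched in isolation as
below. handshake_demo() and its child_fails_early parameter are
hypothetical names used only in this note, and a plain fork() stands in
for the clone3()/unshare() calls so the sketch runs on a host without
CLONE_NEWVE support.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the patch: run the handshake once.
 * Returns the byte the parent received from the child ('R' on success),
 * 0 if the parent saw EOF because the child exited before signalling,
 * or -1 on a setup or I/O error.
 */
static int handshake_demo(int child_fails_early)
{
	int c2p[2], p2c[2];
	char ready = 0, ack;
	ssize_t n;
	pid_t pid;

	if (pipe(c2p) < 0)
		return -1;
	if (pipe(p2c) < 0) {
		close(c2p[0]);
		close(c2p[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(c2p[0]);
		close(c2p[1]);
		close(p2c[0]);
		close(p2c[1]);
		return -1;
	}
	if (pid == 0) {
		close(c2p[0]);
		close(p2c[1]);
		if (child_fails_early)
			_exit(21);	/* stands in for a rejected unshare() */
		if (write(c2p[1], "R", 1) != 1)
			_exit(22);
		/* Block here so the child's namespaces would stay alive. */
		if (read(p2c[0], &ack, 1) != 1)
			_exit(23);
		_exit(0);
	}

	/* Parent: drop the unused ends so a dead child shows up as EOF. */
	close(c2p[1]);
	close(p2c[0]);

	n = read(c2p[0], &ready, 1);
	/* A real test would inspect the ve.* counters at this point. */
	if (n == 1 && write(p2c[1], "G", 1) != 1)
		n = -1;

	close(c2p[0]);
	close(p2c[1]);
	if (waitpid(pid, NULL, 0) < 0)
		return -1;
	if (n == 1)
		return ready;
	return n == 0 ? 0 : -1;
}
```

In this sketch handshake_demo(0) is expected to return 'R', while
handshake_demo(1) returns 0: the parent has closed its copy of the write
end, so a child that exits early is observed as EOF rather than as a hang.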
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/ve/.gitignore         |   1 +
 tools/testing/selftests/ve/Makefile           |   7 +
 tools/testing/selftests/ve/ve_ns_owner_test.c | 425 ++++++++++++++++++++++++++
 4 files changed, 434 insertions(+)
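
Reviewer note (illustration only, not part of the commit): the two
assertions made by check_new_ve_owner() reduce to simple arithmetic on
two counter files. The sketch below applies the same checks to a scratch
directory that merely imitates the layout of a ve cgroup; DEMO_NETNS_MAX,
put_file_at(), read_counter_at() and owner_check_demo() are hypothetical
names for this note, and the counter values are supplied by the caller
rather than by the kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_NETNS_MAX	3	/* mirrors VE_NETNS_MAX in the selftest */

/* Create (or truncate) a file relative to dirfd and write val into it. */
static int put_file_at(int dirfd, const char *name, const char *val)
{
	size_t len = strlen(val);
	ssize_t n;
	int fd;

	fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	n = write(fd, val, len);
	close(fd);
	return (n == (ssize_t)len) ? 0 : -1;
}

/* Read a decimal counter from a file relative to dirfd; 0 on success. */
static int read_counter_at(int dirfd, const char *name,
			   unsigned long long *out)
{
	char buf[32] = {0};
	ssize_t n;
	int fd;

	fd = openat(dirfd, name, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	*out = strtoull(buf, NULL, 10);
	return 0;
}

/*
 * Returns 1 if both checks hold (avail == cap - 1 and mnt > 0),
 * 0 if either fails, -1 on I/O error.
 */
static int owner_check_demo(const char *avail, const char *mnt)
{
	char dir[] = "/tmp/ve_demo_XXXXXX";
	unsigned long long a, m;
	int dirfd, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		goto out_rmdir;

	if (put_file_at(dirfd, "ve.netns_avail_nr", avail) ||
	    put_file_at(dirfd, "ve.mnt_nr", mnt))
		goto out;
	if (read_counter_at(dirfd, "ve.netns_avail_nr", &a) ||
	    read_counter_at(dirfd, "ve.mnt_nr", &m))
		goto out;

	ret = (a == DEMO_NETNS_MAX - 1) && (m > 0);
out:
	unlinkat(dirfd, "ve.netns_avail_nr", 0);
	unlinkat(dirfd, "ve.mnt_nr", 0);
	close(dirfd);
out_rmdir:
	rmdir(dir);
	return ret;
}
```

With the post-fix shape of the counters, for example "2" and any
non-zero mount count, owner_check_demo() returns 1. With the pre-fix
shape described in the commit message ("3" and "0") it returns 0.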

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b30e459572022..700ee6bd916fd 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -113,6 +113,7 @@ TARGETS += tty
 TARGETS += uevent
 TARGETS += user_events
 TARGETS += vDSO
+TARGETS += ve
 TARGETS += ve_printk
 TARGETS += mm
 TARGETS += x86
diff --git a/tools/testing/selftests/ve/.gitignore b/tools/testing/selftests/ve/.gitignore
new file mode 100644
index 0000000000000..7bfff054f9e86
--- /dev/null
+++ b/tools/testing/selftests/ve/.gitignore
@@ -0,0 +1 @@
+ve_ns_owner_test
diff --git a/tools/testing/selftests/ve/Makefile b/tools/testing/selftests/ve/Makefile
new file mode 100644
index 0000000000000..aa03ab02dda9d
--- /dev/null
+++ b/tools/testing/selftests/ve/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for ve selftests.
+CFLAGS += -g -Wall -O2
+
+TEST_GEN_PROGS += ve_ns_owner_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/ve/ve_ns_owner_test.c b/tools/testing/selftests/ve/ve_ns_owner_test.c
new file mode 100644
index 0000000000000..1f82955eb4c34
--- /dev/null
+++ b/tools/testing/selftests/ve/ve_ns_owner_test.c
@@ -0,0 +1,425 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ve_ns_owner selftests
+ *
+ * Regression tests for the case where CLONE_NEWVE is combined with
+ * CLONE_NEWNET and CLONE_NEWNS in a single clone3() or unshare()
+ * syscall.
+ *
+ * Historically copy_net_ns() and copy_mnt_ns() resolved the owning ve
+ * via get_exec_env() at the time of the call. Since get_exec_env()
+ * switched from cgroup based to namespace based lookup, it pointed at
+ * the parent's ve even when copy_ve_ns() had just installed a new ve on
+ * the child (clone path) or before unshare_ve_namespace()'s result was
+ * committed (unshare path). That left the freshly created network and
+ * mount namespaces wired to the wrong ve.
+ *
+ * Both tests follow the same shape: a child blocks inside a fresh ve
+ * with a new netns and mntns, the parent reads the new ve's counters
+ * via cgroupfs and asserts they reflect the just-created namespaces:
+ *   - ve.netns_avail_nr drops by exactly one (the new netns);
+ *   - ve.mnt_nr is strictly greater than zero (mounts copied into the
+ *     new mntns are accounted to the new ve).
+ *
+ * We never assert against the parent ve's counters: those are shared
+ * with everything else on the host (systemd, container managers, ...)
+ * and observing them is racy.
+ */
+#define _GNU_SOURCE
+#include <linux/sched.h>
+#include <linux/mount.h>
+#include <sched.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <errno.h>
+
+#include "../kselftest_harness.h"
+
+#define CTID_MIN		108
+#define CTID_MAX		200
+
+#ifndef CLONE_NEWVE
+#define CLONE_NEWVE		0x00000040
+#endif
+
+/*
+ * Make ve.netns_avail_nr movements easy to detect: with a small cap on
+ * the new ve, a single netns charged to it shows up as cap - 1.
+ */
+#define VE_NETNS_MAX		3
+
+static int write_file_at(int dirfd, const char *path, const char *val)
+{
+	int fd, ret;
+	size_t len = strlen(val);
+
+	fd = openat(dirfd, path, O_WRONLY);
+	if (fd < 0)
+		return -1;
+
+	ret = write(fd, val, len);
+	close(fd);
+	return (ret == (int)len) ? 0 : -1;
+}
+
+static int read_u64_at(int dirfd, const char *path, unsigned long long *out)
+{
+	char buf[32] = {0};
+	int fd, ret;
+
+	fd = openat(dirfd, path, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (ret <= 0)
+		return -1;
+
+	*out = strtoull(buf, NULL, 10);
+	return 0;
+}
+
+static int mount_cg2_fd(void)
+{
+	int fs_fd, mnt_fd;
+
+	fs_fd = syscall(__NR_fsopen, "cgroup2", 0);
+	if (fs_fd < 0)
+		return -1;
+
+	if (syscall(__NR_fsconfig, fs_fd, FSCONFIG_CMD_CREATE,
+		    NULL, NULL, 0) < 0) {
+		close(fs_fd);
+		return -1;
+	}
+
+	mnt_fd = syscall(__NR_fsmount, fs_fd, 0, 0);
+	close(fs_fd);
+	return mnt_fd;
+}
+
+static int enter_cgroup(int cgv2_fd, int ctid)
+{
+	char cg_path[64];
+	char pid_str[64];
+	int fd;
+	int ret;
+
+	if (ctid)
+		snprintf(cg_path, sizeof(cg_path), "%d/cgroup.procs", ctid);
+	else
+		snprintf(cg_path, sizeof(cg_path), "cgroup.procs");
+	fd = openat(cgv2_fd, cg_path, O_WRONLY);
+	if (fd < 0)
+		return -1;
+
+	snprintf(pid_str, sizeof(pid_str), "%d", getpid());
+	ret = write(fd, pid_str, strlen(pid_str));
+	if (ret < 0 || ret != (int)strlen(pid_str))
+		ret = -1;
+
+	close(fd);
+	return ret;
+}
+
+/*
+ * Synchronisation across the clone() boundary: child does its setup,
+ * tells parent it is ready, then blocks until parent acknowledges.
+ */
+struct sync_pipes {
+	int child_to_parent[2];
+	int parent_to_child[2];
+};
+
+static int sync_pipes_init(struct sync_pipes *s)
+{
+	if (pipe(s->child_to_parent) < 0)
+		return -1;
+	if (pipe(s->parent_to_child) < 0) {
+		close(s->child_to_parent[0]);
+		close(s->child_to_parent[1]);
+		return -1;
+	}
+	return 0;
+}
+
+static void sync_pipes_close(struct sync_pipes *s)
+{
+	close(s->child_to_parent[0]);
+	close(s->child_to_parent[1]);
+	close(s->parent_to_child[0]);
+	close(s->parent_to_child[1]);
+}
+
+/*
+ * Per-test context shared between parent and child.
+ */
+struct clone_args_ctx {
+	int cgv2_fd;
+	int ctid;
+	struct sync_pipes sync;
+};
+
+/*
+ * Child of the clone3 test. The interesting work (creating the new
+ * ve / netns / mntns and accounting them to the new ve) was done by
+ * the kernel during clone3 itself, so the child only needs to keep
+ * those namespaces alive while the parent inspects ve.* counters.
+ */
+static int clone_child_func(void *arg)
+{
+	struct clone_args_ctx *ctx = arg;
+	char ack;
+
+	close(ctx->sync.child_to_parent[0]);
+	close(ctx->sync.parent_to_child[1]);
+
+	if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
+		_exit(11);
+	if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
+		_exit(12);
+
+	_exit(0);
+}
+
+/*
+ * Before fix:
+ *   - clone path: ve.netns_avail_nr stays at VE_NETNS_MAX and
+ *     ve.mnt_nr stays at 0 because copy_net_ns()/copy_mnt_ns()
+ *     charged the parent ve via get_exec_env().
+ *   - unshare path: the syscall itself returned -EINVAL, so this
+ *     check was unreachable.
+ *
+ * After fix: the new ve is charged for the netns and for every mount
+ * copy_tree() puts into the new mntns.
+ */
+static void check_new_ve_owner(struct __test_metadata *_metadata,
+			       int cgv2_fd, int ctid)
+{
+	unsigned long long avail, mnt;
+	char path[64];
+
+	snprintf(path, sizeof(path), "%d/ve.netns_avail_nr", ctid);
+	ASSERT_EQ(read_u64_at(cgv2_fd, path, &avail), 0);
+	EXPECT_EQ(avail, VE_NETNS_MAX - 1);
+
+	snprintf(path, sizeof(path), "%d/ve.mnt_nr", ctid);
+	ASSERT_EQ(read_u64_at(cgv2_fd, path, &mnt), 0);
+	EXPECT_GT(mnt, 0);
+}
+
+FIXTURE(ve_ns_owner)
+{
+	int cgv2_fd;
+	int ctid;
+};
+
+FIXTURE_SETUP(ve_ns_owner)
+{
+	unsigned long long initial_mnt_nr;
+	char ctid_str[16];
+	char val[16];
+	char path[64];
+
+	self->cgv2_fd = mount_cg2_fd();
+	ASSERT_GE(self->cgv2_fd, 0);
+
+	ASSERT_EQ(write_file_at(self->cgv2_fd, "cgroup.subtree_control",
+		  "+cpuset +cpu +cpuacct +io +memory +hugetlb +pids +rdma +misc +ve"), 0);
+
+	ASSERT_EQ(write_file_at(self->cgv2_fd,
+		  "ve.default_sysfs_permissions", "/ rx"), 0);
+	ASSERT_EQ(write_file_at(self->cgv2_fd,
+		  "ve.default_sysfs_permissions", "fs rx"), 0);
+	ASSERT_EQ(write_file_at(self->cgv2_fd,
+		  "ve.default_sysfs_permissions", "fs/cgroup rw"), 0);
+
+	self->ctid = CTID_MIN;
+	while (self->ctid < CTID_MAX) {
+		snprintf(ctid_str, sizeof(ctid_str), "%d", self->ctid);
+		if (faccessat(self->cgv2_fd, ctid_str, F_OK, 0) != 0 &&
+		    errno == ENOENT)
+			break;
+		self->ctid++;
+	}
+	ASSERT_LT(self->ctid, CTID_MAX);
+
+	ASSERT_EQ(mkdirat(self->cgv2_fd, ctid_str, 0755), 0);
+
+	snprintf(path, sizeof(path), "%d/cgroup.controllers_hidden", self->ctid);
+	ASSERT_EQ(write_file_at(self->cgv2_fd, path, "-ve"), 0);
+
+	/*
+	 * ve.veid and ve.features are deliberately not configured: the
+	 * tests do not call ve.state=START, so a real veid identity and
+	 * feature mask are not needed. The owner_ve accounting we are
+	 * checking happens during clone3()/unshare() regardless.
+	 *
+	 * Cap the new ve's netns count so we can detect a single new
+	 * netns being accounted to it. We only assert against the new
+	 * ve's counter; the parent ve's counter is shared with the rest
+	 * of the host (systemd, container managers, ...) and is racy.
+	 */
+	snprintf(path, sizeof(path), "%d/ve.netns_max_nr", self->ctid);
+	snprintf(val, sizeof(val), "%d", VE_NETNS_MAX);
+	ASSERT_EQ(write_file_at(self->cgv2_fd, path, val), 0);
+
+	/*
+	 * The new ve cgroup has not been entered by anything yet, so its
+	 * mnt_nr counter must start at 0. Each test below verifies that
+	 * the clone/unshare populates the new mntns under this ve, i.e.
+	 * mnt_nr rises strictly above zero.
+	 */
+	snprintf(path, sizeof(path), "%d/ve.mnt_nr", self->ctid);
+	ASSERT_EQ(read_u64_at(self->cgv2_fd, path, &initial_mnt_nr), 0);
+	ASSERT_EQ(initial_mnt_nr, 0);
+}
+
+FIXTURE_TEARDOWN(ve_ns_owner)
+{
+	char path[64];
+
+	enter_cgroup(self->cgv2_fd, 0);
+	snprintf(path, sizeof(path), "%d", self->ctid);
+	unlinkat(self->cgv2_fd, path, AT_REMOVEDIR);
+	close(self->cgv2_fd);
+}
+
+/*
+ * clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS): the new netns and
+ * mntns must be accounted to the *new* ve, not to the parent ve.
+ *
+ * Before the fix, copy_net_ns()/copy_mnt_ns() looked up the owning ve
+ * via get_exec_env(), which inside copy_namespaces() still resolved to
+ * the parent ve - copy_ve_ns() had stored the new ve on the child but
+ * the running task was still the parent. After the fix copy_namespaces()
+ * forwards tsk->task_ve to both helpers and the right ve is charged.
+ */
+TEST_F(ve_ns_owner, clone_newve_newnet_newns)
+{
+	struct clone_args_ctx ctx = {
+		.cgv2_fd = self->cgv2_fd,
+		.ctid = self->ctid,
+	};
+	struct clone_args cargs = {
+		.flags = CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS,
+		.exit_signal = SIGCHLD,
+	};
+	char ready;
+	int status;
+	pid_t pid;
+
+	ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
+	ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
+
+	pid = syscall(__NR_clone3, &cargs, sizeof(cargs));
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		clone_child_func(&ctx);
+
+	close(ctx.sync.child_to_parent[1]);
+	close(ctx.sync.parent_to_child[0]);
+
+	ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
+
+	/* Wait for the child to be settled before measuring. */
+	ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
+	ASSERT_EQ(ready, 'R');
+
+	check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
+
+	/* Release the child. */
+	ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
+	ASSERT_GE(waitpid(pid, &status, 0), 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	sync_pipes_close(&ctx.sync);
+}
+
+/*
+ * unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) used to be
+ * rejected with -EINVAL by check_unshare_flags() because of the same
+ * "wrong owner_ve" problem. After the fix the syscall succeeds and
+ * the resulting net/mnt namespaces are owned by the new ve.
+ *
+ * The accounting check is identical to the clone3 case and uses the
+ * same check_new_ve_owner() helper.
+ */
+static int unshare_child_func(struct clone_args_ctx *ctx)
+{
+	char ack;
+
+	close(ctx->sync.child_to_parent[0]);
+	close(ctx->sync.parent_to_child[1]);
+
+	if (unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) < 0)
+		return 21;
+
+	if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
+		return 22;
+	if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
+		return 23;
+
+	return 0;
+}
+
+TEST_F(ve_ns_owner, unshare_newve_newnet_newns)
+{
+	struct clone_args_ctx ctx = {
+		.cgv2_fd = self->cgv2_fd,
+		.ctid = self->ctid,
+	};
+	char ready;
+	int status;
+	pid_t pid;
+
+	ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
+	ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
+
+	/*
+	 * Use a plain fork: the unshare(CLONE_NEWVE) is performed by
+	 * the child itself, since unshare requires single-threaded
+	 * context and we don't want to permanently move the test
+	 * process into a new ve.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		_exit(unshare_child_func(&ctx));
+
+	close(ctx.sync.child_to_parent[1]);
+	close(ctx.sync.parent_to_child[0]);
+
+	ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
+
+	/*
+	 * If the child failed before writing 'R' (for example because
+	 * unshare() was rejected), its end of the pipe is closed on
+	 * exit and this read returns 0 (EOF), so the assertion below
+	 * fails instead of blocking forever.
+	 */
+	ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
+	ASSERT_EQ(ready, 'R');
+
+	check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
+
+	ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
+	ASSERT_GE(waitpid(pid, &status, 0), 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	sync_pipes_close(&ctx.sync);
+}
+
+TEST_HARNESS_MAIN