[Devel] [PATCH VZ10 8/8] selftests/ve: regression test for CLONE_NEWVE owner correctness
Konstantin Khorenko
khorenko at virtuozzo.com
Thu May 14 18:35:01 MSK 2026
1. Double-close of file descriptors: In both tests the parent closes two pipe ends manually, then
calls sync_pipes_close() which closes all four. This results in a double-close on two fds:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 325-326
close(ctx.sync.child_to_parent[1]);
close(ctx.sync.parent_to_child[0]);
And then at the end:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 341-341
sync_pipes_close(&ctx.sync);
Harmless in a single-threaded test, but technically a bug. Either drop sync_pipes_close() at the
end and close the remaining two fds explicitly, or set closed fds to -1 and check in sync_pipes_close:
static void sync_pipe_close_fd(int *fd)
{
if (*fd >= 0) {
close(*fd);
*fd = -1;
}
}
The same issue exists in the unshare test at lines 400-401 / 413.
======================
2. clone3 child theoretical fallthrough
Here is the current clone_child_func:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 168-182
static int clone_child_func(void *arg)
{
struct clone_args_ctx *ctx = arg;
char ack;
close(ctx->sync.child_to_parent[0]);
close(ctx->sync.parent_to_child[1]);
if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
_exit(11);
if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
_exit(12);
_exit(0);
}
The function always terminates via _exit() - three different paths, all three call _exit. Formally
it is impossible to fall through. But the call site looks like this:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 316-317
if (pid == 0)
clone_child_func(&ctx);
There is no _exit or return after clone_child_func(&ctx). If someone later refactors
clone_child_func and replaces _exit(0) with return 0 (by analogy with unshare_child_func, which does
exactly that), the child will fall through into the parent's test code - two processes will
simultaneously work with the pipes, call waitpid, etc.
For comparison, unshare_child_func is written differently:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 353-369
static int unshare_child_func(struct clone_args_ctx *ctx)
{
char ack;
close(ctx->sync.child_to_parent[0]);
close(ctx->sync.parent_to_child[1]);
if (unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) < 0)
return 21;
if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
return 22;
if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
return 23;
return 0;
}
It uses return, and _exit is at the call site:
tools/testing/selftests/ve/ve_ns_owner_test.c lines 395-396
if (pid == 0)
_exit(unshare_child_func(&ctx));
This is safer: _exit is guaranteed at the call site regardless of what the function does
internally. If clone_child_func is brought to the same style (replace _exit(N) with return N and call
it as
_exit(clone_child_func(&ctx))), both paths become uniform and protected against fallthrough.
==========================
3. CTID collision: The free-CTID scan over CTID_MIN..CTID_MAX works but the range is narrow (92
values). If other tests left stale cgroups behind, the scan can exhaust the range and fail. Not
critical for a selftest, but a mkdtemp-like approach (e.g. random or PID-based suffix) would be more
robust.
On 4/29/26 15:41, Pavel Tikhomirov wrote:
> Add a small kselftest that exercises the case fixed by the preceding
> patches: combining CLONE_NEWVE with CLONE_NEWNET and CLONE_NEWNS in
> a single clone3() or unshare(), and verifying that the resulting net
> and mount namespaces are owned by the *new* ve, not the parent ve.
>
> Two test cases share the same shape:
>
> clone_newve_newnet_newns
> Do clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) in a fresh
> ve cgroup. Once the child signals readiness via a sync pipe,
> the parent inspects the new ve's counters.
>
> unshare_newve_newnet_newns
> A fork()ed child does unshare(CLONE_NEWVE | CLONE_NEWNET |
> CLONE_NEWNS) in a fresh ve cgroup. Once the child signals readiness
> via a sync pipe, the parent inspects the new ve's counters.
>
> Both tests call a single check_new_ve_owner() helper that asserts:
>
> ve.netns_avail_nr == VE_NETNS_MAX - 1
> FIXTURE_SETUP caps the new ve's ve.netns_max_nr to a small
> value (3) so the single netns charged to it is unambiguously
> detectable. Pre-fix this counter would have stayed at the cap
> because copy_net_ns() charged the parent ve via get_exec_env().
>
> ve.mnt_nr > 0
> copy_mnt_ns() / copy_tree() populates the new mntns by cloning
> the parent's mounts; each clone is charged to the new ve via
> ve_mount_nr_inc(). FIXTURE_SETUP additionally asserts that
> ve.mnt_nr starts at 0 on the freshly created ve cgroup, so the
> post-clone '> 0' assertion has well-defined meaning. Pre-fix
> the counter would have stayed at 0 (mounts charged to parent).
>
> Note: The mount-side check relies on the ve.mnt_nr cgroup file added by
> the preceding patch.
>
> Wire the new directory into tools/testing/selftests/Makefile.
>
> https://virtuozzo.atlassian.net/browse/VSTOR-129744
> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> Feature: ve: ve generic structures
> ---
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/ve/.gitignore | 1 +
> tools/testing/selftests/ve/Makefile | 7 +
> tools/testing/selftests/ve/ve_ns_owner_test.c | 425 ++++++++++++++++++
> 4 files changed, 434 insertions(+)
> create mode 100644 tools/testing/selftests/ve/.gitignore
> create mode 100644 tools/testing/selftests/ve/Makefile
> create mode 100644 tools/testing/selftests/ve/ve_ns_owner_test.c
>
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index b30e45957202..700ee6bd916f 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -113,6 +113,7 @@ TARGETS += tty
> TARGETS += uevent
> TARGETS += user_events
> TARGETS += vDSO
> +TARGETS += ve
> TARGETS += ve_printk
> TARGETS += mm
> TARGETS += x86
> diff --git a/tools/testing/selftests/ve/.gitignore b/tools/testing/selftests/ve/.gitignore
> new file mode 100644
> index 000000000000..7bfff054f9e8
> --- /dev/null
> +++ b/tools/testing/selftests/ve/.gitignore
> @@ -0,0 +1 @@
> +ve_ns_owner_test
> diff --git a/tools/testing/selftests/ve/Makefile b/tools/testing/selftests/ve/Makefile
> new file mode 100644
> index 000000000000..aa03ab02dda9
> --- /dev/null
> +++ b/tools/testing/selftests/ve/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Makefile for ve selftests.
> +CFLAGS += -g -Wall -O2
> +
> +TEST_GEN_PROGS += ve_ns_owner_test
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/ve/ve_ns_owner_test.c b/tools/testing/selftests/ve/ve_ns_owner_test.c
> new file mode 100644
> index 000000000000..1f82955eb4c3
> --- /dev/null
> +++ b/tools/testing/selftests/ve/ve_ns_owner_test.c
> @@ -0,0 +1,425 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ve_ns_owner selftests
> + *
> + * Regression tests for the case where CLONE_NEWVE is combined with
> + * CLONE_NEWNET and CLONE_NEWNS in a single clone3() or unshare()
> + * syscall.
> + *
> + * Historically copy_net_ns() and copy_mnt_ns() resolved the owning ve
> + * via get_exec_env() at the time of the call, which since we've
> + * switched from cgroup based to namespace based get_exec_env() pointed
> + * at the parent's ve when copy_ve_ns() had just installed a new ve on
> + * the child (clone path) or before unshare_ve_namespace()'s result was
> + * committed (unshare path). That left the freshly created network and
> + * mount namespaces wired to the wrong ve.
> + *
> + * Both tests follow the same shape: a child blocks inside a fresh ve
> + * with a new netns and mntns, the parent reads the new ve's counters
> + * via cgroupfs and asserts they reflect the just-created namespaces:
> + * - ve.netns_avail_nr drops by exactly one (the new netns);
> + * - ve.mnt_nr is strictly greater than zero (mounts copied into the
> + * new mntns are accounted to the new ve).
> + *
> + * We never assert against the parent ve's counters: those are shared
> + * with everything else on the host (systemd, container managers, ...)
> + * and observing them is racy.
> + */
> +#define _GNU_SOURCE
> +#include <linux/sched.h>
> +#include <linux/mount.h>
> +#include <sched.h>
> +#include <sys/wait.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +#include <asm/unistd.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <fcntl.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/mount.h>
> +#include <linux/limits.h>
> +#include <errno.h>
> +
> +#include "../kselftest_harness.h"
> +
> +#define CTID_MIN 108
> +#define CTID_MAX 200
> +
> +#ifndef CLONE_NEWVE
> +#define CLONE_NEWVE 0x00000040
> +#endif
> +
> +/*
> + * Make ve.netns_avail_nr movements easy to detect: a small cap means
> + * any spurious accounting against the parent ve would overflow it.
> + */
> +#define VE_NETNS_MAX 3
> +
> +static int write_file_at(int dirfd, const char *path, const char *val)
> +{
> + int fd, ret;
> + size_t len = strlen(val);
> +
> + fd = openat(dirfd, path, O_WRONLY);
> + if (fd < 0)
> + return -1;
> +
> + ret = write(fd, val, len);
> + close(fd);
> + return (ret == (int)len) ? 0 : -1;
> +}
> +
> +static int read_u64_at(int dirfd, const char *path, unsigned long long *out)
> +{
> + char buf[32] = {0};
> + int fd, ret;
> +
> + fd = openat(dirfd, path, O_RDONLY);
> + if (fd < 0)
> + return -1;
> +
> + ret = read(fd, buf, sizeof(buf) - 1);
> + close(fd);
> + if (ret <= 0)
> + return -1;
> +
> + *out = strtoull(buf, NULL, 10);
> + return 0;
> +}
> +
> +static int mount_cg2_fd(void)
> +{
> + int fs_fd, mnt_fd;
> +
> + fs_fd = syscall(__NR_fsopen, "cgroup2", 0);
> + if (fs_fd < 0)
> + return -1;
> +
> + if (syscall(__NR_fsconfig, fs_fd, FSCONFIG_CMD_CREATE,
> + NULL, NULL, 0) < 0) {
> + close(fs_fd);
> + return -1;
> + }
> +
> + mnt_fd = syscall(__NR_fsmount, fs_fd, 0, 0);
> + close(fs_fd);
> + return mnt_fd;
> +}
> +
> +static int enter_cgroup(int cgv2_fd, int ctid)
> +{
> + char cg_path[64];
> + char pid_str[64];
> + int fd;
> + int ret;
> +
> + if (ctid)
> + snprintf(cg_path, sizeof(cg_path), "%d/cgroup.procs", ctid);
> + else
> + snprintf(cg_path, sizeof(cg_path), "cgroup.procs");
> + fd = openat(cgv2_fd, cg_path, O_WRONLY);
> + if (fd < 0)
> + return -1;
> +
> + snprintf(pid_str, sizeof(pid_str), "%d", getpid());
> + ret = write(fd, pid_str, strlen(pid_str));
> + if (ret < 0 || ret != (int)strlen(pid_str))
> + ret = -1;
> +
> + close(fd);
> + return ret;
> +}
> +
> +/*
> + * Synchronisation across the clone() boundary: child does its setup,
> + * tells parent it is ready, then blocks until parent acknowledges.
> + */
> +struct sync_pipes {
> + int child_to_parent[2];
> + int parent_to_child[2];
> +};
> +
> +static int sync_pipes_init(struct sync_pipes *s)
> +{
> + if (pipe(s->child_to_parent) < 0)
> + return -1;
> + if (pipe(s->parent_to_child) < 0) {
> + close(s->child_to_parent[0]);
> + close(s->child_to_parent[1]);
> + return -1;
> + }
> + return 0;
> +}
> +
> +static void sync_pipes_close(struct sync_pipes *s)
> +{
> + close(s->child_to_parent[0]);
> + close(s->child_to_parent[1]);
> + close(s->parent_to_child[0]);
> + close(s->parent_to_child[1]);
> +}
> +
> +/*
> + * Per-test context shared between parent and child.
> + */
> +struct clone_args_ctx {
> + int cgv2_fd;
> + int ctid;
> + struct sync_pipes sync;
> +};
> +
> +/*
> + * Child of the clone3 test. The interesting work (creating the new
> + * ve / netns / mntns and accounting them to the new ve) was done by
> + * the kernel during clone3 itself, so the child only needs to keep
> + * those namespaces alive while the parent inspects ve.* counters.
> + */
> +static int clone_child_func(void *arg)
> +{
> + struct clone_args_ctx *ctx = arg;
> + char ack;
> +
> + close(ctx->sync.child_to_parent[0]);
> + close(ctx->sync.parent_to_child[1]);
> +
> + if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
> + _exit(11);
> + if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
> + _exit(12);
> +
> + _exit(0);
> +}
> +
> +/*
> + * Before fix:
> + * - clone path: ve.netns_avail_nr stays at VE_NETNS_MAX and
> + * ve.mnt_nr stays at 0 because copy_net_ns()/copy_mnt_ns()
> + * charged the parent ve via get_exec_env().
> + * - unshare path: the syscall itself returned -EINVAL, so this
> + * check was unreachable.
> + *
> + * After fix: the new ve is charged for the netns and for every mount
> + * copy_tree() puts into the new mntns.
> + */
> +static void check_new_ve_owner(struct __test_metadata *_metadata,
> + int cgv2_fd, int ctid)
> +{
> + unsigned long long avail, mnt;
> + char path[64];
> +
> + snprintf(path, sizeof(path), "%d/ve.netns_avail_nr", ctid);
> + ASSERT_EQ(read_u64_at(cgv2_fd, path, &avail), 0);
> + EXPECT_EQ(avail, VE_NETNS_MAX - 1);
> +
> + snprintf(path, sizeof(path), "%d/ve.mnt_nr", ctid);
> + ASSERT_EQ(read_u64_at(cgv2_fd, path, &mnt), 0);
> + EXPECT_GT(mnt, 0);
> +}
> +
> +FIXTURE(ve_ns_owner)
> +{
> + int cgv2_fd;
> + int ctid;
> +};
> +
> +FIXTURE_SETUP(ve_ns_owner)
> +{
> + unsigned long long initial_mnt_nr;
> + char ctid_str[16];
> + char val[16];
> + char path[64];
> +
> + self->cgv2_fd = mount_cg2_fd();
> + ASSERT_GE(self->cgv2_fd, 0);
> +
> + ASSERT_EQ(write_file_at(self->cgv2_fd, "cgroup.subtree_control",
> + "+cpuset +cpu +cpuacct +io +memory +hugetlb +pids +rdma +misc +ve"), 0);
> +
> + ASSERT_EQ(write_file_at(self->cgv2_fd,
> + "ve.default_sysfs_permissions", "/ rx"), 0);
> + ASSERT_EQ(write_file_at(self->cgv2_fd,
> + "ve.default_sysfs_permissions", "fs rx"), 0);
> + ASSERT_EQ(write_file_at(self->cgv2_fd,
> + "ve.default_sysfs_permissions", "fs/cgroup rw"), 0);
> +
> + self->ctid = CTID_MIN;
> + while (self->ctid < CTID_MAX) {
> + snprintf(ctid_str, sizeof(ctid_str), "%d", self->ctid);
> + if (faccessat(self->cgv2_fd, ctid_str, F_OK, 0) != 0 &&
> + errno == ENOENT)
> + break;
> + self->ctid++;
> + }
> + ASSERT_LT(self->ctid, CTID_MAX);
> +
> + ASSERT_EQ(mkdirat(self->cgv2_fd, ctid_str, 0755), 0);
> +
> + snprintf(path, sizeof(path), "%d/cgroup.controllers_hidden", self->ctid);
> + ASSERT_EQ(write_file_at(self->cgv2_fd, path, "-ve"), 0);
> +
> + /*
> + * ve.veid and ve.features are deliberately not configured: the
> + * tests do not call ve.state=START, so a real veid identity and
> + * feature mask are not needed. The owner_ve accounting we are
> + * checking happens during clone3()/unshare() regardless.
> + *
> + * Cap the new ve's netns count so we can detect a single new
> + * netns being accounted to it. We only assert against the new
> + * ve's counter; the parent ve's counter is shared with the rest
> + * of the host (systemd, container managers, ...) and is racy.
> + */
> + snprintf(path, sizeof(path), "%d/ve.netns_max_nr", self->ctid);
> + snprintf(val, sizeof(val), "%d", VE_NETNS_MAX);
> + ASSERT_EQ(write_file_at(self->cgv2_fd, path, val), 0);
> +
> + /*
> + * The new ve cgroup has not been entered by anything yet, so its
> + * mnt_nr counter must start at 0. Each test below verifies that
> + * the clone/unshare populates the new mntns under this ve, i.e.
> + * mnt_nr rises strictly above zero.
> + */
> + snprintf(path, sizeof(path), "%d/ve.mnt_nr", self->ctid);
> + ASSERT_EQ(read_u64_at(self->cgv2_fd, path, &initial_mnt_nr), 0);
> + ASSERT_EQ(initial_mnt_nr, 0);
> +};
> +
> +FIXTURE_TEARDOWN(ve_ns_owner)
> +{
> + char path[64];
> +
> + enter_cgroup(self->cgv2_fd, 0);
> + snprintf(path, sizeof(path), "%d", self->ctid);
> + unlinkat(self->cgv2_fd, path, AT_REMOVEDIR);
> + close(self->cgv2_fd);
> +}
> +
> +/*
> + * clone3(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS): the new netns and
> + * mntns must be accounted to the *new* ve, not to the parent ve.
> + *
> + * Before the fix, copy_net_ns()/copy_mnt_ns() looked up the owning ve
> + * via get_exec_env(), which inside copy_namespaces() still resolved to
> + * the parent ve - copy_ve_ns() had stored the new ve on the child but
> + * the running task was still the parent. After the fix copy_namespaces()
> + * forwards tsk->task_ve to both helpers and the right ve is charged.
> + */
> +TEST_F(ve_ns_owner, clone_newve_newnet_newns)
> +{
> + struct clone_args_ctx ctx = {
> + .cgv2_fd = self->cgv2_fd,
> + .ctid = self->ctid,
> + };
> + struct clone_args cargs = {
> + .flags = CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS,
> + .exit_signal = SIGCHLD,
> + };
> + char ready;
> + int status;
> + pid_t pid;
> +
> + ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
> + ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
> +
> + pid = syscall(__NR_clone3, &cargs, sizeof(cargs));
> + ASSERT_GE(pid, 0);
> + if (pid == 0)
> + clone_child_func(&ctx);
> +
> + close(ctx.sync.child_to_parent[1]);
> + close(ctx.sync.parent_to_child[0]);
> +
> + ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
> +
> + /* Wait for the child to be settled before measuring. */
> + ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
> + ASSERT_EQ(ready, 'R');
> +
> + check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
> +
> + /* Release the child. */
> + ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
> + ASSERT_GE(waitpid(pid, &status, 0), 0);
> + ASSERT_TRUE(WIFEXITED(status));
> + ASSERT_EQ(WEXITSTATUS(status), 0);
> +
> + sync_pipes_close(&ctx.sync);
> +}
> +
> +/*
> + * unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) used to be
> + * rejected with -EINVAL by check_unshare_flags() because of the same
> + * "wrong owner_ve" problem. After the fix the syscall succeeds and
> + * the resulting net/mnt namespaces are owned by the new ve.
> + *
> + * The accounting check is identical to the clone3 case and uses the
> + * same check_new_ve_owner() helper.
> + */
> +static int unshare_child_func(struct clone_args_ctx *ctx)
> +{
> + char ack;
> +
> + close(ctx->sync.child_to_parent[0]);
> + close(ctx->sync.parent_to_child[1]);
> +
> + if (unshare(CLONE_NEWVE | CLONE_NEWNET | CLONE_NEWNS) < 0)
> + return 21;
> +
> + if (write(ctx->sync.child_to_parent[1], "R", 1) != 1)
> + return 22;
> + if (read(ctx->sync.parent_to_child[0], &ack, 1) != 1)
> + return 23;
> +
> + return 0;
> +}
> +
> +TEST_F(ve_ns_owner, unshare_newve_newnet_newns)
> +{
> + struct clone_args_ctx ctx = {
> + .cgv2_fd = self->cgv2_fd,
> + .ctid = self->ctid,
> + };
> + char ready;
> + int status;
> + pid_t pid;
> +
> + ASSERT_EQ(sync_pipes_init(&ctx.sync), 0);
> + ASSERT_GE(enter_cgroup(self->cgv2_fd, self->ctid), 0);
> +
> + /*
> + * Use a plain fork: the unshare(CLONE_NEWVE) is performed by
> + * the child itself, since unshare requires single-threaded
> + * context and we don't want to permanently move the test
> + * process into a new ve.
> + */
> + pid = fork();
> + ASSERT_GE(pid, 0);
> + if (pid == 0)
> + _exit(unshare_child_func(&ctx));
> +
> + close(ctx.sync.child_to_parent[1]);
> + close(ctx.sync.parent_to_child[0]);
> +
> + ASSERT_GE(enter_cgroup(self->cgv2_fd, 0), 0);
> +
> + /*
> + * If the child failed before writing 'R' the pipe read returns
> + * 0 (EOF). Reap the child first so the assertion message
> + * carries the meaningful exit code rather than a generic
> + * "expected 1, got 0".
> + */
> + ASSERT_EQ(read(ctx.sync.child_to_parent[0], &ready, 1), 1);
> + ASSERT_EQ(ready, 'R');
> +
> + check_new_ve_owner(_metadata, self->cgv2_fd, self->ctid);
> +
> + ASSERT_EQ(write(ctx.sync.parent_to_child[1], "G", 1), 1);
> + ASSERT_GE(waitpid(pid, &status, 0), 0);
> + ASSERT_TRUE(WIFEXITED(status));
> + ASSERT_EQ(WEXITSTATUS(status), 0);
> +
> + sync_pipes_close(&ctx.sync);
> +}
> +
> +TEST_HARNESS_MAIN
More information about the Devel
mailing list