[Devel] [PATCH v2 VZ9] ve: rework task_ve pointer reset for zombies

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Mon Jan 27 10:36:22 MSK 2025


We don't have to do the reset for "normal" tasks (basically any task
inside container process tree except init). As they hold indirect
reference to ve via container init process, which will reap them before
releasing its reference to ve. So let's not do it.

We see that vzps does not consider such "normal" zombies a part of
container, if we drop task_ve for them. Thus this leads to customer
confusion as those zombies look like they've escaped the container,
which may suggest that other non-zombie processes can also do the same
and that there is a security problem.

Also this fixes proc_exit_connector (which uses task_ve and runs just
after cgroup_exit), in VE as now for "normall" processes it would now
correctly send "exit" events to VE.

https://virtuozzo.atlassian.net/browse/ASUP-836
https://virtuozzo.atlassian.net/browse/PSBM-160298
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
v2: no need to check for init as it's pid should also resolve to ve
pid namespace, at the same time fix crash in resolve as it needs non
zero ve_ns.
---
 kernel/ve/ve.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index ec14691970d6e..03f230f988d39 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -32,6 +32,7 @@
 #include <linux/ctype.h>
 #include <linux/tty.h>
 #include <linux/device.h>
+#include <linux/sched.h>
 
 #include <uapi/linux/vzcalluser.h>
 #include <net/rtnetlink.h>
@@ -1215,20 +1216,33 @@ static void ve_attach(struct cgroup_taskset *tset)
  *
  * We need to make sure that this task's ->task_ve is not pointing to the
  * container ve before putting the reference, else the use of ->task_ve from
- * the task can lead to use-after-free (e.g. reading /proc/pid/stat).
+ * the task can lead to use-after-free (e.g. when reading /proc/pid/stat).
  *
- * One exception to that rule is the init process of the container, it needs
- * ->task_ve in ve_stop_ns / ve_exit_ns, and holds a reference to ve_struct.
+ * This is not a problem for a "normal" task (basically any task inside
+ * container process tree except init), as such a "normal" task will be reaped
+ * in zap_pid_ns_processes by the init, before init releases it's own reference
+ * to ve.
+ *
+ * So we have two cases:
+ * - init is handled in ve_exit_ns, unless there is no ve_ns;
+ * - "external" task (the task which does not resolve to a valid pid in
+ *   container pid namespace, and thus there is no guaranty that it would be
+ *   reaped in time) is handled below.
  */
 static void ve_exit(struct task_struct *task)
 {
 	struct ve_struct *ve = task->task_ve;
 
 	/*
-	 * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context
+	 * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context.
 	 */
 	rcu_read_lock();
-	if (!ve->ve_ns || ve->ve_ns->pid_ns_for_children->child_reaper != task)
+	/*
+	 * Clear task_ve if ve has no namespaces (ve is starting, stopped or
+	 * stopping), or in case of "external" task.
+	 */
+	if (!ve->ve_ns ||
+	    !task_pid_nr_ns(task, ve->ve_ns->pid_ns_for_children))
 		rcu_assign_pointer(task->task_ve, &ve0);
 	rcu_read_unlock();
 }
-- 
2.48.1



More information about the Devel mailing list