[Devel] [PATCH VZ7] ve: rework task_ve pointer reset for zombies
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Mon Jan 27 08:39:55 MSK 2025
We don't have to do the reset for "normal" tasks (basically any task
inside container process tree except init). As they hold indirect
reference to ve via container init process, which will reap them before
releasing its reference to ve. So let's not do it.
We see that vzps does not consider such "normal" zombies a part of
container, if we drop task_ve for them. Thus this leads to customer
confusion as those zombies look like they've escaped the container,
which may suggest that other non-zombie processes can also do the same
and that there is a security problem.
Also this fixes proc_exit_connector (which uses task_ve and runs just
after cgroup_exit), in VE as now for "normall" processes it would now
correctly send "exit" events to VE.
https://virtuozzo.atlassian.net/browse/ASUP-836
https://virtuozzo.atlassian.net/browse/PSBM-160298
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
kernel/ve/ve.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index ff81fce4adcd0..928f4162f2bd8 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -1131,21 +1131,36 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
*
* We need to make sure that this task's ->task_ve is not pointing to the
* container ve before putting the reference, else the use of ->task_ve from
- * the task can lead to use-after-free (e.g. reading /proc/pid/stat).
+ * the task can lead to use-after-free (e.g. when reading /proc/pid/stat).
*
- * One exception to that rule is the init process of the container, it needs
- * ->task_ve in ve_stop_ns / ve_exit_ns, and holds a reference to ve_struct.
+ * This is not a problem for a "normal" task (basically any task inside
+ * container process tree except init), as such a "normal" task will be reaped
+ * in zap_pid_ns_processes by the init, before init releases it's own reference
+ * to ve.
+ *
+ * So we have two cases:
+ * - init is handled in ve_exit_ns;
+ * - "external" task (the task which does not resolve to a valid pid in
+ * container pid namespace, and thus there is no guaranty that it would be
+ * reaped in time) is handled below.
*/
static void ve_exit(struct cgroup *cg, struct cgroup *oldcg, struct task_struct *task)
{
struct ve_struct *ve = task->task_ve;
/*
- * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context
+ * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context.
*/
rcu_read_lock();
- if (!ve->ve_ns || ve->init_task != task)
- rcu_assign_pointer(task->task_ve, &ve0);
+ if (!ve->ve_ns || ve->init_task != task) {
+ /*
+ * If task resolves to pid zero in ve pidns,
+ * it is outside ve process tree.
+ */
+ if (!pid_nr_ns(rcu_dereference(task->pids[PIDTYPE_PID].pid),
+ ve->ve_ns->pid_ns))
+ rcu_assign_pointer(task->task_ve, &ve0);
+ }
rcu_read_unlock();
}
--
2.48.1
More information about the Devel
mailing list