[Devel] [PATCH VZ7] ve: rework task_ve pointer reset for zombies
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Mon Jan 27 09:46:24 MSK 2025
please disregard, there would be v2
On 1/27/25 13:39, Pavel Tikhomirov wrote:
> We don't have to do the reset for "normal" tasks (basically any task
> inside container process tree except init). As they hold indirect
> reference to ve via container init process, which will reap them before
> releasing its reference to ve. So let's not do it.
>
> We see that vzps does not consider such "normal" zombies a part of
> container, if we drop task_ve for them. Thus this leads to customer
> confusion as those zombies look like they've escaped the container,
> which may suggest that other non-zombie processes can also do the same
> and that there is a security problem.
>
> Also this fixes proc_exit_connector (which uses task_ve and runs just
> after cgroup_exit), in VE as now for "normall" processes it would now
> correctly send "exit" events to VE.
>
> https://virtuozzo.atlassian.net/browse/ASUP-836
> https://virtuozzo.atlassian.net/browse/PSBM-160298
> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> ---
> kernel/ve/ve.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index ff81fce4adcd0..928f4162f2bd8 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -1131,21 +1131,36 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
> *
> * We need to make sure that this task's ->task_ve is not pointing to the
> * container ve before putting the reference, else the use of ->task_ve from
> - * the task can lead to use-after-free (e.g. reading /proc/pid/stat).
> + * the task can lead to use-after-free (e.g. when reading /proc/pid/stat).
> *
> - * One exception to that rule is the init process of the container, it needs
> - * ->task_ve in ve_stop_ns / ve_exit_ns, and holds a reference to ve_struct.
> + * This is not a problem for a "normal" task (basically any task inside
> + * container process tree except init), as such a "normal" task will be reaped
> + * in zap_pid_ns_processes by the init, before init releases it's own reference
> + * to ve.
> + *
> + * So we have two cases:
> + * - init is handled in ve_exit_ns;
> + * - "external" task (the task which does not resolve to a valid pid in
> + * container pid namespace, and thus there is no guaranty that it would be
> + * reaped in time) is handled below.
> */
> static void ve_exit(struct cgroup *cg, struct cgroup *oldcg, struct task_struct *task)
> {
> struct ve_struct *ve = task->task_ve;
>
> /*
> - * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context
> + * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context.
> */
> rcu_read_lock();
> - if (!ve->ve_ns || ve->init_task != task)
> - rcu_assign_pointer(task->task_ve, &ve0);
> + if (!ve->ve_ns || ve->init_task != task) {
> + /*
> + * If task resolves to pid zero in ve pidns,
> + * it is outside ve process tree.
> + */
> + if (!pid_nr_ns(rcu_dereference(task->pids[PIDTYPE_PID].pid),
> + ve->ve_ns->pid_ns))
> + rcu_assign_pointer(task->task_ve, &ve0);
> + }
> rcu_read_unlock();
> }
>
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
More information about the Devel
mailing list