[Devel] [PATCH VZ7] ve: rework task_ve pointer reset for zombies

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Mon Jan 27 09:46:24 MSK 2025


please disregard, there would be v2

On 1/27/25 13:39, Pavel Tikhomirov wrote:
> We don't have to do the reset for "normal" tasks (basically any task
> inside container process tree except init). As they hold indirect
> reference to ve via container init process, which will reap them before
> releasing its reference to ve. So let's not do it.
> 
> We see that vzps does not consider such "normal" zombies a part of
> container, if we drop task_ve for them. Thus this leads to customer
> confusion as those zombies look like they've escaped the container,
> which may suggest that other non-zombie processes can also do the same
> and that there is a security problem.
> 
> Also this fixes proc_exit_connector (which uses task_ve and runs just
> after cgroup_exit), in VE as now for "normall" processes it would now
> correctly send "exit" events to VE.
> 
> https://virtuozzo.atlassian.net/browse/ASUP-836
> https://virtuozzo.atlassian.net/browse/PSBM-160298
> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> ---
>   kernel/ve/ve.c | 27 +++++++++++++++++++++------
>   1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index ff81fce4adcd0..928f4162f2bd8 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -1131,21 +1131,36 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>    *
>    * We need to make sure that this task's ->task_ve is not pointing to the
>    * container ve before putting the reference, else the use of ->task_ve from
> - * the task can lead to use-after-free (e.g. reading /proc/pid/stat).
> + * the task can lead to use-after-free (e.g. when reading /proc/pid/stat).
>    *
> - * One exception to that rule is the init process of the container, it needs
> - * ->task_ve in ve_stop_ns / ve_exit_ns, and holds a reference to ve_struct.
> + * This is not a problem for a "normal" task (basically any task inside
> + * container process tree except init), as such a "normal" task will be reaped
> + * in zap_pid_ns_processes by the init, before init releases it's own reference
> + * to ve.
> + *
> + * So we have two cases:
> + * - init is handled in ve_exit_ns;
> + * - "external" task (the task which does not resolve to a valid pid in
> + *   container pid namespace, and thus there is no guaranty that it would be
> + *   reaped in time) is handled below.
>    */
>   static void ve_exit(struct cgroup *cg, struct cgroup *oldcg, struct task_struct *task)
>   {
>   	struct ve_struct *ve = task->task_ve;
>   
>   	/*
> -	 * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context
> +	 * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context.
>   	 */
>   	rcu_read_lock();
> -	if (!ve->ve_ns || ve->init_task != task)
> -		rcu_assign_pointer(task->task_ve, &ve0);
> +	if (!ve->ve_ns || ve->init_task != task) {
> +		/*
> +		 * If task resolves to pid zero in ve pidns,
> +		 * it is outside ve process tree.
> +		 */
> +		if (!pid_nr_ns(rcu_dereference(task->pids[PIDTYPE_PID].pid),
> +			       ve->ve_ns->pid_ns))
> +			rcu_assign_pointer(task->task_ve, &ve0);
> +	}
>   	rcu_read_unlock();
>   }
>   

-- 
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.



More information about the Devel mailing list