[Devel] [PATCH VZ8 v2] cgroup/ve: moved ve workqueue stop to ve_stop_ns

Tue Apr 20 17:04:33 MSK 2021

On 20.04.2021 15:51, Valeriy Vdovin wrote:
> Fixes: a01c704c0c3f21ec3e3e7540e91682f44f4065cd
> The above commit fixed the bug which can be described as follows:
> - It is possible that functions ve_stop_ns and cgroup1_release_agent will
>   be executed in parallel.
> - cgroup1_release_agent, depends on ve->is_running, which order-depends on
>   umh_worker state (working/stopped). If it sees that ve->is_running then
>   it will spawn a userspace process.
> - this logic can easily fail as ve_stop_ns can set is_running to false and
>   stop umh_worker in the scheduling window of cgroup1_release_agent.
>   cgroup1_release_agent has read ve->is_running as true and went to a long
>   sleep, during which ve_stop_ns worked and set ve->is_running to 0 and
>   stopped umh_worker as well. When cgroup1_relase_agent wakes up, it
>   doesn't not know is_running is already 0 and will just try to spawn a
>   new userspace program via umh_worker.
> - The stopped umh_worker has no way to signal to te caller that it will not
>   exectue the task. Instead it just hungs.
> - The fix then would be to eliminate data race, between ve_stop_ns and
>   cgroup1_release_agent by locking ve->op_sem for the whole period of
>   checking ve->is_running flag and subsequent spawn of userpace process via
>   still running umh_worker. This way ve_stop_ns will not be able to proceed
>   past ve->is_running = 0 to stop the umh_worker until the userspace
>   process is terminated.
> 
> The fix has resolved the data race between ve_stop_ns and
> cgroup1_release_agent. Unfortunately it has also introduced a new deadlock:
>  - Suppose that ve_stop_ns has already finished and returned and now
>    ve_exit_ns starts.
>  - First ve_exit_ns will acquire ve->op_sem.
>  - Then ve_exit_ns will call cgroup_unmark_ve_roots. Inside of
>    cgroup_unmark_ve_roots ve_workqueue will be flushed.
>  - Workqueue flush is a blocking operation which will block with ve->op_sem
>    acquired.
>  - Also workqueue flush will block until all cgroup1_release_agent work is
>    done.
>  - cgroup1_release_agent will not finish because it will also try to
>    acquire ve->op_sem.
>  - We have a deadlock between ve_exit_ns and cgroup1_release_agent now.
> 
> How do we fix that so that both ve_stop_ns and ve_exit_ns did not have
> synchronization issues?
> We should not have ve workqueue working during ve_exit_ns. It has no way of
> spawning userspace processes anyways since ve_stop_ns has stopped
> umh_worker.  Ideally, ve workqueue should be stopped before umh_worker.
> That's why we move ve workqueue stopping code to ve_stop_ns after
> ve->is_running = 0, but before ve_stop_umh.
> 
> Now because ve workqueue is guaranteed to work only when umh_worker also
> works, we can remove 'if (ve-is_running)' check from cgroup1_release_agent
> togather with ve->op_sem locking code.
> 
> One issue left to discuss here is that the newly emptied cgroups will try to
> add themselves into the release workqueue via ve_add_to_release_list. But for
> now it seems sufficient to rely on ve->is_running to stop adding to
> release_list after is_running is 0.
> To synchronize ve->is_running and destruction of workqueue, we add
> synchronize_rcu right after is_running is set to zero. This way all possible
> tasks that see is_running == 1, will be forced to scheduled out from
> ve_add_to_release_list before stop_container code continues to workqueue
> destruction. The needed schedule out will happen at rcu_read_unlock which is
> already a safe place as the release work is already in the workqueue by this
> time.
> 
> What's left is to remove flush_workqueue code from unmark_ve_roots, because
> it's too late to do the flush when workqueue is already stopped long ago.
> 
> This way we get this FIXED sequence of actions at container stop:
> - Container is being stopped and we get into ve_stop_ns
> - ve_stop_ns sets is_running = 0
> - ve_stop_ns calls synchronize_rcu
> --- all cgroups that see is_running == 1 get into workqueue
> --- all cgroups that see is_running == 0 do not get into workqueue
> --- cgroup1_release_agent workqueue runs normally without need to check
>     is_running as long there is work to do.
> --- cgroup1_release_agent does not need to check umh_thread is running
>     because is this work is running, then umh runs as well.
> - ve_stop_ns calls ve_workqueue_stop->destroy_workqueue.
> --- cgroup1_release_agent is called in a sequence during workqueue drain
> --- all release_agent works are completed
> - ve_stop_ns calls ve_stop_umh. No workqueue is running, so it's totally
>   safe to do so.
> - ve_exit_ns calls cgroup_unmark_ve_roots which is also safe.
> 
> https://jira.sw.ru/browse/PSBM-127457
> Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>

Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>

> ---
> v2: - removed release_list_allow field
>     - modified commit message a bit
>     - added synchronize_rcu
> 
>  kernel/cgroup/cgroup-v1.c |  6 ------
>  kernel/cgroup/cgroup.c    |  9 ---------
>  kernel/ve/ve.c            | 23 +++++++++++++++++++++--
>  3 files changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index cd1a0df6c528..f6ef1f45383f 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -936,10 +936,6 @@ void cgroup1_release_agent(struct work_struct *work)
>  
>  		mutex_unlock(&cgroup_mutex);
>  
> -		down_write(&ve->op_sem);
> -		if (!ve->is_running)
> -			goto continue_with_mutex;
> -
>  		err = call_usermodehelper_ve(ve, argv[0], argv,
>  			envp, UMH_WAIT_EXEC);
>  
> @@ -947,8 +943,6 @@ void cgroup1_release_agent(struct work_struct *work)
>  			pr_warn_ratelimited("cgroup1_release_agent "
>  					    "%s %s failed: %d\n",
>  					    agentbuf, pathbuf, err);
> -continue_with_mutex:
> -		up_write(&ve->op_sem);
>  		mutex_lock(&cgroup_mutex);
>  continue_free:
>  		kfree(pathbuf);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 779a71bdbaef..b02c88063a27 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2072,15 +2072,6 @@ void cgroup_unmark_ve_roots(struct ve_struct *ve)
>  	}
>  	/* ve_owner == NULL will be visible */
>  	synchronize_rcu();
> -
> -	/*
> -	 * Anyone already waiting in this wq to execute
> -	 * cgroup_release_agent doesn't know that ve_owner is NULL,
> -	 * but we can wait for all of them at flush_workqueue.
> -	 * After it is complete no other cgroup can seep through
> -	 * to this ve's workqueue, so it's safe to shutdown ve.
> -	 */
> -	flush_workqueue(ve->wq);
>  }
>  
>  struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 401f5e870394..f164e2d426a6 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -531,8 +531,17 @@ static int ve_workqueue_start(struct ve_struct *ve)
>  static void ve_workqueue_stop(struct ve_struct *ve)
>  {
>  	destroy_workqueue(ve->wq);
> +	ve->wq = NULL;
>  }
>  
> +/*
> + * ve_add_to_release_list - called from cgroup1_check_for_release to put a
> + * cgroup into a release workqueue. There are two codepaths that lead to this
> + * function. One starts from cgroup_exit() which holds css_set_lock, another
> + * one from cgroup_destroy_locked which does not hold css_set_lock. So we
> + * should not use any reschedulable
> + *
> + */
>  void ve_add_to_release_list(struct cgroup *cgrp)
>  {
>  	struct ve_struct *ve;
> @@ -541,6 +550,10 @@ void ve_add_to_release_list(struct cgroup *cgrp)
>  
>  	rcu_read_lock();
>  	ve = cgroup_get_ve_owner(cgrp);
> +	if (!ve->is_running) {
> +		rcu_read_unlock();
> +		return;
> +	}
>  
>  	spin_lock_irqsave(&ve->release_list_lock, flags);
>  	if (!cgroup_is_dead(cgrp) &&
> @@ -709,6 +722,14 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
>  	 * ve_mutex works as barrier for ve_can_attach().
>  	 */
>  	ve->is_running = 0;
> +	synchronize_rcu();
> +
> +	/*
> +	 * release_agent works on top of umh_worker, so we must make sure, that
> +	 * ve workqueue is stopped first.
> +	 */
> +	ve_workqueue_stop(ve);
> +
>  	/*
>  	 * Neither it can be in pseudosuper state
>  	 * anymore, setup it again if needed.
> @@ -744,8 +765,6 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
>  
>  	cgroup_unmark_ve_roots(ve);
>  
> -	ve_workqueue_stop(ve);
> -
>  	ve_cleanup_per_cgroot_data(ve, NULL);
>  
>  	/*
>