[Devel] [PATCH vz8] cgroup/ve: moved ve workqueue stop to ve_stop_ns
Kirill Tkhai
ktkhai at virtuozzo.com
Mon Apr 19 10:54:34 MSK 2021
On 07.04.2021 00:18, Valeriy Vdovin wrote:
> Fixes: a01c704c0c3f21ec3e3e7540e91682f44f4065cd
> The above commit fixed the bug which can be described as follows:
> - It is possible that functions ve_stop_ns and cgroup1_release_agent will
> be executed in parallel.
> - cgroup1_release_agent depends on ve->is_running, which in turn depends on
> the umh_worker state (working/stopped). If it sees that ve->is_running is
> set, it will spawn a userspace process.
> - This logic can easily fail, as ve_stop_ns can set is_running to false and
> stop umh_worker within the scheduling window of cgroup1_release_agent:
> cgroup1_release_agent reads ve->is_running as true and goes to a long
> sleep, during which ve_stop_ns runs, sets ve->is_running to 0 and
> stops umh_worker as well. When cgroup1_release_agent wakes up, it
> does not know that is_running is already 0 and just tries to spawn a
> new userspace program via umh_worker.
> - The stopped umh_worker has no way to signal to the caller that it will not
> execute the task. Instead, the caller just hangs.
> - The fix then would be to eliminate data race, between ve_stop_ns and
> cgroup1_release_agent by locking ve->op_sem for the whole period of
> checking the ve->is_running flag and the subsequent spawn of a userspace
> process via the
> still running umh_worker. This way ve_stop_ns will not be able to proceed
> past ve->is_running = 0 to stop the umh_worker until the userspace
> process is terminated.
>
> The fix resolved the data race between ve_stop_ns and
> cgroup1_release_agent. Unfortunately, it also introduced a new deadlock:
> - Suppose that ve_stop_ns has already finished and returned and now
> ve_exit_ns starts.
> - First ve_exit_ns will acquire ve->op_sem.
> - Then ve_exit_ns will call cgroup_unmark_ve_roots. Inside
> cgroup_unmark_ve_roots the ve workqueue will be flushed.
> - Workqueue flush is a blocking operation which will block with ve->op_sem
> acquired.
> - Also workqueue flush will block until all cgroup1_release_agent work is
> done.
> - cgroup1_release_agent will not finish because it will also try to
> acquire ve->op_sem.
> - We have a deadlock between ve_exit_ns and cgroup1_release_agent now.
>
> How do we fix this so that neither ve_stop_ns nor ve_exit_ns has
> synchronization issues?
> We should not have the ve workqueue running during ve_exit_ns. It has no way
> of spawning userspace processes anyway, since ve_stop_ns has stopped
> umh_worker. Ideally, the ve workqueue should be stopped before umh_worker.
> That's why we move ve workqueue stopping code to ve_stop_ns after
> ve->is_running = 0, but before ve_stop_umh.
>
> Now, because the ve workqueue is guaranteed to run only while umh_worker is
> also running, we can remove the 'if (ve->is_running)' check from
> cgroup1_release_agent together with the ve->op_sem locking code.
>
> One issue left is that the newly emptied cgroups can still get into the
> workqueue via ve_add_to_release_list. This is a big issue because following
> ve_stop_ns many container processes will start dying and their cgroups will
> become empty and they will be added to ve->release_list.
> cgroup1_release_agent, among other things, was responsible for removing
> cgroups from this list at ve_exit_ns, and now this cleanup is impossible.
> To fix this, we need a flag release_list_allow that will be set to false in
> ve_stop_ns, so that ve_add_to_release_list can reject all cgroups that try
> to get on the list after ve_stop_ns.
>
> What's left is to remove the flush_workqueue call from
> cgroup_unmark_ve_roots, because it is too late to do the flush when the
> workqueue was already stopped long ago.
>
> https://jira.sw.ru/browse/PSBM-127457
> Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
> ---
> include/linux/ve.h | 1 +
> kernel/cgroup/cgroup-v1.c | 6 ------
> kernel/cgroup/cgroup.c | 9 ---------
> kernel/ve/ve.c | 24 ++++++++++++++++++++++--
> 4 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/ve.h b/include/linux/ve.h
> index 3b487f8a4a50..041538e11851 100644
> --- a/include/linux/ve.h
> +++ b/include/linux/ve.h
> @@ -109,6 +109,7 @@ struct ve_struct {
> * cgroups, that want to notify about becoming
> * empty, are linked to this release_list.
> */
> + bool release_list_allow;
> struct list_head release_list;
> spinlock_t release_list_lock;
>
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index cd1a0df6c528..f6ef1f45383f 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -936,10 +936,6 @@ void cgroup1_release_agent(struct work_struct *work)
>
> mutex_unlock(&cgroup_mutex);
>
> - down_write(&ve->op_sem);
> - if (!ve->is_running)
> - goto continue_with_mutex;
> -
> err = call_usermodehelper_ve(ve, argv[0], argv,
> envp, UMH_WAIT_EXEC);
>
> @@ -947,8 +943,6 @@ void cgroup1_release_agent(struct work_struct *work)
> pr_warn_ratelimited("cgroup1_release_agent "
> "%s %s failed: %d\n",
> agentbuf, pathbuf, err);
> -continue_with_mutex:
> - up_write(&ve->op_sem);
> mutex_lock(&cgroup_mutex);
> continue_free:
> kfree(pathbuf);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 779a71bdbaef..b02c88063a27 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2072,15 +2072,6 @@ void cgroup_unmark_ve_roots(struct ve_struct *ve)
> }
> /* ve_owner == NULL will be visible */
> synchronize_rcu();
> -
> - /*
> - * Anyone already waiting in this wq to execute
> - * cgroup_release_agent doesn't know that ve_owner is NULL,
> - * but we can wait for all of them at flush_workqueue.
> - * After it is complete no other cgroup can seep through
> - * to this ve's workqueue, so it's safe to shutdown ve.
> - */
> - flush_workqueue(ve->wq);
> }
>
> struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 031b104075c8..15f511fe5e0c 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -81,6 +81,7 @@ struct ve_struct ve0 = {
> .release_list = LIST_HEAD_INIT(ve0.release_list),
> .release_agent_work = __WORK_INITIALIZER(ve0.release_agent_work,
> cgroup1_release_agent),
> + .release_list_allow = true,
> .per_cgroot_list = LIST_HEAD_INIT(ve0.per_cgroot_list),
> .per_cgroot_list_lock = __SPIN_LOCK_UNLOCKED(
> ve0.per_cgroot_list_lock),
> @@ -533,8 +534,17 @@ static int ve_workqueue_start(struct ve_struct *ve)
> static void ve_workqueue_stop(struct ve_struct *ve)
> {
> destroy_workqueue(ve->wq);
> + ve->wq = NULL;
> }
>
> +/*
> + * ve_add_to_release_list - called from cgroup1_check_for_release to put a
> + * cgroup into the release workqueue. There are two codepaths that lead to
> + * this function. One starts from cgroup_exit(), which holds css_set_lock;
> + * another one from cgroup_destroy_locked(), which does not hold
> + * css_set_lock. So we should not use any primitives that might sleep here.
> + */
> void ve_add_to_release_list(struct cgroup *cgrp)
> {
> struct ve_struct *ve;
> @@ -543,6 +553,10 @@ void ve_add_to_release_list(struct cgroup *cgrp)
>
> rcu_read_lock();
> ve = cgroup_get_ve_owner(cgrp);
> + if (!ve->is_running || !ve->release_list_allow) {
Are both of these checks needed? Can we use only the second of them?
> + rcu_read_unlock();
> + return;
> + }
>
> spin_lock_irqsave(&ve->release_list_lock, flags);
> if (!cgroup_is_dead(cgrp) &&
> @@ -641,6 +655,7 @@ static int ve_start_container(struct ve_struct *ve)
> goto err_mark_ve;
>
> ve->is_running = 1;
> + ve->release_list_allow = true;
>
> printk(KERN_INFO "CT: %s: started\n", ve_name(ve));
>
> @@ -712,6 +727,13 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
> * ve_mutex works as barrier for ve_can_attach().
> */
> ve->is_running = 0;
> +
> + /*
> + * release_agent works on top of umh_worker, so we must make sure, that
> + * ve workqueue is stopped first.
> + */
> + ve->release_list_allow = false;
Why don't we need to wait here for a ve_add_to_release_list() that has
already read "ve->release_list_allow == true"?
> + ve_workqueue_stop(ve);
> /*
> * Neither it can be in pseudosuper state
> * anymore, setup it again if needed.
> @@ -747,8 +769,6 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
>
> cgroup_unmark_ve_roots(ve);
>
> - ve_workqueue_stop(ve);
> -
> ve_cleanup_per_cgroot_data(ve, NULL);
>
> /*
>