[Devel] [PATCH RHEL8 COMMIT] cgroup/ve: Move ve workqueue stop to ve_stop_ns()

Tue Apr 20 19:35:01 MSK 2021

The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.18
------>
commit db899bd69d325860191ea34b91ff60d25ca56f59
Author: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
Date:   Tue Apr 20 19:35:01 2021 +0300

    cgroup/ve: Move ve workqueue stop to ve_stop_ns()
    
    Fixes: a01c704c0c3f ("cgroup/ve: Do not run release_agent on non-running ve")
    
    The above commit fixed the bug which can be described as follows:
    - It is possible that functions ve_stop_ns and cgroup1_release_agent will
      be executed in parallel.
    - cgroup1_release_agent, depends on ve->is_running, which order-depends on
      umh_worker state (working/stopped). If it sees that ve->is_running then
      it will spawn a userspace process.
    - this logic can easily fail as ve_stop_ns can set is_running to false and
      stop umh_worker in the scheduling window of cgroup1_release_agent.
      cgroup1_release_agent has read ve->is_running as true and went to a long
      sleep, during which ve_stop_ns worked and set ve->is_running to 0 and
      stopped umh_worker as well. When cgroup1_relase_agent wakes up, it
      doesn't not know is_running is already 0 and will just try to spawn a
      new userspace program via umh_worker.
    - The stopped umh_worker has no way to signal to te caller that it will not
      exectue the task. Instead it just hungs.
    - The fix then would be to eliminate data race, between ve_stop_ns and
      cgroup1_release_agent by locking ve->op_sem for the whole period of
      checking ve->is_running flag and subsequent spawn of userpace process via
      still running umh_worker. This way ve_stop_ns will not be able to proceed
      past ve->is_running = 0 to stop the umh_worker until the userspace
      process is terminated.
    
    The fix has resolved the data race between ve_stop_ns and
    cgroup1_release_agent. Unfortunately it has also introduced a new deadlock:
     - Suppose that ve_stop_ns has already finished and returned and now
       ve_exit_ns starts.
     - First ve_exit_ns will acquire ve->op_sem.
     - Then ve_exit_ns will call cgroup_unmark_ve_roots. Inside of
       cgroup_unmark_ve_roots ve_workqueue will be flushed.
     - Workqueue flush is a blocking operation which will block with ve->op_sem
       acquired.
     - Also workqueue flush will block until all cgroup1_release_agent work is
       done.
     - cgroup1_release_agent will not finish because it will also try to
       acquire ve->op_sem.
     - We have a deadlock between ve_exit_ns and cgroup1_release_agent now.
    
    How do we fix that so that both ve_stop_ns and ve_exit_ns did not have
    synchronization issues?
    We should not have ve workqueue working during ve_exit_ns. It has no way of
    spawning userspace processes anyways since ve_stop_ns has stopped
    umh_worker.  Ideally, ve workqueue should be stopped before umh_worker.
    That's why we move ve workqueue stopping code to ve_stop_ns after
    ve->is_running = 0, but before ve_stop_umh.
    
    Now because ve workqueue is guaranteed to work only when umh_worker also
    works, we can remove 'if (ve-is_running)' check from cgroup1_release_agent
    togather with ve->op_sem locking code.
    
    One issue left to discuss here is that the newly emptied cgroups will try to
    add themselves into the release workqueue via ve_add_to_release_list. But for
    now it seems sufficient to rely on ve->is_running to stop adding to
    release_list after is_running is 0.
    To synchronize ve->is_running and destruction of workqueue, we add
    synchronize_rcu right after is_running is set to zero. This way all possible
    tasks that see is_running == 1, will be forced to scheduled out from
    ve_add_to_release_list before stop_container code continues to workqueue
    destruction. The needed schedule out will happen at rcu_read_unlock which is
    already a safe place as the release work is already in the workqueue by this
    time.
    
    What's left is to remove flush_workqueue code from unmark_ve_roots, because
    it's too late to do the flush when workqueue is already stopped long ago.
    
    This way we get this FIXED sequence of actions at container stop:
    - Container is being stopped and we get into ve_stop_ns
    - ve_stop_ns sets is_running = 0
    - ve_stop_ns calls synchronize_rcu
---
 kernel/cgroup/cgroup-v1.c |  6 ------
 kernel/cgroup/cgroup.c    |  9 ---------
 kernel/ve/ve.c            | 23 +++++++++++++++++++++--
 3 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index cd1a0df6c528..f6ef1f45383f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -936,10 +936,6 @@ void cgroup1_release_agent(struct work_struct *work)
 
 		mutex_unlock(&cgroup_mutex);
 
-		down_write(&ve->op_sem);
-		if (!ve->is_running)
-			goto continue_with_mutex;
-
 		err = call_usermodehelper_ve(ve, argv[0], argv,
 			envp, UMH_WAIT_EXEC);
 
@@ -947,8 +943,6 @@ void cgroup1_release_agent(struct work_struct *work)
 			pr_warn_ratelimited("cgroup1_release_agent "
 					    "%s %s failed: %d\n",
 					    agentbuf, pathbuf, err);
-continue_with_mutex:
-		up_write(&ve->op_sem);
 		mutex_lock(&cgroup_mutex);
 continue_free:
 		kfree(pathbuf);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 779a71bdbaef..b02c88063a27 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2072,15 +2072,6 @@ void cgroup_unmark_ve_roots(struct ve_struct *ve)
 	}
 	/* ve_owner == NULL will be visible */
 	synchronize_rcu();
-
-	/*
-	 * Anyone already waiting in this wq to execute
-	 * cgroup_release_agent doesn't know that ve_owner is NULL,
-	 * but we can wait for all of them at flush_workqueue.
-	 * After it is complete no other cgroup can seep through
-	 * to this ve's workqueue, so it's safe to shutdown ve.
-	 */
-	flush_workqueue(ve->wq);
 }
 
 struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 401f5e870394..f164e2d426a6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -531,8 +531,17 @@ static int ve_workqueue_start(struct ve_struct *ve)
 static void ve_workqueue_stop(struct ve_struct *ve)
 {
 	destroy_workqueue(ve->wq);
+	ve->wq = NULL;
 }
 
+/*
+ * ve_add_to_release_list - called from cgroup1_check_for_release to put a
+ * cgroup into a release workqueue. There are two codepaths that lead to this
+ * function. One starts from cgroup_exit() which holds css_set_lock, another
+ * one from cgroup_destroy_locked which does not hold css_set_lock. So we
+ * should not use any reschedulable
+ *
+ */
 void ve_add_to_release_list(struct cgroup *cgrp)
 {
 	struct ve_struct *ve;
@@ -541,6 +550,10 @@ void ve_add_to_release_list(struct cgroup *cgrp)
 
 	rcu_read_lock();
 	ve = cgroup_get_ve_owner(cgrp);
+	if (!ve->is_running) {
+		rcu_read_unlock();
+		return;
+	}
 
 	spin_lock_irqsave(&ve->release_list_lock, flags);
 	if (!cgroup_is_dead(cgrp) &&
@@ -709,6 +722,14 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
 	 * ve_mutex works as barrier for ve_can_attach().
 	 */
 	ve->is_running = 0;
+	synchronize_rcu();
+
+	/*
+	 * release_agent works on top of umh_worker, so we must make sure, that
+	 * ve workqueue is stopped first.
+	 */
+	ve_workqueue_stop(ve);
+
 	/*
 	 * Neither it can be in pseudosuper state
 	 * anymore, setup it again if needed.
@@ -744,8 +765,6 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
 
 	cgroup_unmark_ve_roots(ve);
 
-	ve_workqueue_stop(ve);
-
 	ve_cleanup_per_cgroot_data(ve, NULL);
 
 	/*