[Devel] [PATCH RHEL8 COMMIT] ve/cgroup: At container start check ve's css_set for host-level cgroups

Konstantin Khorenko khorenko at virtuozzo.com
Tue Apr 27 13:42:46 MSK 2021


The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.21
------>
commit ece0b4a2052eb19cefa9efde0f4efefe8d3fdfe1
Author: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
Date:   Wed Apr 21 18:55:04 2021 +0300

    ve/cgroup: At container start check ve's css_set for host-level cgroups
    
    cgroup_mark_ve_roots is not protected against cases when a container is
    started in an invalid cgroup set configuration.
    The official supported way of doing that from cgroups point of view is
    as follows:
    1. Create a child cgroup in "ve" cgroup hierarchy.
    2. Along with "ve" create child cgroups in all other major cgroup subsystems,
    mounted on the system (cpuset, blkio, etc).
    3. Create a child cgroup in a special cgroup hierarchy named "systemd".
    4. Add a task, that will start a container to each of the newly created cgroups
    from above.
    5. Now this task should write "START" to "ve.state" property of the relevant
    ve cgroup.
    
    >From the userspace it's possible to ignore the supported way and proceed
    to starting a container skipping steps 2-4. In kernel code, this results
    in ve receiving a root css_set which includes host-level cgroups, which in
    turn leads to a variety of problems like trying to add a "release_agent"
    file to a host-level cgroup which already has one, as well as trying to
    remove it from host-level cgroup at container stop.
    
    Prior to performing actions on cgroups, we should first run a quick check
    that none of the host-level cgroups are present in the ve's css_set.
    In the check while iterating ve's css_set we skip rootnode cgroup
    because it's a special case cgroup that is present in each css_set and
    will always give a false positive.
    
    https://jira.sw.ru/browse/PSBM-123506
    Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
    
    (Cherry-picked from commit 2a53a1a966f4 ("ve/cgroup: At container start
    check ve's css_set for host-level cgroups."))
    
    +++
    cgroup/ve: at container start only check virtualizable cgroups.
    
    The above commit prevented situation when the a task tried to start
    container without first creating the right cgroups context for that.
    
    The logic behind that check was:
    - there is a set of cgroups that will be virtualized during container
      start.
    - for that these cgroups will be modified.
    - the cgroup that will be chosen for modification are in starting task
      css set.
    - it is invalid and forbidden to modify cgroups that a located in the
      root of each cgroup hierarchy.
    - therefore we have to check all the css set to see if it has cgroups
      with no parent (indication of root) and forbid the whole procedure
      if at least some cgroup matches.
    
    The bug in this behaviour was:
    - there are cases when there are non-virtualizable cgroup mounts.
    - these are named cgroups which do not have a bound cgroup subsystems
      on them.
    - there is one exception which is a named cgroup "systemd".
    - therefore container starters do not have to make nested cgroups
      for these type of non-virtualizable cgroup hierarchies.
    - therefore there can be named cgroups with parent == NULL in css set
      of a starting task and they will not pass the check and container
      start will fail.
    
    We fix the bug to only check those cgroups in css set, that are
    virtualizable. We already have the check helper that is used a bit
    later in cgroup_mark_ve_roots, so let's use it.
    
    mFixes 105332edc47c ("ve/cgroup: At container start check ve's css_set for
    host-level cgroups.")
    https://jira.sw.ru/browse/PSBM-125040
    Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
    
    (Cherry-picked from vz7 commit 4ee3ba25bc09 ("cgroup/ve: at container
    start only check virtualizable cgroups."))
    Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
---
 kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e7c8cafb9c4c..500da91baec7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1928,12 +1928,6 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 #ifdef CONFIG_VE
 static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
 {
-	/*
-	 * no parent means this is the host cgroup
-	 */
-	if (!cgrp->kn->parent)
-		return false;
-
 #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
 	if (cgrp->subsys[debug_cgrp_id])
 		return false;
@@ -1947,6 +1941,31 @@ static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
 	return false;
 }
 
+/*
+ * Iterate all cgroups in a given css_set and check if it is a top cgroup
+ * of it's hierarchy.
+ * rootnode should be ignored as it is always present in each css set as
+ * a placeholder for any unmounted subsystem and will give false positive.
+ */
+static inline bool css_has_host_cgroups(struct css_set *cset)
+{
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&css_set_lock);
+
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+		if (link->cgrp->root == &cgrp_dfl_root)
+			continue;
+
+		if (!is_virtualized_cgroup(link->cgrp))
+			continue;
+
+		if (!link->cgrp->kn->parent)
+			return true;
+	}
+	return false;
+}
+
 int cgroup_mark_ve_roots(struct ve_struct *ve)
 {
 	int err;
@@ -1980,6 +1999,18 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 		return -ENODEV;
 	}
 
+	/*
+	 * Return early if we know that this procedure will fail due to
+	 * existing root cgroups which are not allowed to be root's in ve's
+	 * context. This is for the case when some task wants to start VE
+	 * without adding itself to all virtualized subgroups (+systemd) first.
+	 */
+	if (css_has_host_cgroups(cset)) {
+		rcu_read_unlock();
+		spin_unlock_irq(&css_set_lock);
+		return -EINVAL;
+	}
+
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 		cgrp = link->cgrp;
 


More information about the Devel mailing list