[Devel] [PATCH 1/2] ve/cgroup: At container start check ve's css_set for host-level cgroups.

Valeriy Vdovin valeriy.vdovin at virtuozzo.com
Wed Apr 21 18:36:00 MSK 2021


cgroup_mark_ve_roots is not protected against cases when a container is
started in an invalid cgroup set configuration.
The official supported way of doing that from cgroups point of view is as follows:
1. Create a child cgroup in "ve" cgroup hierarchy.
2. Along with "ve" create child cgroups in all other major cgroup subsystems,
mounted on the system (cpuset, blkio, etc).
3. Create a child cgroup in a special cgroup hierarchy named "systemd".
4. Add a task, that will start a container to each of the newly created cgroups
from above.
5. Now this task should write "START" to "ve.state" property of the relevant
ve cgroup.

>From the userspace it's possible to ignore the supported way and proceed
to starting a container skipping steps 2-4. In kernel code, this results
in ve receiving a root css_set which includes host-level cgroups, which in
turn leads to a variety of problems like trying to add a "release_agent"
file to a host-level cgroup which already has one, as well as trying to
remove it from host-level cgroup at container stop.

Prior to performing actions on cgroups, we should first run a quick check
that none of the host-level cgroups are present in the ve's css_set.
In the check while iterating ve's css_set we skip rootnode cgroup
because it's a special case cgroup that is present in each css_set and
will always give a false positive.

https://jira.sw.ru/browse/PSBM-123506
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
(Cherry-picked from 2a53a1a966f467dd0e015badbd15b4d18faf75bd)

+++
cgroup/ve: at container start only check virtualizable cgroups.

The above commit prevented situation when the a task tried to start
container without first creating the right cgroups context for that.

The logic behind that check was:
- there is a set of cgroups that will be virtualized during container
  start.
- for that these cgroups will be modified.
- the cgroup that will be chosen for modification are in starting task
  css set.
- it is invalid and forbidden to modify cgroups that a located in the
  root of each cgroup hierarchy.
- therefore we have to check all the css set to see if it has cgroups
  with no parent (indication of root) and forbid the whole procedure
  if at least some cgroup matches.

The bug in this behaviour was:
- there are cases when there are non-virtualizable cgroup mounts.
- these are named cgroups which do not have a bound cgroup subsystems
  on them.
- there is one exception which is a named cgroup "systemd".
- therefore container starters do not have to make nested cgroups
  for these type of non-virtualizable cgroup hierarchies.
- therefore there can be named cgroups with parent == NULL in css set
  of a starting task and they will not pass the check and container
  start will fail.

We fix the bug to only check those cgroups in css set, that are
virtualizable. We already have the check helper that is used a bit
later in cgroup_mark_ve_roots, so let's use it.

Fixes 105332edc47c ("ve/cgroup: At container start check ve's css_set for host-level cgroups.")
https://jira.sw.ru/browse/PSBM-125040
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
(Cherry-picked from 4ee3ba25bc0970c6ff659fff9362d70f1affa699)
---
 kernel/cgroup/cgroup.c | 42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b02c88063a27..75447685f258 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1928,12 +1928,6 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 #ifdef CONFIG_VE
 static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
 {
-	/*
-	 * no parent means this is the host cgroup
-	 */
-	if (!cgrp->kn->parent)
-		return false;
-
 	if (cgrp->root->subsys_mask)
 		return true;
 
@@ -1943,6 +1937,30 @@ static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
 	return false;
 }
 
+/*
+ * Iterate all cgroups in a given css_set and check if it is a top cgroup
+ * of it's hierarchy.
+ * rootnode should be ignored as it is always present in each css set as
+ * a placeholder for any unmounted subsystem and will give false positive.
+ */
+static inline bool css_has_host_cgroups(struct css_set *cset)
+{
+	struct cgrp_cset_link *link;
+	lockdep_assert_held(&css_set_lock);
+
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+		if (link->cgrp->root == &cgrp_dfl_root)
+			continue;
+
+		if (!is_virtualized_cgroup(link->cgrp))
+			continue;
+
+		if (!link->cgrp->kn->parent)
+			return true;
+	}
+	return false;
+}
+
 int cgroup_mark_ve_roots(struct ve_struct *ve)
 {
 	int err;
@@ -1976,6 +1994,18 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 		return -ENODEV;
 	}
 
+	/*
+	 * Return early if we know that this procedure will fail due to
+	 * existing root cgroups which are not allowed to be root's in ve's
+	 * context. This is for the case when some task wants to start VE
+	 * without adding itself to all virtualized subgroups (+systemd) first.
+	 */
+	if (css_has_host_cgroups(cset)) {
+		rcu_read_unlock();
+		spin_unlock_irq(&css_set_lock);
+		return -EINVAL;
+	}
+
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 		cgrp = link->cgrp;
 
-- 
2.27.0



More information about the Devel mailing list