[Devel] [PATCH RHEL8 COMMIT] ve/cgroup: At container start check ve's css_set for host-level cgroups
Konstantin Khorenko
khorenko at virtuozzo.com
Tue Apr 27 13:42:46 MSK 2021
The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.21
------>
commit ece0b4a2052eb19cefa9efde0f4efefe8d3fdfe1
Author: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
Date: Wed Apr 21 18:55:04 2021 +0300
ve/cgroup: At container start check ve's css_set for host-level cgroups
cgroup_mark_ve_roots is not protected against cases when a container is
started in an invalid cgroup set configuration.
The official supported way of doing that from cgroups point of view is
as follows:
1. Create a child cgroup in "ve" cgroup hierarchy.
2. Along with "ve" create child cgroups in all other major cgroup subsystems,
mounted on the system (cpuset, blkio, etc).
3. Create a child cgroup in a special cgroup hierarchy named "systemd".
4. Add a task, that will start a container to each of the newly created cgroups
from above.
5. Now this task should write "START" to "ve.state" property of the relevant
ve cgroup.
>From the userspace it's possible to ignore the supported way and proceed
to starting a container skipping steps 2-4. In kernel code, this results
in ve receiving a root css_set which includes host-level cgroups, which in
turn leads to a variety of problems like trying to add a "release_agent"
file to a host-level cgroup which already has one, as well as trying to
remove it from host-level cgroup at container stop.
Prior to performing actions on cgroups, we should first run a quick check
that none of the host-level cgroups are present in the ve's css_set.
In the check while iterating ve's css_set we skip rootnode cgroup
because it's a special case cgroup that is present in each css_set and
will always give a false positive.
https://jira.sw.ru/browse/PSBM-123506
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
(Cherry-picked from commit 2a53a1a966f4 ("ve/cgroup: At container start
check ve's css_set for host-level cgroups."))
+++
cgroup/ve: at container start only check virtualizable cgroups.
The above commit prevented situation when the a task tried to start
container without first creating the right cgroups context for that.
The logic behind that check was:
- there is a set of cgroups that will be virtualized during container
start.
- for that these cgroups will be modified.
- the cgroup that will be chosen for modification are in starting task
css set.
- it is invalid and forbidden to modify cgroups that a located in the
root of each cgroup hierarchy.
- therefore we have to check all the css set to see if it has cgroups
with no parent (indication of root) and forbid the whole procedure
if at least some cgroup matches.
The bug in this behaviour was:
- there are cases when there are non-virtualizable cgroup mounts.
- these are named cgroups which do not have a bound cgroup subsystems
on them.
- there is one exception which is a named cgroup "systemd".
- therefore container starters do not have to make nested cgroups
for these type of non-virtualizable cgroup hierarchies.
- therefore there can be named cgroups with parent == NULL in css set
of a starting task and they will not pass the check and container
start will fail.
We fix the bug to only check those cgroups in css set, that are
virtualizable. We already have the check helper that is used a bit
later in cgroup_mark_ve_roots, so let's use it.
mFixes 105332edc47c ("ve/cgroup: At container start check ve's css_set for
host-level cgroups.")
https://jira.sw.ru/browse/PSBM-125040
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
(Cherry-picked from vz7 commit 4ee3ba25bc09 ("cgroup/ve: at container
start only check virtualizable cgroups."))
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
---
kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++++------
1 file changed, 37 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e7c8cafb9c4c..500da91baec7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1928,12 +1928,6 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
#ifdef CONFIG_VE
static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
{
- /*
- * no parent means this is the host cgroup
- */
- if (!cgrp->kn->parent)
- return false;
-
#if IS_ENABLED(CONFIG_CGROUP_DEBUG)
if (cgrp->subsys[debug_cgrp_id])
return false;
@@ -1947,6 +1941,31 @@ static inline bool is_virtualized_cgroup(struct cgroup *cgrp)
return false;
}
+/*
+ * Iterate all cgroups in a given css_set and check if it is a top cgroup
+ * of it's hierarchy.
+ * rootnode should be ignored as it is always present in each css set as
+ * a placeholder for any unmounted subsystem and will give false positive.
+ */
+static inline bool css_has_host_cgroups(struct css_set *cset)
+{
+ struct cgrp_cset_link *link;
+
+ lockdep_assert_held(&css_set_lock);
+
+ list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+ if (link->cgrp->root == &cgrp_dfl_root)
+ continue;
+
+ if (!is_virtualized_cgroup(link->cgrp))
+ continue;
+
+ if (!link->cgrp->kn->parent)
+ return true;
+ }
+ return false;
+}
+
int cgroup_mark_ve_roots(struct ve_struct *ve)
{
int err;
@@ -1980,6 +1999,18 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
return -ENODEV;
}
+ /*
+ * Return early if we know that this procedure will fail due to
+ * existing root cgroups which are not allowed to be root's in ve's
+ * context. This is for the case when some task wants to start VE
+ * without adding itself to all virtualized subgroups (+systemd) first.
+ */
+ if (css_has_host_cgroups(cset)) {
+ rcu_read_unlock();
+ spin_unlock_irq(&css_set_lock);
+ return -EINVAL;
+ }
+
list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
cgrp = link->cgrp;
More information about the Devel
mailing list