[Devel] [PATCH RHEL9 COMMIT] ve/cgroup: fix sleeping call under spin_lock

Konstantin Khorenko khorenko at virtuozzo.com
Wed Oct 27 16:19:09 MSK 2021


The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh9-5.14.0-4.vz9.10.19
------>
commit d45767c9075496422c5aa67d3f65673070365cef
Author: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
Date:   Wed Oct 27 16:19:09 2021 +0300

    ve/cgroup: fix sleeping call under spin_lock
    
    We get crash on a container start:
    
    [  120.818119] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:201
    [  120.819805] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 3299, name: vzctl
    [  120.821504] INFO: lockdep is turned off.
    [  120.823206] irq event stamp: 6130
    [  120.824901] hardirqs last  enabled at (6129): [<ffffffff93e00cc2>] asm_sysvec_apic_timer_interrupt+0x12/0x20
    [  120.826710] hardirqs last disabled at (6130): [<ffffffff93d1d9f9>] _raw_spin_lock_irq+0x59/0x80
    [  120.828525] softirqs last  enabled at (6128): [<ffffffff940005e5>] __do_softirq+0x5e5/0x939
    [  120.830350] softirqs last disabled at (6119): [<ffffffff91df9d79>] irq_exit_rcu+0x1e9/0x240
    [  120.832178] CPU: 1 PID: 3299 Comm: vzctl ve: 200 Tainted: G        W        --------- ---  5.14.0-4.vz9.10.17+debug #1 10.17
    [  120.834095] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
    [  120.836022] Call Trace:
    [  120.837930]  dump_stack_lvl+0x56/0x7b
    ...
    [  120.839851]  ___might_sleep.cold.155+0x140/0x16e
    [  120.841792]  __kmalloc_track_caller+0x289/0x2f0
    [  120.843730]  kstrdup+0x2d/0x50
    [  120.845665]  __kernfs_new_node+0xaa/0x780
    ...
    [  120.861205]  kernfs_new_node+0x77/0x140
    [  120.863134]  __kernfs_create_file+0x37/0x2e0
    [  120.865064]  cgroup_add_file+0x1a4/0x4e0
    [  120.872775]  cgroup_mark_ve_roots+0x4e6/0x6a0
    [  120.874697]  ve_state_write+0x8b8/0xc50
    [  120.876630]  cgroup_file_write+0x40c/0x930
    ...
    [  120.884444]  kernfs_fop_write_iter+0x2d3/0x490
    [  120.886384]  new_sync_write+0x3b2/0x620
    [  120.893982]  vfs_write+0x4b5/0x8f0
    [  120.895769]  ksys_write+0xf1/0x1c0
    
    The problem is that the cgroup_add_file() function
    performs GFP_KERNEL allocations, and we call it under a spinlock.
    
    After discussing the problem with Pavel, we decided to
    release the css_set_lock spinlock. This looks safe because
    every path that removes elements from ->cgrp_links is
    protected by the cset refcount, which is at least one here
    since we hold the cgroup namespace connected to our VE.
    
    Fixes: ee84e05864f9a ("ve/cgroup: Set release_agent_path for root cgroups separately")
    
    https://jira.sw.ru/browse/PSBM-135190
    
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
---
 kernel/cgroup/cgroup.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 83fa33063a94..0c4d44e497e3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2088,6 +2088,29 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 		return -EINVAL;
 	}
 
+	/*
+	 * We want to traverse the cset->cgrp_links list and
+	 * call cgroup_add_file(), which performs memory
+	 * allocations with the GFP_KERNEL flag. We can't
+	 * continue to hold css_set_lock, but it is safe to
+	 * keep holding cgroup_mutex.
+	 *
+	 * Holding only cgroup_mutex while traversing the cset
+	 * list is safe because in all scenarios where the
+	 * ->cgrp_links list gets modified:
+	 *
+	 * 1. cgroup_destroy_root
+	 * holds cgroup_mutex;
+	 *
+	 * 2. put_css_set_locked
+	 * here we are protected by the
+	 * !refcount_dec_and_test(&cset->refcount) check, which
+	 * prevents any list modification while someone still
+	 * holds a reference to the cset. See copy_cgroup_ns(),
+	 * which takes references via get_css_set().
+	 */
+	spin_unlock_irq(&css_set_lock);
+
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 		cgrp = link->cgrp;
 
@@ -2113,7 +2136,6 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 	}
 
 	link_ve_root_cpu_cgroup(cset->subsys[cpu_cgrp_id]);
-	spin_unlock_irq(&css_set_lock);
 	mutex_unlock(&cgroup_mutex);
 	synchronize_rcu();
 

