[Devel] [PATCH RHEL7 COMMIT] ms/mm: memcontrol: fix cgroup creation failure after many small jobs

Wed Jul 5 18:37:01 MSK 2023

The commit is pushed to "branch-rh7-3.10.0-1160.90.1.vz7.200.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.90.1.vz7.200.4
------>
commit 9dcef96ce3e2e10423478fcec6338da07c446b16
Author: Johannes Weiner <hannes at cmpxchg.org>
Date:   Wed Jul 5 14:39:45 2023 +0800

    ms/mm: memcontrol: fix cgroup creation failure after many small jobs
    
    The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears.  At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild.  Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
    fails with -ENOSPC while there are only a few, or even no user-visible
    cgroups in existence.
    
    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.
    
    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later.  They pose no hurdle.
    
    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages.  And those
    references are under the user's control, so they are manageable.
    
    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that.  This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.
    
    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:
    
      set -e
      mkdir -p pages
      for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
      done
    
    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:
    
      [root at ham ~]# ./cssidstress.sh
      [...]
      65000
      mkdir: cannot create directory '/cgroup/foo': No space left on device
    
    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.
    
    [hannes at cmpxchg.org: init the IDR]
      Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes at cmpxchg.org>
    Reported-by: John Garcia <john.garcia at mesosphere.io>
    Reviewed-by: Vladimir Davydov <vdavydov at virtuozzo.com>
    Acked-by: Tejun Heo <tj at kernel.org>
    Cc: Nikolay Borisov <kernel at kyup.com>
    Cc: <stable at vger.kernel.org>    [3.19+]
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    Changes:
     - rhel almost ported it, so some things are already there
     - remove synchronize_rcu() from rhel version
     - skip mem_cgroup_try_charge_swap hunk
     - in original patch we didn't have if (!cont->parent) return 0; thing in
       mem_cgroup_css_online, we should handle memcg->id.id before this not to break
       refcounting on cgroups without parent (e.g. root_mem_cgroup)
    
    https://jira.vzint.dev/browse/PSBM-147036
    
    (cherry picked from commit 73f576c04b9410ed19660f74f97521bee6e1c546)
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    =================
    Patchset description:
    memcg: release id when offlinging cgroup
    
    We see that container user can deplete memory cgroup ids on the system
    (64k) and prevent further memory cgroup creation. In crash collected by
    our customer in such a situation we see that mem_cgroup_idr is full of
    cgroups from one container with same exact path (cgroup of docker
    service), cgroups are not released because they have kmem charges, this
    kmem charge is for a tmpfs dentry allocated from this cgroup. (And on
    vz7 kernel it seems that such a dentry is only released after umounting
    tmpfs or removing the corresponding file from tmpfs.)
    
    So there is a valid way to hold kmem cgroup for a long time. Similar
    thing was mentioned in mainstream with page cache holding kmem cgroup
    for a long time. And they proposed a way to deal with it - just release
    cgroup id early so that one can allocate new cgroups immediately.
    
    Reproduce:
    https://git.vzint.dev/users/ptikhomirov/repos/helpers/browse/memcg-related/test-mycg-tmpfs.sh
    
    After this fix the number of memory cgroups in /proc/cgroups can now
    show > 64k as we allow to leave memory cgroups hanging while releasing
    their ids.
    
    Note: Maybe it's a bad idea to allow container to eat kernel
    memory with such a hanging cgroups, but yet I don't have better ideas.
    
    https://jira.vzint.dev/browse/PSBM-147473
    https://jira.vzint.dev/browse/PSBM-147036
---
 mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99e0dbfe8f77..02dddc4c8cb8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -176,6 +176,11 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET	1024
 
+struct mem_cgroup_id {
+	int id;
+	atomic_t ref;
+};
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg);
 
 struct mem_cgroup_stat_cpu {
@@ -302,7 +307,7 @@ struct mem_cgroup {
 	struct cgroup_subsys_state css;
 
 	/* Private memcg ID. Used to ID objects that outlive the cgroup */
-	unsigned short id;
+	struct mem_cgroup_id id;
 
 	/*
 	 * the counter to account for memory usage
@@ -6662,14 +6667,24 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	return memcg->id;
+
+	return memcg->id.id;
+}
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+	atomic_inc(&memcg->id.ref);
 }
 
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
-	idr_remove(&mem_cgroup_idr, memcg->id);
-	memcg->id = 0;
-	synchronize_rcu();
+	if (atomic_dec_and_test(&memcg->id.ref)) {
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
+		memcg->id.id = 0;
+
+		/* Memcg ID pins CSS */
+		css_put(&memcg->css);
+	}
 }
 
 /**
@@ -6723,7 +6738,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 {
 	struct mem_cgroup *memcg;
 	size_t size;
-	int id;
 	int i, ret;
 	int node;
 
@@ -6734,14 +6748,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg)
 		return NULL;
 
-	id = idr_alloc(&mem_cgroup_idr, NULL,
+	memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
 		       1, MEM_CGROUP_ID_MAX,
 		       GFP_KERNEL);
-	if (id < 0)
+	if (memcg->id.id < 0)
 		goto fail;
 
-	memcg->id = id;
-
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg->stat)
 		goto out_free;
@@ -6756,8 +6768,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 			goto out_pcpu_free;
 	}
 	spin_lock_init(&memcg->pcp_counter_lock);
-	idr_replace(&mem_cgroup_idr, memcg, memcg->id);
-	synchronize_rcu();
+	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
 
 out_pcpu_free:
@@ -6769,10 +6780,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	for_each_node(node)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
-	if (memcg->id > 0) {
-		idr_remove(&mem_cgroup_idr, memcg->id);
-		synchronize_rcu();
-	}
+	if (memcg->id.id > 0)
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
 fail:
 	kfree(memcg);
 	return NULL;
@@ -6915,11 +6924,18 @@ mem_cgroup_css_online(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg, *parent;
 
-	if (!cont->parent)
-		return 0;
-
 	mutex_lock(&memcg_create_mutex);
 	memcg = mem_cgroup_from_cont(cont);
+
+	/* Online state pins memcg ID, memcg ID pins CSS */
+	mem_cgroup_id_get(memcg);
+	css_get(&memcg->css);
+
+	if (!cont->parent) {
+		mutex_unlock(&memcg_create_mutex);
+		return 0;
+	}
+
 	parent = mem_cgroup_from_cont(cont->parent);
 
 	memcg->use_hierarchy = parent->use_hierarchy;
@@ -7035,6 +7051,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
 	 * no longer iterate over it.
 	 */
 	release_oom_context(&memcg->oom_ctx);
+
+	mem_cgroup_id_put(memcg);
 }
 
 static void mem_cgroup_css_free(struct cgroup *cont)
@@ -7757,6 +7775,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
 	memcg = pc->mem_cgroup;
 
+	mem_cgroup_id_get(memcg);
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
@@ -7772,6 +7791,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
+
+	if (!mem_cgroup_is_root(memcg))
+		css_put(&memcg->css);
 }
 
 /**
@@ -7796,7 +7818,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 			page_counter_uncharge(&memcg->memsw, 1);
 		mem_cgroup_swap_statistics(memcg, false);
 		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPIN]);
-		css_put(&memcg->css);
+		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
 }