[Devel] [PATCH RHEL7 COMMIT] ms/mm: memcontrol: fix swap counter leak on swapout from offline cgroup

Wed Jul 5 18:37:01 MSK 2023

The commit is pushed to "branch-rh7-3.10.0-1160.90.1.vz7.200.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.90.1.vz7.200.4
------>
commit c090af3a18595ef850087d3428fea38a9346b4b3
Author: Vladimir Davydov <vdavydov at virtuozzo.com>
Date:   Wed Jul 5 14:39:46 2023 +0800

    ms/mm: memcontrol: fix swap counter leak on swapout from offline cgroup
    
    An offline memory cgroup might have anonymous memory or shmem left
    charged to it and no swap.  Since only swap entries pin the id of an
    offline cgroup, such a cgroup will have no id and so an attempt to
    swapout its anon/shmem will not store memory cgroup info in the swap
    cgroup map.  As a result, memcg->swap or memcg->memsw will never get
    uncharged from it or any of its ancestors.
    
    Fix this by always charging swapout to the first ancestor cgroup that
    hasn't released its id yet.
    
    [hannes at cmpxchg.org: add comment to mem_cgroup_swapout]
    [vdavydov at virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
      Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov <vdavydov at virtuozzo.com>
    
    Acked-by: Johannes Weiner <hannes at cmpxchg.org>
    Acked-by: Michal Hocko <mhocko at suse.com>
    Cc: <stable at vger.kernel.org>    [3.19+]
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    https://jira.vzint.dev/browse/PSBM-147036
    
    (cherry picked from commit 1f47b61fb4077936465dcde872a4e5cc4fe708da)
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    =================
    Patchset description:
    memcg: release id when offlining cgroup
    
    We see that a container user can exhaust the memory cgroup ids on the
    system (64k) and prevent further memory cgroup creation. In a crash dump
    collected by our customer in such a situation, mem_cgroup_idr is full of
    cgroups from one container with the exact same path (the cgroup of the
    docker service). These cgroups are not released because they have kmem
    charges; the kmem charge is for a tmpfs dentry allocated from the
    cgroup. (On the vz7 kernel it seems that such a dentry is only released
    after umounting the tmpfs or removing the corresponding file from it.)
    
    So there is a valid way to hold a kmem cgroup for a long time. A similar
    problem was mentioned in mainstream, with page cache holding a kmem
    cgroup for a long time. The solution proposed there is to release the
    cgroup id early so that new cgroups can be allocated immediately.
    
    Reproduce:
    https://git.vzint.dev/users/ptikhomirov/repos/helpers/browse/memcg-related/test-mycg-tmpfs.sh
    
    After this fix the number of memory cgroups shown in /proc/cgroups can
    exceed 64k, because memory cgroups are allowed to linger after their
    ids have been released.
    
    Note: Maybe it's a bad idea to allow a container to consume kernel
    memory with such lingering cgroups, but I don't have a better idea yet.
    
    https://jira.vzint.dev/browse/PSBM-147473
    https://jira.vzint.dev/browse/PSBM-147036
---
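Illustration only, not part of the commit: a minimal userspace sketch of
the idea described above, namely walking up to the first ancestor that
still holds an id reference and transferring the memory+swap charge to
it. All names here (struct cg, cg_init, cg_charge, cg_id_get_online,
cg_swapout) are hypothetical stand-ins, not kernel API. The counters are
plain longs rather than page_counter, and the kernel's special-casing of
the root cgroup is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct cg {
    struct cg *parent;   /* NULL for the root */
    atomic_int id_ref;   /* 0 means the id has already been released */
    long memsw;          /* hierarchical memory+swap charge, in pages */
    long swap_entries;   /* swap entries recorded against this id */
};

static void cg_init(struct cg *cg, struct cg *parent, int id_ref)
{
    cg->parent = parent;
    atomic_init(&cg->id_ref, id_ref);
    cg->memsw = 0;
    cg->swap_entries = 0;
}

/* Hierarchical charge: applies to cg and every ancestor up to the root. */
static void cg_charge(struct cg *cg, long nr_pages)
{
    for (; cg; cg = cg->parent)
        cg->memsw += nr_pages;
}

/* Userspace analogue of atomic_inc_not_zero(). */
static bool ref_get_unless_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded and the zero check is repeated. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Return the closest cgroup (cg itself or an ancestor) that still has an id. */
static struct cg *cg_id_get_online(struct cg *cg)
{
    while (!ref_get_unless_zero(&cg->id_ref)) {
        /* The root is assumed never to drop its last id reference. */
        assert(cg->parent != NULL);
        cg = cg->parent;
    }
    return cg;
}

/* Record one swapped-out page and move its memsw charge if needed. */
static struct cg *cg_swapout(struct cg *cg)
{
    struct cg *swap_cg = cg_id_get_online(cg);

    swap_cg->swap_entries++;
    if (swap_cg != cg) {
        /* Net effect: cg (and anything below swap_cg) loses the charge. */
        cg_charge(swap_cg, 1);
        cg_charge(cg, -1);
    }
    return swap_cg;
}
```

With a root, an online parent and an offline child holding one charged
page, cg_swapout(&child) in this model returns the parent, takes an id
reference on it, and leaves the child's memsw at zero while the parent's
and root's counters are unchanged.
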
 mm/memcontrol.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02dddc4c8cb8..d106e8f64997 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6676,6 +6676,24 @@ static void mem_cgroup_id_get(struct mem_cgroup *memcg)
 	atomic_inc(&memcg->id.ref);
 }
 
+static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
+{
+	while (!atomic_inc_not_zero(&memcg->id.ref)) {
+		/*
+		 * The root cgroup cannot be destroyed, so it's refcount must
+		 * always be >= 1.
+		 */
+		if (WARN_ON_ONCE(memcg == root_mem_cgroup)) {
+			VM_BUG_ON(1);
+			break;
+		}
+		memcg = parent_mem_cgroup(memcg);
+		if (!memcg)
+			memcg = root_mem_cgroup;
+	}
+	return memcg;
+}
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
 	if (atomic_dec_and_test(&memcg->id.ref)) {
@@ -7756,7 +7774,7 @@ static void __init enable_swap_cgroup(void)
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg, *swap_memcg;
 	struct page_cgroup *pc;
 	unsigned short oldid;
 
@@ -7775,17 +7793,28 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
 	memcg = pc->mem_cgroup;
 
-	mem_cgroup_id_get(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	/*
+	 * In case the memcg owning these pages has been offlined and doesn't
+	 * have an ID allocated to it anymore, charge the closest online
+	 * ancestor for the swap instead and transfer the memory+swap charge.
+	 */
+	swap_memcg = mem_cgroup_id_get_online(memcg);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
-	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
+	mem_cgroup_swap_statistics(swap_memcg, true);
+	this_cpu_inc(swap_memcg->stat->events[MEM_CGROUP_EVENTS_PSWPOUT]);
 
 	pc->flags = 0;
 
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);
 
+	if (memcg != swap_memcg) {
+		if (!mem_cgroup_is_root(swap_memcg))
+			page_counter_charge(&swap_memcg->memsw, 1);
+		page_counter_uncharge(&memcg->memsw, 1);
+	}
+
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());