[Devel] [PATCH RHEL8 COMMIT] ms/mm: list_lru: set shrinker map bit when child nr_items is not zero

Konstantin Khorenko khorenko at virtuozzo.com
Wed Jun 2 19:48:02 MSK 2021


The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.34
------>
commit 6b969b452252e0cb70d850ec8fb3d921bfdccf5b
Author: Vasily Averin <vvs at virtuozzo.com>
Date:   Wed Jun 2 19:48:02 2021 +0300

    ms/mm: list_lru: set shrinker map bit when child nr_items is not zero
    
    When investigating a slab cache bloat problem, significant amount of
    negative dentry cache was seen, but confusingly they neither got shrunk
    by reclaimer (the host has very tight memory) nor be shrunk by dropping
    cache.  The vmcore shows there are over 14M negative dentry objects on
    lru, but tracing result shows they were even not scanned at all.
    
    Further investigation shows the memcg's vfs shrinker_map bit is not set.
    So the reclaimer or dropping cache just skip calling vfs shrinker.  So
    we have to reboot the hosts to get the memory back.
    
    I didn't manage to come up with a reproducer in test environment, and
    the problem can't be reproduced after rebooting.  But it seems there is
    race between shrinker map bit clear and reparenting by code inspection.
    The hypothesis is elaborated as below.
    
    The memcg hierarchy on our production environment looks like:
    
                    root
                   /    \
              system   user
    
    The main workloads are running under user slice's children, and it
    creates and removes memcg frequently.  So reparenting happens very often
    under user slice, but no task is under user slice directly.
    
    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:
    
           CPU A                            CPU B
    reparent
        dst->nr_items == 0
                                     shrinker:
                                         total_objects == 0
        add src->nr_items to dst
        set_bit
                                         return SHRINK_EMPTY
                                         clear_bit
    child memcg offline
        replace child's kmemcg_id with
        parent's (in memcg_offline_kmem())
                                      list_lru_del() between shrinker runs
                                         see parent's kmemcg_id
                                         dec dst->nr_items
    reparent again
        dst->nr_items may go negative
        due to concurrent list_lru_del()
    
                                     The second run of shrinker:
                                         read nr_items without any
                                         synchronization, so it may
                                         see intermediate negative
                                         nr_items then total_objects
                                         may return 0 coincidently
    
                                         keep the bit cleared
        dst->nr_items != 0
        skip set_bit
        add scr->nr_item to dst
    
    After this point dst->nr_item may never go zero, so reparenting will not
    set shrinker_map bit anymore.  And since there is no task under user
    slice directly, so no new object will be added to its lru to set the
    shrinker map bit either.  That bit is kept cleared forever.
    
    How does list_lru_del() race with reparenting? It is because reparenting
    replaces children's kmemcg_id to parent's without protecting from
    nlru->lock, so list_lru_del() may see parent's kmemcg_id but actually
    deleting items from child's lru, but dec'ing parent's nr_items, so the
    parent's nr_items may go negative as commit 2788cf0c401c ("memcg:
    reparent list_lrus and free kmemcg_id on css offline") says.
    
    Since it is impossible that dst->nr_items goes negative and
    src->nr_items goes zero at the same time, so it seems we could set the
    shrinker map bit iff src->nr_items != 0.  We could synchronize
    list_lru_count_one() and reparenting with nlru->lock, but it seems
    checking src->nr_items in reparenting is the simplest and avoids lock
    contention.
    
    mFixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin <guro at fb.com>
    Signed-off-by: Yang Shi <shy828301 at gmail.com>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Reviewed-by: Roman Gushchin <guro at fb.com>
    Reviewed-by: Shakeel Butt <shakeelb at google.com>
    Acked-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    Cc: Vladimir Davydov <vdavydov.dev at gmail.com>
    Cc: <stable at vger.kernel.org>    [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    (cherry picked from commit 8199be001a47 ("mm: list_lru: set shrinker map bit when
    child nr_items is not zero"))
    
    Taken from vz7 commit 96940bede65b ("ms/mm: list_lru: set shrinker map bit when
    child nr_items is not zero")
    
    Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
---
 mm/list_lru.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index addcf5c1dcd9..77fcaee10ec5 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -533,7 +533,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -545,11 +544,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }


More information about the Devel mailing list