[Devel] [PATCH RH8 v2] mm: fix hanging shrinker management on long do_shrink_slab
Andrey Zhadchenko
andrey.zhadchenko at virtuozzo.com
Thu Jun 24 17:44:39 MSK 2021
From: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
We have a problem that shrinker_rwsem can be held for a long time for
read, while many processes are trying to manage shrinkers and thus take
the lock for write - all such processes hang like:
crash> bt ffffa0117d226000
PID: 45299 TASK: ffffa0117d226000 CPU: 16 COMMAND: "vzctl"
#0 [ffff9ffe8b793b28] __schedule at ffffffffaf799a12
#1 [ffff9ffe8b793bb0] schedule at ffffffffaf799ed9
#2 [ffff9ffe8b793bc0] rwsem_down_write_failed at ffffffffaf79b815
#3 [ffff9ffe8b793c58] call_rwsem_down_write_failed at ffffffffaf3ae807
#4 [ffff9ffe8b793ca0] down_write at ffffffffaf7991ed
#5 [ffff9ffe8b793cb8] register_shrinker at ffffffffaf1ddc8d
#6 [ffff9ffe8b793cd8] sget_userns at ffffffffaf264329
#7 [ffff9ffe8b793d30] sget at ffffffffaf2643fd
#8 [ffff9ffe8b793d70] mount_bdev at ffffffffaf26451a
#9 [ffff9ffe8b793de0] ext4_mount at ffffffffc07d9f94 [ext4]
#10 [ffff9ffe8b793e10] mount_fs at ffffffffaf2652de
#11 [ffff9ffe8b793e58] vfs_kern_mount at ffffffffaf283a27
#12 [ffff9ffe8b793e90] do_mount at ffffffffaf2863e9
#13 [ffff9ffe8b793f18] sys_mount at ffffffffaf287263
#14 [ffff9ffe8b793f50] system_call_fastpath at ffffffffaf7a6edb
RIP: 00007fb4d1d3978a RSP: 00007fff835a1f60 RFLAGS: 00010206
RAX: 00000000000000a5 RBX: 00007fff835a43f0 RCX: 00007fb4d2ff2820
RDX: 00007fb4d205d4b7 RSI: 0000558e62ddf520 RDI: 00007fff835a4290
RBP: 00007fb4d205d4b7 R8: 00007fff835a2ba0 R9: 00007fb4d1c8820d
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fff835a2ba0
R13: 00007fff835a4290 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
This means that mounting anything new is currently impossible, as
register_shrinker() can't take the lock; unmounting is blocked the same way.
The holder of shrinker_rwsem is hanging in do_shrink_slab:
crash> bt ffffa02ebab7c000
PID: 977061 TASK: ffffa02ebab7c000 CPU: 30 COMMAND: "crond"
#0 [ffffa02a7b2b74e8] __schedule at ffffffffaf799a12
#1 [ffffa02a7b2b7570] schedule at ffffffffaf799ed9
#2 [ffffa02a7b2b7580] rpc_wait_bit_killable at ffffffffc02d12f1 [sunrpc]
#3 [ffffa02a7b2b7598] __wait_on_bit at ffffffffaf797b37
#4 [ffffa02a7b2b75d8] out_of_line_wait_on_bit at ffffffffaf797ca1
#5 [ffffa02a7b2b7650] __rpc_wait_for_completion_task at ffffffffc02d12dd [sunrpc]
#6 [ffffa02a7b2b7660] _nfs4_proc_delegreturn at ffffffffc0bcabfa [nfsv4]
#7 [ffffa02a7b2b7710] nfs4_proc_delegreturn at ffffffffc0bd0b31 [nfsv4]
#8 [ffffa02a7b2b7788] nfs_do_return_delegation at ffffffffc0be4699 [nfsv4]
#9 [ffffa02a7b2b77a8] nfs_inode_return_delegation_noreclaim at ffffffffc0be53c7 [nfsv4]
#10 [ffffa02a7b2b77c0] nfs4_evict_inode at ffffffffc0be3a19 [nfsv4]
#11 [ffffa02a7b2b77d8] evict at ffffffffaf27e7b4
#12 [ffffa02a7b2b7800] dispose_list at ffffffffaf27e8be
#13 [ffffa02a7b2b7828] prune_icache_sb at ffffffffaf27f9e8
#14 [ffffa02a7b2b7880] super_cache_scan at ffffffffaf264ae7
#15 [ffffa02a7b2b78c0] do_shrink_slab at ffffffffaf1dd443
#16 [ffffa02a7b2b7938] shrink_slab at ffffffffaf1ddef4
#17 [ffffa02a7b2b79b8] shrink_zone at ffffffffaf1e1451
#18 [ffffa02a7b2b7a60] do_try_to_free_pages at ffffffffaf1e2790
#19 [ffffa02a7b2b7b08] try_to_free_mem_cgroup_pages at ffffffffaf1e2e3e
#20 [ffffa02a7b2b7ba8] try_charge at ffffffffc0a33e8c [kpatch_cumulative_89_2_r1]
#21 [ffffa02a7b2b7c70] memcg_charge_kmem at ffffffffaf251611
#22 [ffffa02a7b2b7c90] __memcg_kmem_newpage_charge at ffffffffaf2519c4
#23 [ffffa02a7b2b7cb8] __alloc_pages_nodemask at ffffffffaf1d4696
#24 [ffffa02a7b2b7d70] alloc_pages_current at ffffffffaf226fc8
#25 [ffffa02a7b2b7db8] pte_alloc_one at ffffffffaf078d17
#26 [ffffa02a7b2b7dd0] __pte_alloc at ffffffffaf1fe683
#27 [ffffa02a7b2b7e08] handle_mm_fault at ffffffffaf204e59
#28 [ffffa02a7b2b7eb0] __do_page_fault at ffffffffaf7a16e3
#29 [ffffa02a7b2b7f20] do_page_fault at ffffffffaf7a1a05
#30 [ffffa02a7b2b7f50] page_fault at ffffffffaf79d768
RIP: 00007f1e54913fa0 RSP: 00007ffe50a21898 RFLAGS: 00010206
RAX: 00007f1e50bce1f0 RBX: 000000037ffff1a0 RCX: 00007f1e549226a7
RDX: 0000000000000002 RSI: 0000557241e48a20 RDI: 0000557241e489e0
RBP: 00007ffe50a21a00 R8: 00007f1e50bce000 R9: 0000000070000021
R10: 0000000000000031 R11: 0000000000000246 R12: 0000557241e489e0
R13: 00007ffe50a21ae0 R14: 0000000000000003 R15: 000000006ffffeff
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
It takes shrinker_rwsem in shrink_slab() while traversing shrinker_list.
It tries to shrink something on nfs (mounted hard), but the nfs server is
already dead at this moment and the rpc will never succeed. In general, any
shrinker can spend significant time in do_shrink_slab(), so it's a bad idea
to hold the list lock here.
We have a similar problem in shrink_slab_memcg, except that we are
traversing shrinker_map+shrinker_idr there.
The idea of the patch is to take a refcount on the chosen shrinker so it
can't disappear, and to release shrinker_rwsem while we are in
do_shrink_slab(); after processing we reacquire shrinker_rwsem, drop
the refcount and continue the traversal.
We also need a wait_queue so that unregister_shrinker() can wait for the
refcnt to become zero. Only after that can we safely remove the
shrinker from the list and idr, and free the shrinker.
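The refcount-plus-waitqueue lifetime scheme described above can be sketched
in plain userspace C (a hand-rolled analog, assuming C11 atomics and a
pthread condvar in place of the kernel wait_queue; obj_t and the function
names are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Analog of the patch's scheme: a refcount pins the object while a
 * worker uses it; the unregister path waits for it to drain to zero. */
typedef struct {
	atomic_int refcnt;          /* like shrinker->refcnt */
	pthread_mutex_t lock;       /* stand-in for the wait_queue */
	pthread_cond_t drained;
} obj_t;

static void obj_init(obj_t *o)
{
	atomic_init(&o->refcnt, 1); /* registration holds one reference */
	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->drained, NULL);
}

/* get: succeed only if the object is not mid-unregister (refcnt > 0),
 * mirroring refcount_inc_not_zero() in get_shrinker(). */
static bool obj_get(obj_t *o)
{
	int v = atomic_load(&o->refcnt);
	while (v > 0)
		if (atomic_compare_exchange_weak(&o->refcnt, &v, v + 1))
			return true;
	return false;
}

/* put: wake the unregister path when the last reference drops */
static void obj_put(obj_t *o)
{
	if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
		pthread_mutex_lock(&o->lock);
		pthread_cond_broadcast(&o->drained);
		pthread_mutex_unlock(&o->lock);
	}
}

/* unregister: drop the registration ref, then wait for users to drain */
static void obj_unregister(obj_t *o)
{
	pthread_mutex_lock(&o->lock);
	if (atomic_fetch_sub(&o->refcnt, 1) != 1)
		while (atomic_load(&o->refcnt) != 0)
			pthread_cond_wait(&o->drained, &o->lock);
	pthread_mutex_unlock(&o->lock);
}
```

Here obj_get() refuses to pin an object whose refcount already hit zero,
i.e. one that is already in the middle of unregistering, exactly like
get_shrinker() below.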
I've also managed to reproduce the nfs hang in do_shrink_slab() with the
patch applied; all other mount/umount operations and all CT start/stop
operations pass fine without any hang. See more info on the reproduction
in jira.
https://jira.sw.ru/browse/PSBM-99181
v2: protect wake_up in put_shrinker from freed shrinker via rcu
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Acked-by: Kirill Tkhai <ktkhai at virtuozzo.com>
+++
memcg: shrink_slab_memcg() cleanup
After update shrink_slab_memcg() releases shrinker_rwsem and then takes
it back. It makes useless rwsem_is_contended(&shrinker_rwsem) check.
Fixes: 311e5a4499163 ("mm: fix hanging shrinker management on long do_shrink_slab")
Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
Acked-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
+++
mm: Fix lost wakeup in put_shrinker()
We must wake unregister_shrinker() up on refcnt == 0.
Waking it up on a non-zero refcnt is useless.
https://pmc.acronis.com/browse/VSTOR-30477
Fixes: 311e5a449916 ("mm: fix hanging shrinker management on long do_shrink_slab")
Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
+++
mm: Reduce access frequency to shrinker_rwsem during shrink_slab
Bug https://jira.sw.ru/browse/PSBM-99181 describes the following problem:
when the kernel holds open NFS delegations and the NFS server is not
accessible at the time the NFS shrinker is called, the whole shrinker list
execution gets stuck until the NFS server is back. Being a problem in
itself, it also introduces a bigger one: during that hang shrinker_rwsem
is held as well, so no new mounts can be done at that time, because each
new superblock tries to register its own shrinker and gets stuck acquiring
shrinker_rwsem.
Commit 9e9e35d050955648449498827deb2d43be0564e1 is a workaround for that
problem. It is known that during a single shrinker execution we do not
actually need to hold shrinker_rwsem, so we release and reacquire the
rwsem for each shrinker in the list.
Because of this workaround the shrink_slab() function now experiences a
major slowdown, because shrinker_rwsem is accessed twice for each shrinker
in the list. On an idle freshly booted system the shrinker_list could be
iterated up to 1600 times a second, although originally the problem was
local to only one NFS shrinker.
This patch fixes commit 9e9e35d050955648449498827deb2d43be0564e1 so that
before calling up_read on shrinker_rwsem we check that this really is an
NFS shrinker, by checking the NFS magic in the superblock, if it is
accessible from the shrinker.
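The superblock-magic check can be sketched in userspace C (simplified
stand-ins for the kernel's struct shrinker, struct super_block and
container_of(); only the shape of the check matches the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFS_SUPER_MAGIC 0x6969  /* value from the kernel's linux/magic.h */

/* Simplified stand-ins for the kernel structures. */
struct shrinker {
	unsigned long (*scan_objects)(struct shrinker *);
};

struct super_block {
	unsigned long s_magic;
	struct shrinker s_shrink;   /* shrinker embedded in the superblock */
};

/* container_of: recover the enclosing struct from an embedded member */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static unsigned long super_cache_scan(struct shrinker *s)
{
	(void)s;
	return 0;
}

/* Only superblock shrinkers (scan_objects == super_cache_scan) are
 * actually embedded in a super_block, so check the callback before
 * trusting the container_of() result. */
static bool is_nfs_shrinker(struct shrinker *shrinker)
{
	struct super_block *sb =
		container_of(shrinker, struct super_block, s_shrink);

	if (shrinker->scan_objects == super_cache_scan)
		return sb->s_magic == NFS_SUPER_MAGIC;
	return false;
}
```

Note that computing the container_of() pointer is always safe; it is only
dereferenced once the callback comparison proves the shrinker really is a
superblock shrinker.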
Changes:
v2: Added missing 'rwsem_is_contended' check
https://jira.sw.ru/browse/PSBM-99181
Co-authored-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
Rebased to vz8: merged
- 311e5a449916 ("mm: fix hanging shrinker management on long do_shrink_slab")
- 935c73caaf36 ("memcg: shrink_slab_memcg() cleanup")
- afa788f30454 ("mm: Fix lost wakeup in put_shrinker()")
- adb31bddd253 ("mm: Reduce access frequency to shrinker_rwsem during shrink_slab") (also applied to shrink_slab_memcg)
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
v2: fixed trailing whitespace and some comments over 80 chars
fs/super.c | 2 +-
include/linux/memcontrol.h | 6 ++
include/linux/shrinker.h | 6 ++
mm/memcontrol.c | 14 +++++
mm/vmscan.c | 144 ++++++++++++++++++++++++++++++++++++++++++---
5 files changed, 163 insertions(+), 9 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index bacf015..846fd64 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,7 +77,7 @@ bool dcache_is_low(struct mem_cgroup *memcg)
* shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
-static unsigned long super_cache_scan(struct shrinker *shrink,
+unsigned long super_cache_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
struct super_block *sb;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d4d4916..a36f2d6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1513,6 +1513,9 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id);
+extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
+ int nid, int shrinker_id);
+
struct mem_cgroup *mem_cgroup_from_obj(void *p);
#else
@@ -1561,6 +1564,9 @@ static inline void memcg_put_cache_ids(void)
static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id) { }
+static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
+ int nid, int shrinker_id) { }
+
static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
return NULL;
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index f6b349f..de9b194 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -2,6 +2,9 @@
#ifndef _LINUX_SHRINKER_H
#define _LINUX_SHRINKER_H
+#include <linux/refcount.h>
+#include <linux/wait.h>
+
/*
* This struct is used to pass information from page reclaim to the shrinkers.
* We consolidate the values for easier extention later.
@@ -75,6 +78,9 @@ struct shrinker {
struct list_head list;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
+
+ refcount_t refcnt;
+ wait_queue_head_t wq;
};
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 329223b..72af766 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -442,6 +442,20 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
}
+void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
+{
+ struct memcg_shrinker_map *map;
+
+ /*
+ * The map for refcounted memcg can only be freed in
+ * memcg_free_shrinker_map_rcu so we can safely protect
+ * map with rcu_read_lock.
+ */
+ rcu_read_lock();
+ map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+ clear_bit(shrinker_id, map->map);
+ rcu_read_unlock();
+}
#else /* CONFIG_MEMCG_KMEM */
static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc1f102..f81fabf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -348,6 +348,13 @@ int prealloc_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
+ /*
+ * Shrinker is not yet visible through shrinker_idr or shrinker_list,
+ * so no locks required for initialization.
+ */
+ refcount_set(&shrinker->refcnt, 1);
+ init_waitqueue_head(&shrinker->wq);
+
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
if (prealloc_memcg_shrinker(shrinker))
goto free_deferred;
@@ -402,6 +409,13 @@ void unregister_shrinker(struct shrinker *shrinker)
{
if (!shrinker->nr_deferred)
return;
+ /*
+ * If refcnt is not zero, we need to wait for this shrinker to finish
+ * all its do_shrink_slab() calls.
+ */
+ if (!refcount_dec_and_test(&shrinker->refcnt))
+ wait_event(shrinker->wq, (refcount_read(&shrinker->refcnt) == 0));
+
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
@@ -409,6 +423,9 @@ void unregister_shrinker(struct shrinker *shrinker)
up_write(&shrinker_rwsem);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
+
+ /* Pairs with rcu_read_lock in put_shrinker() */
+ synchronize_rcu();
}
EXPORT_SYMBOL(unregister_shrinker);
@@ -534,6 +551,50 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
return freed;
}
+unsigned long super_cache_scan(struct shrinker *shrink,
+ struct shrink_control *sc);
+
+static inline bool is_nfs_shrinker(struct shrinker *shrinker)
+{
+ struct super_block *sb = container_of(shrinker,
+ struct super_block, s_shrink);
+
+ if (shrinker->scan_objects == &super_cache_scan)
+ return sb->s_magic == NFS_SUPER_MAGIC;
+
+ return false;
+}
+
+struct shrinker *get_shrinker(struct shrinker *shrinker)
+{
+ /*
+ * Pairs with the refcnt dec in unregister_shrinker(); if refcnt is
+ * zero, this shrinker is already in the middle of
+ * unregister_shrinker() and we can't use it.
+ */
+ if (!refcount_inc_not_zero(&shrinker->refcnt))
+ shrinker = NULL;
+ return shrinker;
+}
+
+void put_shrinker(struct shrinker *shrinker)
+{
+ /*
+ * If refcnt becomes zero, we already have an unregister_shrinker()
+ * waiting for us to finish. The rcu_read_lock pairs with
+ * synchronize_rcu() in unregister_shrinker(), so that the shrinker is
+ * not freed before wake_up.
+ */
+ rcu_read_lock();
+ if (refcount_dec_and_test(&shrinker->refcnt)) {
+ /* Pairs with smp_mb in wait_event()->prepare_to_wait() */
+ smp_mb();
+ if (waitqueue_active(&shrinker->wq))
+ wake_up(&shrinker->wq);
+ }
+ rcu_read_unlock();
+}
+
#ifdef CONFIG_MEMCG_KMEM
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct mem_cgroup *memcg, int priority)
@@ -560,6 +621,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
.memcg = memcg,
};
struct shrinker *shrinker;
+ bool is_nfs;
cond_resched();
@@ -570,9 +632,31 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
continue;
}
+ is_nfs = is_nfs_shrinker(shrinker);
+
+ /*
+ * Take a refcnt on a shrinker so that it can't be freed or
+ * removed from shrinker_idr (and shrinker_list). This way we
+ * can safely release shrinker_rwsem.
+ *
+ * We need to release shrinker_rwsem here as do_shrink_slab can
+ * take too much time to finish (e.g. on nfs). And holding
+ * global shrinker_rwsem can block registering and unregistering
+ * of shrinkers.
+ *
+ * The up_read logic should only be executed for nfs shrinker
+ * path, because it has proven to hang. For others it should be
+ * skipped to reduce performance penalties.
+ */
+ if (is_nfs) {
+ if (!get_shrinker(shrinker))
+ continue;
+ up_read(&shrinker_rwsem);
+ }
+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
- clear_bit(i, map->map);
+ memcg_clear_shrinker_bit(memcg, nid, i);
/*
* After the shrinker reported that it had no objects to
* free, but before we cleared the corresponding bit in
@@ -597,7 +681,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}
freed += ret;
- if (rwsem_is_contended(&shrinker_rwsem)) {
+ if (is_nfs) {
+ /*
+ * Re-acquire shrinker_rwsem, put the refcount and reload
+ * shrinker_map as it can be modified in
+ * memcg_expand_one_shrinker_map if new shrinkers
+ * were registered in the meantime.
+ */
+
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ put_shrinker(shrinker);
+ return freed;
+ }
+ put_shrinker(shrinker);
+ map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
+ true);
+ } else if (rwsem_is_contended(&shrinker_rwsem)) {
freed = freed ? : 1;
break;
}
@@ -665,19 +765,47 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
.for_drop_caches = for_drop_caches,
};
+ bool is_nfs = is_nfs_shrinker(shrinker);
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ sc.nid = 0;
+
+ /* See comment in shrink_slab_memcg. */
+ if (is_nfs) {
+ if (!get_shrinker(shrinker))
+ continue;
+ up_read(&shrinker_rwsem);
+ }
+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the regsitration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
+
+ if (is_nfs) {
+ /*
+ * We can safely continue the traversal of the
+ * shrinker_list, as our shrinker is still on the list
+ * due to the refcount.
+ */
+
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ put_shrinker(shrinker);
+ goto out;
+ }
+ put_shrinker(shrinker);
+ } else if (rwsem_is_contended(&shrinker_rwsem)) {
+ /*
+ * Bail out if someone wants to register a new shrinker
+ * to prevent the registration from being stalled for
+ * long periods by parallel ongoing shrinking.
+ */
+
freed = freed ? : 1;
break;
}
+
}
up_read(&shrinker_rwsem);
--
1.8.3.1