[Devel] [PATCH RH7] mm: fix hanging shrinker management on long do_shrink_slab
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Fri Nov 29 16:52:19 MSK 2019
We have a problem where shrinker_rwsem can be held for read for a long
time, while many processes trying to register or unregister shrinkers
need it for write - all such processes hang like this:
crash> bt ffffa0117d226000
PID: 45299 TASK: ffffa0117d226000 CPU: 16 COMMAND: "vzctl"
#0 [ffff9ffe8b793b28] __schedule at ffffffffaf799a12
#1 [ffff9ffe8b793bb0] schedule at ffffffffaf799ed9
#2 [ffff9ffe8b793bc0] rwsem_down_write_failed at ffffffffaf79b815
#3 [ffff9ffe8b793c58] call_rwsem_down_write_failed at ffffffffaf3ae807
#4 [ffff9ffe8b793ca0] down_write at ffffffffaf7991ed
#5 [ffff9ffe8b793cb8] register_shrinker at ffffffffaf1ddc8d
#6 [ffff9ffe8b793cd8] sget_userns at ffffffffaf264329
#7 [ffff9ffe8b793d30] sget at ffffffffaf2643fd
#8 [ffff9ffe8b793d70] mount_bdev at ffffffffaf26451a
#9 [ffff9ffe8b793de0] ext4_mount at ffffffffc07d9f94 [ext4]
#10 [ffff9ffe8b793e10] mount_fs at ffffffffaf2652de
#11 [ffff9ffe8b793e58] vfs_kern_mount at ffffffffaf283a27
#12 [ffff9ffe8b793e90] do_mount at ffffffffaf2863e9
#13 [ffff9ffe8b793f18] sys_mount at ffffffffaf287263
#14 [ffff9ffe8b793f50] system_call_fastpath at ffffffffaf7a6edb
RIP: 00007fb4d1d3978a RSP: 00007fff835a1f60 RFLAGS: 00010206
RAX: 00000000000000a5 RBX: 00007fff835a43f0 RCX: 00007fb4d2ff2820
RDX: 00007fb4d205d4b7 RSI: 0000558e62ddf520 RDI: 00007fff835a4290
RBP: 00007fb4d205d4b7 R8: 00007fff835a2ba0 R9: 00007fb4d1c8820d
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fff835a2ba0
R13: 00007fff835a4290 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
This means that mounting anything new is currently impossible, as
register_shrinker can't take the lock; the same applies to umounting.
The holder of shrinker_rwsem is hanging in do_shrink_slab:
crash> bt ffffa02ebab7c000
PID: 977061 TASK: ffffa02ebab7c000 CPU: 30 COMMAND: "crond"
#0 [ffffa02a7b2b74e8] __schedule at ffffffffaf799a12
#1 [ffffa02a7b2b7570] schedule at ffffffffaf799ed9
#2 [ffffa02a7b2b7580] rpc_wait_bit_killable at ffffffffc02d12f1 [sunrpc]
#3 [ffffa02a7b2b7598] __wait_on_bit at ffffffffaf797b37
#4 [ffffa02a7b2b75d8] out_of_line_wait_on_bit at ffffffffaf797ca1
#5 [ffffa02a7b2b7650] __rpc_wait_for_completion_task at ffffffffc02d12dd [sunrpc]
#6 [ffffa02a7b2b7660] _nfs4_proc_delegreturn at ffffffffc0bcabfa [nfsv4]
#7 [ffffa02a7b2b7710] nfs4_proc_delegreturn at ffffffffc0bd0b31 [nfsv4]
#8 [ffffa02a7b2b7788] nfs_do_return_delegation at ffffffffc0be4699 [nfsv4]
#9 [ffffa02a7b2b77a8] nfs_inode_return_delegation_noreclaim at ffffffffc0be53c7 [nfsv4]
#10 [ffffa02a7b2b77c0] nfs4_evict_inode at ffffffffc0be3a19 [nfsv4]
#11 [ffffa02a7b2b77d8] evict at ffffffffaf27e7b4
#12 [ffffa02a7b2b7800] dispose_list at ffffffffaf27e8be
#13 [ffffa02a7b2b7828] prune_icache_sb at ffffffffaf27f9e8
#14 [ffffa02a7b2b7880] super_cache_scan at ffffffffaf264ae7
#15 [ffffa02a7b2b78c0] do_shrink_slab at ffffffffaf1dd443
#16 [ffffa02a7b2b7938] shrink_slab at ffffffffaf1ddef4
#17 [ffffa02a7b2b79b8] shrink_zone at ffffffffaf1e1451
#18 [ffffa02a7b2b7a60] do_try_to_free_pages at ffffffffaf1e2790
#19 [ffffa02a7b2b7b08] try_to_free_mem_cgroup_pages at ffffffffaf1e2e3e
#20 [ffffa02a7b2b7ba8] try_charge at ffffffffc0a33e8c [kpatch_cumulative_89_2_r1]
#21 [ffffa02a7b2b7c70] memcg_charge_kmem at ffffffffaf251611
#22 [ffffa02a7b2b7c90] __memcg_kmem_newpage_charge at ffffffffaf2519c4
#23 [ffffa02a7b2b7cb8] __alloc_pages_nodemask at ffffffffaf1d4696
#24 [ffffa02a7b2b7d70] alloc_pages_current at ffffffffaf226fc8
#25 [ffffa02a7b2b7db8] pte_alloc_one at ffffffffaf078d17
#26 [ffffa02a7b2b7dd0] __pte_alloc at ffffffffaf1fe683
#27 [ffffa02a7b2b7e08] handle_mm_fault at ffffffffaf204e59
#28 [ffffa02a7b2b7eb0] __do_page_fault at ffffffffaf7a16e3
#29 [ffffa02a7b2b7f20] do_page_fault at ffffffffaf7a1a05
#30 [ffffa02a7b2b7f50] page_fault at ffffffffaf79d768
RIP: 00007f1e54913fa0 RSP: 00007ffe50a21898 RFLAGS: 00010206
RAX: 00007f1e50bce1f0 RBX: 000000037ffff1a0 RCX: 00007f1e549226a7
RDX: 0000000000000002 RSI: 0000557241e48a20 RDI: 0000557241e489e0
RBP: 00007ffe50a21a00 R8: 00007f1e50bce000 R9: 0000000070000021
R10: 0000000000000031 R11: 0000000000000246 R12: 0000557241e489e0
R13: 00007ffe50a21ae0 R14: 0000000000000003 R15: 000000006ffffeff
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
It takes shrinker_rwsem in shrink_slab while traversing shrinker_list.
It tries to shrink something on nfs (a hard mount), but the nfs server
is already dead at this point and the rpc will never succeed. Generally,
any shrinker can spend significant time in do_shrink_slab, so it's a bad
idea to hold the list lock here.
We have a similar problem in shrink_slab_memcg, except that there we
traverse shrinker_map+shrinker_idr instead.
The idea of the patch is to take a reference on the chosen shrinker so
that it can't disappear, and to release shrinker_rwsem while we are in
do_shrink_slab; after processing we reacquire shrinker_rwsem, drop the
reference and continue the traversal.
We also need a wait_queue so that unregister_shrinker can wait for the
refcnt to drop to zero. Only after that can we safely remove the
shrinker from the list and idr, and free it.
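The refcount/wait_queue scheme can be sketched in userspace terms. In this hypothetical analogue (not the kernel code), a C11 atomic stands in for refcount_t and a pthread condition variable for the wait_queue; it deliberately assumes the caller keeps the object alive across put_shrinker(), sidestepping the put-vs-free race that the actual patch closes with synchronize_rcu():

```c
#include <pthread.h>
#include <stdatomic.h>

/*
 * Userspace analogue of the shrinker lifetime rules: the register side
 * holds one initial reference, walkers pin/unpin around the slow
 * callback, unregister drops the initial reference and waits for the
 * count to drain to zero.
 */
struct shrinker {
	atomic_int refcnt;        /* stands in for refcount_t */
	pthread_mutex_t lock;     /* protects the condvar below */
	pthread_cond_t drained;   /* signalled when refcnt hits zero */
};

/* Walker side: pin the shrinker before dropping the list lock. */
struct shrinker *get_shrinker(struct shrinker *s)
{
	int old = atomic_load(&s->refcnt);

	/* inc_not_zero: fail if unregister already dropped the last ref */
	do {
		if (old == 0)
			return NULL;
	} while (!atomic_compare_exchange_weak(&s->refcnt, &old, old + 1));
	return s;
}

void put_shrinker(struct shrinker *s)
{
	if (atomic_fetch_sub(&s->refcnt, 1) == 1) {
		/* last reference gone: an unregister may be waiting */
		pthread_mutex_lock(&s->lock);
		pthread_cond_signal(&s->drained);
		pthread_mutex_unlock(&s->lock);
	}
}

/* Unregister side: drop the initial ref, then wait out the walkers. */
void unregister_shrinker(struct shrinker *s)
{
	if (atomic_fetch_sub(&s->refcnt, 1) > 1) {
		pthread_mutex_lock(&s->lock);
		while (atomic_load(&s->refcnt) != 0)
			pthread_cond_wait(&s->drained, &s->lock);
		pthread_mutex_unlock(&s->lock);
	}
	/* now safe to unlink the shrinker and free it */
}
```

The key invariant is that the count can only reach zero once unregister_shrinker has started, because the register side holds one reference for the whole lifetime of the shrinker.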
I've also managed to reproduce the nfs hang in do_shrink_slab with the
patch applied; all other mount/umount operations and all other CT
start/stop cycles pass without any hang. See more info on the
reproduction in jira.
https://jira.sw.ru/browse/PSBM-99181
v2: protect wake_up in put_shrinker from freed shrinker via rcu
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
include/linux/memcontrol.h | 4 ++
include/linux/shrinker.h | 6 +++
mm/memcontrol.c | 15 ++++++
mm/vmscan.c | 93 +++++++++++++++++++++++++++++++++++++-
4 files changed, 117 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c70279b2616b..c1e50faf00e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -704,6 +704,8 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
extern int memcg_expand_shrinker_maps(int new_id);
extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id);
+extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
+ int nid, int shrinker_id);
extern struct memcg_shrinker_map *memcg_nid_shrinker_map(struct mem_cgroup *memcg, int nid);
#else
@@ -772,6 +774,8 @@ static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
}
static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id) { }
+static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
+ int nid, int shrinker_id) { }
#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index f6938dc6c068..2eea35772794 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -1,6 +1,9 @@
#ifndef _LINUX_SHRINKER_H
#define _LINUX_SHRINKER_H
+#include <linux/refcount.h>
+#include <linux/wait.h>
+
/*
* This struct is used to pass information from page reclaim to the shrinkers.
* We consolidate the values for easier extention later.
@@ -68,6 +71,9 @@ struct shrinker {
struct list_head list;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
+
+ refcount_t refcnt;
+ wait_queue_head_t wq;
};
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f988271d111..077361eb399b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -846,6 +846,21 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
}
+void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
+{
+ struct memcg_shrinker_map *map;
+
+ /*
+ * The map for refcounted memcg can only be freed in
+ * memcg_free_shrinker_map_rcu so we can safely protect
+ * map with rcu_read_lock.
+ */
+ rcu_read_lock();
+ map = rcu_dereference(memcg->info.nodeinfo[nid]->shrinker_map);
+ clear_bit(shrinker_id, map->map);
+ rcu_read_unlock();
+}
+
struct memcg_shrinker_map *memcg_nid_shrinker_map(struct mem_cgroup *memcg, int nid)
{
return rcu_dereference_protected(memcg->info.nodeinfo[nid]->shrinker_map,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2e465cd87f4..de7e51584f6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -303,6 +303,13 @@ int register_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
+ /*
+ * Shrinker is not yet visible through shrinker_idr or shrinker_list,
+ * so no locks required for initialization.
+ */
+ refcount_set(&shrinker->refcnt, 1);
+ init_waitqueue_head(&shrinker->wq);
+
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
if (register_memcg_shrinker(shrinker))
goto free_deferred;
@@ -328,6 +335,13 @@ void unregister_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return;
+ /*
+ * If refcnt is not zero, we need to wait for this shrinker to finish
+ * all of its do_shrink_slab() calls.
+ */
+ if (!refcount_dec_and_test(&shrinker->refcnt))
+ wait_event(shrinker->wq, (refcount_read(&shrinker->refcnt) == 0));
+
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
@@ -336,6 +350,9 @@ void unregister_shrinker(struct shrinker *shrinker)
up_write(&shrinker_rwsem);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
+
+ /* Pairs with rcu_read_lock in put_shrinker() */
+ synchronize_rcu();
}
EXPORT_SYMBOL(unregister_shrinker);
@@ -435,6 +452,36 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
return freed;
}
+struct shrinker *get_shrinker(struct shrinker *shrinker)
+{
+ /*
+ * Pairs with the refcnt dec in unregister_shrinker(); if refcnt is zero,
+ * this shrinker is already in the middle of unregister_shrinker() and
+ * we can't use it.
+ */
+ if (!refcount_inc_not_zero(&shrinker->refcnt))
+ shrinker = NULL;
+ return shrinker;
+}
+
+void put_shrinker(struct shrinker *shrinker)
+{
+ /*
+ * If refcnt becomes zero, we already have an unregister_shrinker()
+ * waiting for us to finish. The rcu_read_lock pairs with
+ * synchronize_rcu() in unregister_shrinker(), so that the shrinker is
+ * not freed before wake_up.
+ */
+ rcu_read_lock();
+ if (refcount_dec_and_test(&shrinker->refcnt)) {
+ /* Pairs with smp_mb in wait_event()->prepare_to_wait() */
+ smp_mb();
+ if (waitqueue_active(&shrinker->wq))
+ wake_up(&shrinker->wq);
+ }
+ rcu_read_unlock();
+}
+
#ifdef CONFIG_MEMCG_KMEM
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct mem_cgroup *memcg, int priority)
@@ -468,9 +515,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
continue;
}
+ /*
+ * Take a refcnt on a shrinker so that it can't be freed or
+ * removed from shrinker_idr (and shrinker_list). This way we
+ * can safely release shrinker_rwsem.
+ *
+ * We need to release shrinker_rwsem here as do_shrink_slab can
+ * take too much time to finish (e.g. on nfs). And holding
+ * global shrinker_rwsem can block registering and unregistering
+ * of shrinkers.
+ */
+ if (!get_shrinker(shrinker))
+ continue;
+ up_read(&shrinker_rwsem);
+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
- clear_bit(i, map->map);
+ memcg_clear_shrinker_bit(memcg, nid, i);
/*
* After the shrinker reported that it had no objects to
* free, but before we cleared the corresponding bit in
@@ -495,6 +556,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}
freed += ret;
+ /*
+ * Re-acquire shrinker_rwsem, put the refcount and reload
+ * shrinker_map as it can be modified in
+ * memcg_expand_one_shrinker_map if new shrinkers
+ * were registered in the meanwhile.
+ */
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ put_shrinker(shrinker);
+ return freed;
+ }
+ put_shrinker(shrinker);
+ map = memcg_nid_shrinker_map(memcg, nid);
+
if (rwsem_is_contended(&shrinker_rwsem)) {
freed = freed ? : 1;
break;
@@ -577,10 +652,26 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;
+ /* See comment in shrink_slab_memcg. */
+ if (!get_shrinker(shrinker))
+ continue;
+ up_read(&shrinker_rwsem);
+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
+
+ /*
+ * We can safely continue traversal of the shrinker_list as
+ * our shrinker is still on the list due to refcount.
+ */
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ put_shrinker(shrinker);
+ goto out;
+ }
+ put_shrinker(shrinker);
}
up_read(&shrinker_rwsem);
--
2.21.0