[Devel] [PATCH rh7 v5 0/9] mm/mem_cgroup_iter: Reduce the number of iterator restarts upon cgroup removals

Fri Feb 26 17:25:56 MSK 2021

Many thanks to Kirill Tkhai for his bright ideas and review!

Problem description from the user's point of view:
  * the Node is slow
  * the Node has a lot of free RAM
  * the Node has a lot of swapin/swapout
  * kswapd is always running

Problem in a nutshell from a technical point of view:
  * kswapd is looping in shrink_zone() inside the loop
      do {} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
    (and never goes through the outer loop)
  * there are quite a number of memory cgroups on the Node (~1000)
  * some cgroups are hard to reclaim (reclaim may take ~3 seconds),
    this is because of a very busy disk due to permanent swapin/swapout
  * mem_cgroup_iter() does not succeed in scanning all cgroups
    in a row, it restarts from the root cgroup time after
    time (after a different number of cgroups scanned each time)

Q: Why does mem_cgroup_iter() restart from the root memcg?
A: Because it is invalidated once some memory cgroup is
   destroyed on the Node.
   Note: destruction of ANY memory cgroup on the Node leads to
   an iter restart.

This patchset solves the problem in the following way:
there is no need to restart the iter unless the position saved in
the iter is exactly the memory cgroup being destroyed.

The patchset ensures that iter->last_visited is NULL-ified on
invalidation, and thus the iter restarts only in the unlikely case
when it points to the memcg being destroyed.
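
The difference between the two invalidation policies can be sketched
in userspace C. This is only an illustrative model, not the code from
the patches: the struct layouts and the helpers iter_load_old(),
iter_invalidate_new() and iter_load_new() are hypothetical names made
up for this sketch, and the RCU protection and css reference counting
of the real code are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical userspace model; the names only mirror the kernel ones. */
struct memcg {
	int id;
};

struct reclaim_iter {
	struct memcg *last_visited;	/* cached position, NULL = start from root */
	int last_dead_count;		/* consulted by the old policy only */
};

/*
 * Old policy: every cgroup removal bumps a hierarchy-wide dead_count,
 * so any removal makes every cached position look stale.
 */
static struct memcg *iter_load_old(struct reclaim_iter *it, int dead_count)
{
	if (it->last_dead_count != dead_count)
		return NULL;		/* restart from the root memcg */
	return it->last_visited;
}

/*
 * New policy: the invalidation function is told which memcg is dying
 * and wipes the cached position only if it is exactly that memcg.
 */
static void iter_invalidate_new(struct reclaim_iter *it, struct memcg *dying)
{
	if (it->last_visited == dying)
		it->last_visited = NULL;
}

/* New policy: the cached position is used as is, NULL means restart. */
static struct memcg *iter_load_new(struct reclaim_iter *it)
{
	return it->last_visited;
}
```

In this model, removing an unrelated cgroup forces a restart under
the old policy (the dead_count values differ) but leaves last_visited
untouched under the new one, so the walk continues from where it
stopped.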

Testing: I've tested this patchset using a modified kernel which breaks
the memcg iterator in case of global reclaim with a probability of 2%.

Three kernels have been tested: "release", KASAN-only and "debug".
Each ran for 12 hours with no issues; from 12000 to 26000 races were
caught during this period (i.e. a dying memcg was found in some iterator
and wiped).

The testing scenario is documented in the jira issue.

https://jira.sw.ru/browse/PSBM-123655

v2 changes:
 - reverted 2 patches in this code which were focused on synchronizing
   updates of iter->last_visited and ->last_dead_count
   (as we are getting rid of iter->last_dead_count altogether)
 - use rcu primitives to access iter->last_visited

v3 changes:
 - more comments explaining the locking scheme
 - use rcu_read_{lock,unlock}_sched() in mem_cgroup_iter()
   for synchronization with the iterator invalidation func
 - do not wrap the iterator invalidation func in
   rcu_read_{lock,unlock}() as it protects nothing

v4 changes:
 - extended the comment explaining why the iter invalidation function
   must see all pointers to the dying memcg and why no pointer to it
   can be written later

v5 changes:
 - dropping of barriers (rmb/wmb) is moved from patch [4/9] to
   "[PATCH rh7 v4 7/9] mm/mem_cgroup_iter: Don't bother checking
   'dead_count' anymore", which keeps the code valid between the 4th
   and 7th patches.
 - no other changes, the resulting trees after applying v4 and v5 are
   identical

Konstantin Khorenko (9):
  Revert "mm/memcg: fix css_tryget(),css_put() imbalance"
  Revert "mm/memcg: use seqlock to protect reclaim_iter updates"
  mm/mem_cgroup_iter: Make 'iter->last_visited' a bit more stable
  mm/mem_cgroup_iter: Always assign iter->last_visited under rcu
  mm/mem_cgroup_iter: Provide _iter_invalidate() the dying memcg as an
    argument
  mm/mem_cgroup_iter: NULL-ify 'last_visited' for invalidated iterators
  mm/mem_cgroup_iter: Don't bother saving/checking 'dead_count' anymore
  mm/mem_cgroup_iter: Cleanup mem_cgroup_iter_load()
  mm/mem_cgroup_iter: Drop dead_count related infrastructure

 mm/memcontrol.c | 208 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 150 insertions(+), 58 deletions(-)

-- 
2.24.3