[Devel] [PATCH RHEL7 COMMIT] ms/mm: memcg: fix race condition between memcg teardown and swapin

Konstantin Khorenko khorenko at virtuozzo.com
Fri Mar 31 07:35:03 PDT 2017


The commit is pushed to "branch-rh7-3.10.0-514.10.2.vz7.29.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.10.2.vz7.29.9
------>
commit 0e48b7d95ff8ad6875eeb44523ecdb74971865d0
Author: Johannes Weiner <hannes at cmpxchg.org>
Date:   Fri Mar 31 18:35:03 2017 +0400

    ms/mm: memcg: fix race condition between memcg teardown and swapin
    
    ms commit: 96f1c58 ("mm: memcg: fix race condition between memcg teardown and
    swapin")
    
    There is a race condition between a memcg being torn down and a swapin
    triggered from a different memcg of a page that was recorded to belong
    to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).  The
    result is unreclaimable pages pointing to dead memcgs, which can lead to
    anything from endless loops in later memcg teardown (the page is charged
    to all hierarchical parents but is not on any LRU list) or crashes from
    following the dangling memcg pointer.
    
    Memcgs with tasks in them can not be torn down and usually charges don't
    show up in memcgs without tasks.  Swapin with the CONFIG_MEMCG_SWAP
    extension is the notable exception because it charges the cgroup that
    was recorded as owner during swapout, which may be empty and in the
    process of being torn down when a task in another memcg triggers the
    swapin:
    
      teardown:                 swapin:
    
                                lookup_swap_cgroup_id()
                                rcu_read_lock()
                                mem_cgroup_lookup()
                                css_tryget()
                                rcu_read_unlock()
      disable css_tryget()
      call_rcu()
        offline_css()
          reparent_charges()
                                res_counter_charge() (hierarchical!)
                                css_put()
                                  css_free()
                                pc->mem_cgroup = dead memcg
                                add page to dead lru
    
    Add a final reparenting step into css_free() to make sure any such raced
    charges are moved out of the memcg before it's finally freed.
    
    In the longer term it would be cleaner to have the css_tryget() and the
    res_counter charge under the same RCU lock section so that the charge
    reparenting is deferred until the last charge whose tryget succeeded is
    visible.  But this will require more invasive changes that will be
    harder to evaluate and backport into stable, so better defer them to a
    separate change set.
    
    Signed-off-by: Johannes Weiner <hannes at cmpxchg.org>
    Acked-by: Michal Hocko <mhocko at suse.cz>
    Cc: David Rientjes <rientjes at google.com>
    Cc: <stable at kernel.org>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    https://jira.sw.ru/browse/PSBM-62080
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
---
 mm/memcontrol.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95b1c0c..74faf07 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6073,6 +6073,43 @@ static void mem_cgroup_css_free(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	/*
+	 * XXX: css_offline() would be where we should reparent all
+	 * memory to prepare the cgroup for destruction.  However,
+	 * memcg does not do css_tryget() and res_counter charging
+	 * under the same RCU lock region, which means that charging
+	 * could race with offlining.  Offlining only happens to
+	 * cgroups with no tasks in them but charges can show up
+	 * without any tasks from the swapin path when the target
+	 * memcg is looked up from the swapout record and not from the
+	 * current task as it usually is.  A race like this can leak
+	 * charges and put pages with stale cgroup pointers into
+	 * circulation:
+	 *
+	 * #0                        #1
+	 *                           lookup_swap_cgroup_id()
+	 *                           rcu_read_lock()
+	 *                           mem_cgroup_lookup()
+	 *                           css_tryget()
+	 *                           rcu_read_unlock()
+	 * disable css_tryget()
+	 * call_rcu()
+	 *   offline_css()
+	 *     reparent_charges()
+	 *                           res_counter_charge()
+	 *                           css_put()
+	 *                             css_free()
+	 *                           pc->mem_cgroup = dead memcg
+	 *                           add page to lru
+	 *
+	 * The bulk of the charges are still moved in offline_css() to
+	 * avoid pinning a lot of pages in case a long-term reference
+	 * like a swapout record is deferring the css_free() to long
+	 * after offlining.  But this makes sure we catch any charges
+	 * made after offlining:
+	 */
+	mem_cgroup_reparent_charges(memcg);
+
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
 }


More information about the Devel mailing list