[Devel] [PATCH RHEL7 COMMIT] mm: vmscan: fix do_try_to_free_pages() livelock

Vladimir Davydov vdavydov at odin.com
Mon Sep 7 03:15:45 PDT 2015


The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.7
------>
commit a6a2f003599a2729fd5f5d9fabcfc979b29b1b3b
Author: Lisa Du <cldu at marvell.com>
Date:   Mon Sep 7 14:15:45 2015 +0400

    mm: vmscan: fix do_try_to_free_pages() livelock
    
    This patch is based on KOSAKI's work, to which I have added a little
    more description; please refer to https://lkml.org/lkml/2012/6/14/74.
    
    I found that the system can enter a state where a zone has lots of free
    pages, but only order-0 and order-1 ones, i.e. the zone is heavily
    fragmented.  A high-order allocation can then stall the direct reclaim
    path for a long time (e.g. 60 seconds), especially on a system with no
    swap and no compaction.  I hit this problem on v3.4, but the issue still
    exists in the current tree.  The reason is that do_try_to_free_pages()
    enters a livelock:
    
    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced: it sees little point in trying all over again, and
    wants to avoid an infinite loop.  Instead it drops the reclaim order
    from high-order to order-0, because it considers order-0 the most
    important (see commit 73ce02e9 for details).  If the watermarks are OK,
    kswapd goes back to sleep and may leave zone->all_unreclaimable = 0,
    assuming that high-order users can still perform direct reclaim if they
    wish.
    
    Direct reclaim keeps reclaiming for a high order that is not a
    COSTLY_ORDER, without invoking the OOM killer, until kswapd turns on
    zone->all_unreclaimable.  This is done to avoid a premature OOM kill,
    which means direct reclaim depends on kswapd to break out of this loop.
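
    For illustration, the pre-patch exit path of do_try_to_free_pages()
    looks roughly like this (a simplified paraphrase, not a literal quote):
    as long as some zone still has all_unreclaimable == 0, the function
    reports "progress", so the allocator keeps retrying non-costly orders
    instead of falling back to the OOM killer:

        /* tail of do_try_to_free_pages(), before this patch (simplified) */
        if (sc->nr_reclaimed)
                return sc->nr_reclaimed;

        /* top priority shrink_zones still had more to do? don't OOM, then */
        if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
                return 1;       /* allocator sees progress and retries */

        return 0;

    all_unreclaimable() only returns true once every eligible zone has
    zone->all_unreclaimable set, and only kswapd ever sets that flag.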
    
    In the worst case, with kswapd asleep forever, direct reclaim may keep
    reclaiming pages forever, until something like a watchdog detects it
    and finally kills the process.  As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737
    
    We can't turn on zone->all_unreclaimable from the direct reclaim path,
    because direct reclaim doesn't take any lock, so setting the flag there
    would be racy.  Thus this patch removes the zone->all_unreclaimable
    field completely and recalculates the zone's reclaimable state every
    time it is needed.
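
    Concretely, every former zone->all_unreclaimable test becomes a call to
    zone_reclaimable(), which the patch exports from mm/vmscan.c (see the
    hunk below).  A zone is treated as unreclaimable once it has been
    scanned six times the number of its reclaimable pages without success:

        /* mm/vmscan.c, as changed by this patch */
        bool zone_reclaimable(struct zone *zone)
        {
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }

    Since zone->pages_scanned is reset whenever pages are freed back to the
    zone (see the mm/page_alloc.c hunks), this state is recomputed on the
    fly instead of being latched in a flag.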
    
    Note: we can't go with the idea of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy; commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
    
    [akpm at linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar <aaditya.kumar.30 at gmail.com>
    Cc: Ying Han <yinghan at google.com>
    Cc: Nick Piggin <npiggin at gmail.com>
    Acked-by: Rik van Riel <riel at redhat.com>
    Cc: Mel Gorman <mel at csn.ul.ie>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
    Cc: Christoph Lameter <cl at linux.com>
    Cc: Bob Liu <lliubbo at gmail.com>
    Cc: Neil Zhang <zhangwm at marvell.com>
    Cc: Russell King - ARM Linux <linux at arm.linux.org.uk>
    Reviewed-by: Michal Hocko <mhocko at suse.cz>
    Acked-by: Minchan Kim <minchan at kernel.org>
    Acked-by: Johannes Weiner <hannes at cmpxchg.org>
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro at jp.fujitsu.com>
    Signed-off-by: Lisa Du <cldu at marvell.com>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    (cherry picked from commit 6e543d5780e36ff5ee56c44d7e2e30db3457a7ed)
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    Link: https://jira.sw.ru/browse/PSBM-35155
    Reviewed-by: Kirill Tkhai <ktkhai at odin.com>
    
    Conflicts:
    	include/linux/vmstat.h
    	mm/page-writeback.c
    	mm/vmscan.c
    	mm/vmstat.c
---
 include/linux/mmzone.h |  1 -
 mm/internal.h          |  1 +
 mm/migrate.c           |  2 +-
 mm/page_alloc.c        |  4 +---
 mm/vmscan.c            | 28 ++++++++++++----------------
 mm/vmstat.c            |  4 +++-
 6 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b42e3329d30..abe7110d8fbe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -361,7 +361,6 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
diff --git a/mm/internal.h b/mm/internal.h
index 0d1d496cc475..2524d31e058c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -88,6 +88,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
+extern bool zone_reclaimable(struct zone *zone);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/migrate.c b/mm/migrate.c
index 3822ed7398cf..1002661f6ceb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1530,7 +1530,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone->all_unreclaimable)
+		if (!zone_reclaimable(zone))
 			continue;
 
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce719f8cdc0d..0dbb6e95061e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -663,7 +663,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int to_free = count;
 
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
@@ -712,7 +711,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	__free_one_page(page, zone, order, migratetype);
@@ -3227,7 +3225,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			zone->pages_scanned,
-			(zone->all_unreclaimable ? "yes" : "no")
+			(!zone_reclaimable(zone) ? "yes" : "no")
 			);
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0f4e2948b30..fd56b73ea286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -165,7 +165,7 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
 	return nr;
 }
 
-static bool zone_reclaimable(struct zone *zone)
+bool zone_reclaimable(struct zone *zone)
 {
 	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
 }
@@ -1972,7 +1972,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && zone->all_unreclaimable)
+	if (current_is_kswapd() && !zone_reclaimable(zone))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -2498,9 +2498,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-
-			if (zone->all_unreclaimable &&
-					sc->priority != DEF_PRIORITY)
+			if (sc->priority != DEF_PRIORITY &&
+			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
 			if (IS_ENABLED(CONFIG_COMPACTION)) {
 				/*
@@ -2554,7 +2553,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		if (zone_reclaimable(zone))
 			return false;
 	}
 
@@ -2980,7 +2979,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (zone->all_unreclaimable) {
+		if (!zone_reclaimable(zone)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -3082,9 +3081,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	/* Account for the number of pages attempted to reclaim */
 	*nr_attempted += sc->nr_to_reclaim;
 
-	if (!zone_reclaimable(zone))
-		zone->all_unreclaimable = 1;
-
 	zone_clear_flag(zone, ZONE_WRITEBACK);
 
 	/*
@@ -3093,7 +3089,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (!zone->all_unreclaimable &&
+	if (zone_reclaimable(zone) &&
 	    zone_balanced(zone, testorder, 0, classzone_idx)) {
 		zone_clear_flag(zone, ZONE_CONGESTED);
 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
@@ -3160,8 +3156,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 			zone_update_force_scan(zone);
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
+			if (sc.priority != DEF_PRIORITY &&
+			    !zone_reclaimable(zone))
 				continue;
 
 			/*
@@ -3237,8 +3233,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
+			if (sc.priority != DEF_PRIORITY &&
+			    !zone_reclaimable(zone))
 				continue;
 
 			sc.nr_scanned = 0;
@@ -3759,7 +3755,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (zone->all_unreclaimable)
+	if (!zone_reclaimable(zone))
 		return ZONE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f195c0cbf6ee..4512a65c9430 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -21,6 +21,8 @@
 #include <linux/writeback.h>
 #include <linux/compaction.h>
 
+#include "internal.h"
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
@@ -1062,7 +1064,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u"
 		   "\n  force_scan:        %d",
-		   zone->all_unreclaimable,
+		   !zone_reclaimable(zone),
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio,
 		   zone->force_scan);


