[Devel] [PATCH RHEL8 COMMIT] ms/mm: vmscan: move file exhaustion detection to the node level

Thu Apr 2 16:03:03 MSK 2020

The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.4
------>
commit 48de864e1fb2eed8d1e9ef05f517030862ad88a2
Author: Johannes Weiner <hannes at cmpxchg.org>
Date:   Thu Apr 2 16:03:02 2020 +0300

    ms/mm: vmscan: move file exhaustion detection to the node level
    
    Patch series "mm: fix page aging across multiple cgroups".
    
    When applications are put into unconfigured cgroups for memory accounting
    purposes, the cgrouping itself should not change the behavior of the page
    reclaim code.  We expect the VM to reclaim the coldest pages in the
    system.  But right now the VM can reclaim hot pages in one cgroup while
    there is eligible cold cache in others.
    
    This is because one part of the reclaim algorithm isn't truly cgroup
    hierarchy aware: the inactive/active list balancing.  That is the part
    that is supposed to protect hot cache data from one-off streaming IO.
    
    The recursive cgroup reclaim scheme will scan and rotate the physical LRU
    lists of each eligible cgroup at the same rate in a round-robin fashion,
    thereby establishing a relative order among the pages of all those
    cgroups.  However, the inactive/active balancing decisions are made
    locally within each cgroup, so when a cgroup is running low on cold pages,
    its hot pages will get reclaimed - even when sibling cgroups have plenty
    of cold cache eligible in the same reclaim run.
    
    For example:
    
       [root at ham ~]# head -n1 /proc/meminfo
       MemTotal:        1016336 kB
    
       [root at ham ~]# ./reclaimtest2.sh
       Establishing 50M active files in cgroup A...
       Hot pages cached: 12800/12800 workingset-a
       Linearly scanning through 18G of file data in cgroup B:
       real    0m4.269s
       user    0m0.051s
       sys     0m4.182s
       Hot pages cached: 134/12800 workingset-a
    
    The streaming IO in B, which doesn't benefit from caching at all, pushes
    out most of the workingset in A.
    
    Solution
    
    This series fixes the problem by elevating inactive/active balancing
    decisions to the toplevel of the reclaim run.  This is either a cgroup
    that hit its limit, or straight-up global reclaim if there is physical
    memory pressure.  From there, it takes a recursive view of the cgroup
    subtree to decide whether page deactivation is necessary.
    
    In the test above, the VM will then recognize that cgroup B has plenty of
    eligible cold cache, and that the hot pages in A can be spared:
    
       [root at ham ~]# ./reclaimtest2.sh
       Establishing 50M active files in cgroup A...
       Hot pages cached: 12800/12800 workingset-a
       Linearly scanning through 18G of file data in cgroup B:
       real    0m4.244s
       user    0m0.064s
       sys     0m4.177s
       Hot pages cached: 12800/12800 workingset-a
    
    Implementation
    
    Whether active pages can be deactivated or not is influenced by two
    factors: the inactive list dropping below a minimum size relative to the
    active list, and the occurrence of refaults.
    
    This patch series first moves refault detection to the reclaim root, then
    enforces the minimum inactive size based on a recursive view of the cgroup
    tree's LRUs.
    
    History
    
    Note that this actually never worked correctly in Linux cgroups.  In the
    past it worked for global reclaim and leaf limit reclaim only (we used to
    have two physical LRU linkages per page), but it never worked for
    intermediate limit reclaim over multiple leaf cgroups.
    
    We're noticing this now because 1) we're putting everything into cgroups
    for accounting, not just the things we want to control and 2) we're moving
    away from leaf limits that invoke reclaim on individual cgroups, toward
    large tree reclaim, triggered by high-level limits, or physical memory
    pressure that is influenced by local protections such as memory.low and
    memory.min instead.
    
    This patch (of 3):
    
    When file pages are lower than the watermark on a node, we try to force
    scan anonymous pages to counteract the balancing algorithm's preference
    for new file pages when they are likely thrashing.  This is a node-level
    decision, but it's currently made each time we look at an lruvec.  This is
    unnecessarily expensive and also a layering violation that makes the code
    harder to understand.
    
    Clean this up by making the check once per node and setting a flag in the
    scan_control.
    
    Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes at cmpxchg.org>
    Reviewed-by: Shakeel Butt <shakeelb at google.com>
    Reviewed-by: Suren Baghdasaryan <surenb at google.com>
    Cc: Andrey Ryabinin <aryabinin at virtuozzo.com>
    Cc: Michal Hocko <mhocko at suse.com>
    Cc: Rik van Riel <riel at surriel.com>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    (cherry picked from commit 53138cea7f398d2cdd0fa22adeec7e16093e1ebd)
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
---
 mm/vmscan.c | 80 ++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 42 insertions(+), 38 deletions(-)
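
Note for readers (not part of the commit): the restructuring described in
"This patch (of 3)" can be modelled in userspace roughly as follows. This is
only a sketch under stated assumptions: the *_model types, MODEL_MAX_ZONES,
model_shrink_node_prepare() and model_force_scan_anon() are hypothetical
stand-ins invented for illustration, and the result of the kernel's
inactive_list_is_low() is passed in as a plain boolean. Only the comparison
"file + free <= total_high_wmark" and the three-part SCAN_ANON condition
are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ZONES 4	/* hypothetical; stands in for MAX_NR_ZONES */

/* Hypothetical stand-in for struct zone: only the fields the check reads. */
struct zone_model {
	bool managed;			/* models managed_zone(zone) */
	unsigned long high_wmark;	/* models high_wmark_pages(zone) */
};

/* Hypothetical stand-in for pg_data_t with pre-summed page counters. */
struct node_model {
	struct zone_model zones[MODEL_MAX_ZONES];
	unsigned long nr_free;
	unsigned long nr_active_file;
	unsigned long nr_inactive_file;
};

/* Hypothetical stand-in for struct scan_control. */
struct scan_control_model {
	bool cgroup_reclaim;		/* models cgroup_reclaim(sc) */
	int priority;
	unsigned int file_is_tiny:1;
};

/* Once per node: decide whether file pages are dangerously low. */
static void model_shrink_node_prepare(const struct node_model *node,
				      struct scan_control_model *sc)
{
	unsigned long total_high_wmark = 0;
	unsigned long file;
	size_t z;

	assert(node && sc);

	/* As in the diff, the flag is only computed for global reclaim. */
	if (sc->cgroup_reclaim)
		return;

	file = node->nr_active_file + node->nr_inactive_file;

	for (z = 0; z < MODEL_MAX_ZONES; z++) {
		if (!node->zones[z].managed)
			continue;
		total_high_wmark += node->zones[z].high_wmark;
	}

	sc->file_is_tiny = file + node->nr_free <= total_high_wmark;
}

/* Once per lruvec: consume the flag instead of recomputing the sums. */
static bool model_force_scan_anon(const struct scan_control_model *sc,
				  bool inactive_anon_is_low,
				  unsigned long inactive_anon)
{
	return sc->file_is_tiny &&
	       !inactive_anon_is_low &&
	       (inactive_anon >> sc->priority) != 0;
}
```

In this model the zone walk runs once per call to
model_shrink_node_prepare(), while model_force_scan_anon() is cheap enough
to evaluate for every lruvec on the node, which mirrors the cost argument
made in the commit message.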

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0da6efc306c8..cab7157e3b8b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -111,6 +111,9 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* The file pages on the current node are dangerously low */
+	unsigned int file_is_tiny:1;
+
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
 
@@ -2095,45 +2098,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
+	 * If the system is almost out of file pages, force-scan anon.
+	 * But only if there are enough inactive anonymous pages on
+	 * the LRU. Otherwise, the small LRU gets thrashed.
 	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long pgdatfile;
-		unsigned long pgdatfree;
-		int z;
-		unsigned long total_high_wmark = 0;
-
-		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
-			/*
-			 * Force SCAN_ANON if there are enough inactive
-			 * anonymous pages on the LRU in eligible zones.
-			 * Otherwise, the small LRU gets thrashed.
-			 */
-			if (!inactive_list_is_low(lruvec, false, sc, false) &&
-			    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
-					>> sc->priority) {
-				scan_balance = SCAN_ANON;
-				goto out;
-			}
-		}
+	if (sc->file_is_tiny &&
+	    !inactive_list_is_low(lruvec, false, sc, false) &&
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
+			    sc->reclaim_idx) >> sc->priority) {
+		scan_balance = SCAN_ANON;
+		goto out;
 	}
 
 	/*
@@ -2534,6 +2508,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long file;
+		unsigned long free;
+		int z;
+		unsigned long total_high_wmark = 0;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		sc->file_is_tiny = file + free <= total_high_wmark;
+	}
+
 	shrink_node_memcgs(pgdat, sc);
 
 	if (!cgroup_reclaim(sc))