[Devel] [PATCH RHEL7 COMMIT] ms/mm, vmscan: do not loop on too_many_isolated for ever

Thu Oct 18 12:33:13 MSK 2018

The commit is pushed to "branch-rh7-3.10.0-862.14.4.vz7.72.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-862.14.4.vz7.72.11
------>
commit a1d083f1a16ce0dc3a430046ca4e3616310c6e93
Author: Michal Hocko <mhocko at suse.com>
Date:   Wed Sep 6 16:21:11 2017 -0700

    ms/mm, vmscan: do not loop on too_many_isolated for ever
    
    Tetsuo Handa has reported[1][2][3] that direct reclaimers might get
    stuck in too_many_isolated loop basically for ever because the last few
    pages on the LRU lists are isolated by the kswapd which is stuck on fs
    locks when doing the pageout or slab reclaim.  This in turn means that
    there is nobody to actually trigger the oom killer and the system is
    basically unusable.
    
    too_many_isolated was introduced by commit 35cd78156c49 ("vmscan:
    throttle direct reclaim when too many pages are isolated already") to
    prevent premature OOM killer invocations, because back then a lack of
    reclaim progress could indeed trigger the OOM killer too early.
    
    But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom:
    rework oom detection") the allocation/reclaim retry loop considers all
    the reclaimable pages and throttles the allocation at that layer so we
    can loosen the direct reclaim throttling.
    
    Make the shrink_inactive_list loop over too_many_isolated bounded and
    return immediately when the situation hasn't resolved after the first
    sleep.
    
    Replace congestion_wait with a simple sleep (msleep(100) in the final
    version, see the uninterruptible-sleep note below) because we are not
    really waiting on the IO congestion in this path.
    
    Please note that this patch can theoretically cause the OOM killer to
    trigger earlier while there are many pages isolated for the reclaim
    which makes progress only very slowly.  This would be obvious from the
    OOM report, as the number of isolated pages is printed there.  If we
    ever hit this, should_reclaim_retry should consider those numbers in
    the evaluation in one way or another.
    
    [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
    [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
    [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp
    
    [mhocko at suse.com: switch to uninterruptible sleep]
      Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko at suse.com>
    Reported-by: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
    Tested-by: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
    Acked-by: Mel Gorman <mgorman at suse.de>
    Acked-by: Vlastimil Babka <vbabka at suse.cz>
    Acked-by: Rik van Riel <riel at redhat.com>
    Acked-by: Johannes Weiner <hannes at cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    https://jira.sw.ru/browse/PSBM-89512
    https://access.redhat.com/solutions/3538691
    
    (cherry picked from commit db73ee0d463799223244e96e7b7eea73b4a6ec31)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6be538ce81b6..2481caa15ec1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1702,9 +1702,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		msleep(100);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
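
The control flow introduced by the hunk above can be tried outside the kernel
with a small userspace sketch. This is only an illustration under stated
assumptions, not the kernel code: too_many_isolated_stub(), sleep_ms() and
shrink_sketch() are hypothetical names, the stub stays "busy" for a
configurable number of checks, and the fatal_signal_pending() check is left
out. What it shows is that the loop sleeps at most once and then gives up by
returning 0 instead of spinning for ever.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical stand-in for the kernel's too_many_isolated(): it reports
 * "too many isolated pages" for the next *remaining checks, then clears.
 */
static bool too_many_isolated_stub(int *remaining)
{
	if (*remaining > 0) {
		(*remaining)--;
		return true;
	}
	return false;
}

/* Userspace approximation of msleep(); not the kernel helper. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/*
 * Mirrors the bounded wait from the patch: sleep once, and if the
 * condition still holds on the next check, return 0 (no progress).
 * Returning nr_to_scan on the other path is a placeholder for "reclaim
 * went ahead"; the real function returns the number of reclaimed pages.
 * *sleeps counts how many times we slept, for inspection by the caller.
 */
static unsigned long shrink_sketch(unsigned long nr_to_scan,
				   int *busy_checks, int *sleeps)
{
	bool stalled = false;

	while (too_many_isolated_stub(busy_checks)) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		sleep_ms(100);
		(*sleeps)++;
		stalled = true;
	}

	return nr_to_scan;
}
```

Calling shrink_sketch() with *busy_checks set to 1 sleeps once and then
proceeds, while any larger value makes it return 0 after a single 100 ms
sleep. With the old unbounded congestion_wait() loop the caller would have
kept waiting for as long as the condition held.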