[Devel] [PATCH RHEL7 COMMIT] fs: buffer: move allocation failure loop into the allocator

Tue Sep 8 08:30:59 PDT 2015

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
------>
commit acaec267698c80877d904706148bb62deef74ec3
Author: Johannes Weiner <hannes at cmpxchg.org>
Date:   Tue Sep 8 19:30:59 2015 +0400

    fs: buffer: move allocation failure loop into the allocator
    
    Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.
    
    The most immediate problem with this is that the allocation may fail due
    to a memory cgroup limit, where flushers + direct reclaim might not make
    any progress towards resolving the situation at all: unlike the global
    case, a memory cgroup may have no cache at all, only anonymous pages
    and no swap.  This situation leads to a reclaim livelock, with
    excessive IO from waking the flushers and thrashing unrelated
    filesystem cache in a tight loop.
    
    Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.
    
    Reported-by: azurIt <azurit at pobox.sk>
    Signed-off-by: Johannes Weiner <hannes at cmpxchg.org>
    Cc: Michal Hocko <mhocko at suse.cz>
    Cc: <stable at kernel.org>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    (cherry picked from commit 84235de394d9775bfaa7fa9762a59d91fef0c1fc)
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 fs/buffer.c     | 14 ++++++++++++--
 mm/memcontrol.c |  2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2d0e29193feb..2b709d45ed6f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1012,9 +1012,19 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	struct buffer_head *bh;
 	sector_t end_block;
 	int ret = 0;		/* Will call free_more_memory() */
+	gfp_t gfp_mask;
 
-	page = find_or_create_page(inode->i_mapping, index,
-		(mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
+	gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
+	gfp_mask |= __GFP_MOVABLE;
+	/*
+	 * XXX: __getblk_slow() can not really deal with failure and
+	 * will endlessly loop on improvised global reclaim.  Prefer
+	 * looping in the allocator rather than here, at least that
+	 * code knows what it's doing.
+	 */
+	gfp_mask |= __GFP_NOFAIL;
+
+	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
 	if (!page)
 		return ret;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a0751f878d3..1433526f6bda 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2968,6 +2968,8 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	if (gfp_mask & __GFP_NOFAIL)
+		return 0;
 	return -ENOMEM;
 bypass:
 	*ptr = root_mem_cgroup;
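
For readers who want to see the behaviour change in isolation, here is a
rough userspace model of the two hunks above. It is only a sketch: every
model_* and MODEL_* name is hypothetical, the flag bit values are
placeholders rather than the kernel's, and the "reclaim" step is a
stand-in for the real memcg reclaim path. It shows the mask composition
from grow_dev_page() and a charge routine whose out-of-memory path
returns success without charging when the no-fail flag is set.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <errno.h>

typedef unsigned int model_gfp_t;

/* Placeholder bit values chosen for this model, not the kernel's. */
#define MODEL_GFP_FS      0x1u
#define MODEL_GFP_MOVABLE 0x2u
#define MODEL_GFP_NOFAIL  0x4u

struct model_cgroup {
	long usage;        /* pages currently charged to the group */
	long limit;        /* hard limit, in pages */
	long reclaimable;  /* pages a reclaim pass could free */
};

/* Models the gfp_mask composition in the grow_dev_page() hunk. */
static model_gfp_t model_buffer_gfp(model_gfp_t mapping_mask)
{
	model_gfp_t mask = mapping_mask & ~MODEL_GFP_FS;

	mask |= MODEL_GFP_MOVABLE;
	mask |= MODEL_GFP_NOFAIL;  /* let the allocator do the looping */
	return mask;
}

/*
 * Models the charge path. Returns 0 on success and -ENOMEM on failure.
 * *bypassed is set to 1 when the request succeeded without being
 * charged to the group (the no-fail case in the "nomem" path).
 */
static int model_try_charge(struct model_cgroup *cg, long nr_pages,
			    model_gfp_t mask, int *bypassed)
{
	*bypassed = 0;

	if (cg->usage + nr_pages > cg->limit) {
		/* Stand-in for reclaim: free what we can, up to the need. */
		long need = cg->usage + nr_pages - cg->limit;
		long freed = need < cg->reclaimable ? need : cg->reclaimable;

		cg->usage -= freed;
		cg->reclaimable -= freed;
	}

	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		return 0;
	}

	/* "nomem": reclaim made no (or not enough) progress. */
	if (mask & MODEL_GFP_NOFAIL) {
		*bypassed = 1;  /* succeed, but leave the group uncharged */
		return 0;
	}
	return -ENOMEM;
}
```

In this model, a group sitting at its limit with nothing reclaimable (the
"only anonymous pages, no swap" case from the changelog) rejects an
ordinary request with -ENOMEM, while a request built by
model_buffer_gfp() succeeds and leaves the group's usage unchanged.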