[Devel] [PATCH RHEL7 COMMIT] ms/fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem

Thu Mar 16 08:21:00 PDT 2017

The commit is pushed to "branch-rh7-3.10.0-514.10.2.vz7.29.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.10.2.vz7.29.4
------>
commit 2d47a05314ed0fd03df75c419eeda00fab40ad2d
Author: Eric W. Biederman <ebiederm at xmission.com>
Date:   Thu Mar 16 19:21:00 2017 +0400

    ms/fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem
    
    This is a backport of upstream (vanilla) commit:
    commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 ("fs/file.c:fdtable: avoid
    triggering OOMs from alloc_fdmem")
    
    Under certain conditions there might be a lot of alloc_fdmem()
    invocations requesting order <= PAGE_ALLOC_COSTLY_ORDER allocations.
    
    For example: httpd doing a lot of fork() calls.
    
    Real-life examples from our customers:
    
    [532506.773243] httpd           D ffff8803f5fecc20     0 939874   6606
    [532506.773257] Call Trace:
    [532506.773261]  [<ffffffff8163ce29>] schedule+0x29/0x70
    [532506.773264]  [<ffffffff8163a9d5>] schedule_timeout+0x175/0x2d0
    [532506.773272]  [<ffffffff8108cc90>] ? internal_add_timer+0x70/0x70
    [532506.773276]  [<ffffffff8163c3ae>] io_schedule_timeout+0xae/0x130
    [532506.773280]  [<ffffffff8119be85>] wait_iff_congested+0x135/0x150
    [532506.773284]  [<ffffffff810a86e0>] ? wake_up_atomic_t+0x30/0x30
    [532506.773288]  [<ffffffff8119071f>] shrink_inactive_list+0x65f/0x6c0
    [532506.773292]  [<ffffffff81190f55>] shrink_lruvec+0x395/0x800
    [532506.773296]  [<ffffffff811914af>] shrink_zone+0xef/0x2d0
    [532506.773300]  [<ffffffff81191a30>] do_try_to_free_pages+0x170/0x530
    [532506.773310]  [<ffffffff81191ec5>] try_to_free_pages+0xd5/0x160
    [532506.773315]  [<ffffffff811850ab>] __alloc_pages_nodemask+0x8ab/0xc10
    [532506.773320]  [<ffffffff811cb2f9>] alloc_pages_current+0xa9/0x170
    [532506.773324]  [<ffffffff8119f8f8>] kmalloc_order+0x18/0x50
    [532506.773327]  [<ffffffff8119f956>] kmalloc_order_trace+0x26/0xa0
    [532506.773332]  [<ffffffff811d8c69>] __kmalloc+0x259/0x270
    [532506.773337]  [<ffffffff812184d0>] alloc_fdmem+0x20/0x50
    [532506.773341]  [<ffffffff812185ac>] alloc_fdtable+0x6c/0xe0
    [532506.773344]  [<ffffffff81218b69>] dup_fd+0x1f9/0x2d0
    [532506.773354]  [<ffffffff810797cf>] copy_process.part.30+0x87f/0x1510
    [532506.773358]  [<ffffffff8107a641>] do_fork+0xe1/0x320
    [532506.773370]  [<ffffffff8107a906>] SyS_clone+0x16/0x20
    [532506.773376]  [<ffffffff81648299>] stub_clone+0x69/0x90
    [532506.773380]  [<ffffffff81647f49>] ? system_call_fastpath+0x16/0x1b
    
    [513890.005271] httpd           D ffff880425db7230     0 811718   6606
    [513890.005279] Call Trace:
    [513890.005282]  [<ffffffff8163ce29>] schedule+0x29/0x70
    [513890.005284]  [<ffffffff8163aa99>] schedule_timeout+0x239/0x2d0
    [513890.005292]  [<ffffffff8163c3ae>] io_schedule_timeout+0xae/0x130
    [513890.005296]  [<ffffffff8163c448>] io_schedule+0x18/0x20
    [513890.005298]  [<ffffffff812c6268>] get_request+0x218/0x780
    [513890.005303]  [<ffffffff812c8526>] blk_queue_bio+0xc6/0x3a0
    [513890.005309]  [<ffffffffa0002c59>] ? dm_make_request+0x119/0x170 [dm_mod]
    [513890.005311]  [<ffffffff812c3892>] generic_make_request+0xe2/0x130
    [513890.005313]  [<ffffffff812c3957>] submit_bio+0x77/0x1c0
    [513890.005318]  [<ffffffff811bf87e>] __swap_writepage+0x1be/0x260
    [513890.005337]  [<ffffffff811bf959>] swap_writepage+0x39/0x80
    [513890.005340]  [<ffffffff8118f68d>] shrink_page_list+0x4ad/0xa80
    [513890.005343]  [<ffffffff811902bb>] shrink_inactive_list+0x1fb/0x6c0
    [513890.005345]  [<ffffffff81190f55>] shrink_lruvec+0x395/0x800
    [513890.005348]  [<ffffffff811914af>] shrink_zone+0xef/0x2d0
    [513890.005350]  [<ffffffff81191a30>] do_try_to_free_pages+0x170/0x530
    [513890.005353]  [<ffffffff81191ec5>] try_to_free_pages+0xd5/0x160
    [513890.005355]  [<ffffffff811850ab>] __alloc_pages_nodemask+0x8ab/0xc10
    [513890.005358]  [<ffffffff811cb2f9>] alloc_pages_current+0xa9/0x170
    [513890.005360]  [<ffffffff8119f8f8>] kmalloc_order+0x18/0x50
    [513890.005362]  [<ffffffff8119f956>] kmalloc_order_trace+0x26/0xa0
    [513890.005365]  [<ffffffff811d8c69>] __kmalloc+0x259/0x270
    [513890.005367]  [<ffffffff812184d0>] alloc_fdmem+0x20/0x50
    [513890.005369]  [<ffffffff812185ac>] alloc_fdtable+0x6c/0xe0
    [513890.005371]  [<ffffffff81218b69>] dup_fd+0x1f9/0x2d0
    [513890.005376]  [<ffffffff810797cf>] copy_process.part.30+0x87f/0x1510
    [513890.005378]  [<ffffffff8107a641>] do_fork+0xe1/0x320
    [513890.005380]  [<ffffffff8107a906>] SyS_clone+0x16/0x20
    [513890.005382]  [<ffffffff81648299>] stub_clone+0x69/0x90
    
    We observed that kswapd sometimes cannot keep up with this, which
    causes many direct reclaim attempts. These in turn:
    
    1. Increase iowait time due to congestion_wait
    2. Increase the number of block requests per second due to
    page swapping and writeback
    3. May induce OOMs
    
    So it is better NOT to try that hard to allocate a contiguous
    area, and to fall back to vmalloc() as soon as possible.
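    
    As a rough illustration of which requests take the kmalloc() path
    at all, the size check in alloc_fdmem() can be modelled in
    userspace. This is only a sketch, not kernel code: the SKETCH_*
    constants and sketch_* helpers are hypothetical names, and it
    assumes 4 KiB pages and PAGE_ALLOC_COSTLY_ORDER == 3.
    
    ```c
    #include <stdio.h>
    #include <stddef.h>

    /* Assumptions for this sketch: 4 KiB pages, PAGE_ALLOC_COSTLY_ORDER == 3. */
    #define SKETCH_PAGE_SIZE    4096UL
    #define SKETCH_COSTLY_ORDER 3

    /* Smallest order such that (page size << order) covers the request. */
    static unsigned int sketch_order(size_t size)
    {
        unsigned int order = 0;

        while ((SKETCH_PAGE_SIZE << order) < size)
            order++;
        return order;
    }

    /* Non-zero if the size check would let kmalloc() be tried before vmalloc(). */
    static int sketch_tries_kmalloc(size_t size)
    {
        return size <= (SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER);
    }

    int main(void)
    {
        size_t sizes[] = { 512, 4096, 8192, 32768, 32769, 65536 };
        size_t i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("size %6zu  order %u  first attempt: %s\n",
                   sizes[i], sketch_order(sizes[i]),
                   sketch_tries_kmalloc(sizes[i]) ? "kmalloc" : "vmalloc");
        return 0;
    }
    ```
    
    Under these assumptions every request up to 32 KiB (order 3) first
    goes through kmalloc(), which is the path seen in the traces above.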
    
    =========================================================
    Original commit message:
    
        fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem
    
        Recently due to a spike in connections per second memcached on 3
        separate boxes triggered the OOM killer from accept.  At the time the
        OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
        problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
        hold a bitmap, and there was sufficient fragmentation that the largest
        page available was 8KiB.
    
        I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
        but I do agree that order 3 allocations are very likely to succeed.
    
        There are always pathologies where order > 0 allocations can fail when
        there are copious amounts of free memory available.  Using the
        pigeonhole principle it is easy to show that it requires 1 page more
        than 50% of the pages being free to guarantee an order 1 (8KiB)
        allocation will succeed, 1 page more than 75% of the pages being free
        to guarantee an order 2 (16KiB) allocation will succeed and 1 page more
        than 87.5% of the pages being free to guarantee an order 3 allocation
        will succeed.
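    
        The pigeonhole bound can be checked with a few lines of
        userspace C. This is a sketch under stated assumptions, not
        part of the patch: the function names are hypothetical, the
        total page count is assumed to be a multiple of the block
        size, and an order-k allocation is assumed to need a fully
        free, aligned block of 2^k pages. The worst case then keeps
        exactly one page in use per aligned block.
    
        ```c
        #include <stdio.h>

        /*
         * Worst case: one used page in every aligned block of (1 << order)
         * pages. No block is fully free, yet this many pages are free.
         * Assumes total_pages is a multiple of (1 << order).
         */
        static unsigned long max_free_without_block(unsigned long total_pages,
                                                    unsigned int order)
        {
            unsigned long block = 1UL << order;

            return total_pages - total_pages / block;
        }

        /* One more free page than the worst case forces a fully free block. */
        static unsigned long min_free_to_guarantee(unsigned long total_pages,
                                                   unsigned int order)
        {
            return max_free_without_block(total_pages, order) + 1;
        }

        int main(void)
        {
            unsigned long total = 1024; /* arbitrary example zone size, in pages */
            unsigned int order;

            for (order = 1; order <= 3; order++)
                printf("order %u: more than %.1f%% free, i.e. at least %lu of %lu pages\n",
                       order,
                       100.0 * max_free_without_block(total, order) / total,
                       min_free_to_guarantee(total, order), total);
            return 0;
        }
        ```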
    
        A server churning memory with a lot of small requests and replies like
        memcached is a common case that, if anything, will skew the odds
        against large pages being available.
    
        Therefore let's not give external applications a practical way to kill
        linux server applications, and specify __GFP_NORETRY to the kmalloc in
        alloc_fdmem.  Unless I am misreading the code, by the time the code
        reaches should_alloc_retry in __alloc_pages_slowpath (where
        __GFP_NORETRY becomes significant) we have already tried everything
        reasonable to allocate a page and the only thing left to do is wait.  So
        not waiting and falling back to vmalloc immediately seems like the
        reasonable thing to do even if there wasn't a chance of triggering the
        OOM killer.
    
        Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
        Cc: Eric Dumazet <eric.dumazet at gmail.com>
        Acked-by: David Rientjes <rientjes at google.com>
        Cc: Cong Wang <cwang at twopensource.com>
        Cc: <stable at vger.kernel.org>
        Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
        Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    Signed-off-by: Anatoly Stepanov <astepanov at cloudlinux.com>
    Acked-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
---
 fs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index c1e802c..1db5952 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -37,7 +37,7 @@ static void *alloc_fdmem(size_t size)
 	 * vmalloc() if the allocation size will be considered "large" by the VM.
 	 */
 	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		void *data = kmalloc(size, GFP_KERNEL_ACCOUNT|__GFP_NOWARN);
+		void *data = kmalloc(size, GFP_KERNEL_ACCOUNT|__GFP_NOWARN|__GFP_NORETRY);
 		if (data != NULL)
 			return data;
 	}