[Devel] [PATCH RHEL7 COMMIT] ms: mm: mmu_gather: remove __tlb_reset_range() for force flush

Vasily Averin vvs at virtuozzo.com
Sat Jan 30 14:40:53 MSK 2021


The commit is pushed to "branch-rh7-3.10.0-1160.11.1.vz7.172.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.11.1.vz7.172.10
------>
commit c6dc2591fc8cbf9cddceadea18c7f92bb1a53e6f
Author: Vasily Averin <vvs at virtuozzo.com>
Date:   Sat Jan 30 14:40:53 2021 +0300

    ms: mm: mmu_gather: remove __tlb_reset_range() for force flush
    
    backported upstream commit 7a30df49f63ad92318ddf1f7498d1129a77dd4bd
    Author: Yang Shi <yang.shi at linux.alibaba.com>
    Date:   Thu Jun 13 15:56:05 2019 -0700
    
        mm: mmu_gather: remove __tlb_reset_range() for force flush
    
        A few new fields were added to mmu_gather to make TLB flushing smarter
        for huge pages by tracking which levels of the page table have changed.
    
        __tlb_reset_range() resets all of this page table state back to
        "unchanged"; it is called from the TLB flush path when parallel mapping
        changes happen on the same range under a non-exclusive lock (i.e. the
        read side of mmap_sem).
    
        Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
        munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
        PTEs in parallel did not remove page tables.  But the aforementioned
        commit may do munmap() under read mmap_sem and free page tables.  This
        may result in a program hang on aarch64, as reported by Jan Stancek.
        The problem could be reproduced with his test program, slightly
        modified, shown below.
    
        ---8<---
    
        static int map_size = 4096;
        static int num_iter = 500;
        static long threads_total;
    
        static void *distant_area;
    
        void *map_write_unmap(void *ptr)
        {
                int *fd = ptr;
                unsigned char *map_address;
                int i, j = 0;
    
                for (i = 0; i < num_iter; i++) {
                        map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                        if (map_address == MAP_FAILED) {
                                perror("mmap");
                                exit(1);
                        }
    
                        for (j = 0; j < map_size; j++)
                                map_address[j] = 'b';
    
                        if (munmap(map_address, map_size) == -1) {
                                perror("munmap");
                                exit(1);
                        }
                }
    
                return NULL;
        }
    
        void *dummy(void *ptr)
        {
                return NULL;
        }
    
        int main(void)
        {
                pthread_t thid[2];
    
                /* hint for mmap in map_write_unmap() */
                distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
                                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
                distant_area += DISTANT_MMAP_SIZE / 2;
    
                while (1) {
                        pthread_create(&thid[0], NULL, map_write_unmap, NULL);
                        pthread_create(&thid[1], NULL, dummy, NULL);
    
                        pthread_join(thid[0], NULL);
                        pthread_join(thid[1], NULL);
                }
        }
        ---8<---
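    
        The snippet above omits its headers and the definition of
        DISTANT_MMAP_SIZE.  A minimal preamble that should let it build is
        sketched below; the 64 MB value for DISTANT_MMAP_SIZE is only an
        assumption for illustration, not taken from Jan's original test.
    
        #define _GNU_SOURCE             /* for MAP_ANONYMOUS */
        #include <stdio.h>              /* perror() */
        #include <stdlib.h>             /* exit() */
        #include <pthread.h>            /* pthread_create()/pthread_join() */
        #include <sys/mman.h>           /* mmap()/munmap() */
    
        /* Assumed size of the "distant" mmap hint area; the original value may differ. */
        #define DISTANT_MMAP_SIZE (64 * 1024 * 1024)
    
        Build with the pthread library, e.g. "gcc -O2 -pthread test.c -o test",
        and run it on an affected aarch64 kernel to try to reproduce the hang.
    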
        The program may produce a parallel execution like the one below:
    
                t1                                        t2
        munmap(map_address)
          downgrade_write(&mm->mmap_sem);
          unmap_region()
          tlb_gather_mmu()
            inc_tlb_flush_pending(tlb->mm);
          free_pgtables()
            tlb->freed_tables = 1
            tlb->cleared_pmds = 1
    
                                                pthread_exit()
                                                madvise(thread_stack, 8M, MADV_DONTNEED)
                                                  zap_page_range()
                                                    tlb_gather_mmu()
                                                      inc_tlb_flush_pending(tlb->mm);
    
          tlb_finish_mmu()
            if (mm_tlb_flush_nested(tlb->mm))
              __tlb_reset_range()
    
        __tlb_reset_range() would reset the freed_tables and cleared_* bits, but
        this causes an inconsistency for munmap(), which does free page tables.
        As a result, some architectures, e.g. aarch64, may not flush the TLB as
        completely as expected, leaving stale TLB entries behind.
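    
        The "nested" check in the trace above relies on the tlb_flush_pending
        counter in mm_struct.  Roughly, and simplified from the upstream
        include/linux/mm_types.h helpers (the RHEL7/vz7 backport may differ in
        detail), the logic is:
    
        static inline void inc_tlb_flush_pending(struct mm_struct *mm)
        {
                /* one count per mmu_gather currently batching flushes on this mm */
                atomic_inc(&mm->tlb_flush_pending);
        }
    
        static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
        {
                /* more than one pending gather means another thread is batching in parallel */
                return atomic_read(&mm->tlb_flush_pending) > 1;
        }
    
        This is why t1's tlb_finish_mmu() in the trace above ends up calling
        __tlb_reset_range() even though t1's munmap() freed page tables.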
    
        Use a fullmm flush, since it yields much better performance on aarch64
        and non-fullmm doesn't yield a significant difference on x86.
    
        The originally proposed fix came from Jan Stancek, who did most of the
        debugging of this issue; I just wrapped everything up together.
    
        Jan's testing results:
    
        v5.2-rc2-24-gbec7550cca10
        --------------------------
                 mean     stddev
        real    37.382   2.780
        user     1.420   0.078
        sys     54.658   1.855
    
        v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
        ---------------------------------------------------------------------------------------
                 mean     stddev
        real    37.119   2.105
        user     1.548   0.087
        sys     55.698   1.357
    
        [akpm at linux-foundation.org: coding-style fixes]
        Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
        Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
        Signed-off-by: Yang Shi <yang.shi at linux.alibaba.com>
        Signed-off-by: Jan Stancek <jstancek at redhat.com>
        Reported-by: Jan Stancek <jstancek at redhat.com>
        Tested-by: Jan Stancek <jstancek at redhat.com>
        Suggested-by: Will Deacon <will.deacon at arm.com>
        Tested-by: Will Deacon <will.deacon at arm.com>
        Acked-by: Will Deacon <will.deacon at arm.com>
        Cc: Peter Zijlstra <peterz at infradead.org>
        Cc: Nick Piggin <npiggin at gmail.com>
        Cc: "Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com>
        Cc: Nadav Amit <namit at vmware.com>
        Cc: Minchan Kim <minchan at kernel.org>
        Cc: Mel Gorman <mgorman at suse.de>
        Cc: <stable at vger.kernel.org>    [4.20+]
        Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
        Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    This patch fixes a lost TLB flush that caused userspace memory corruption
    after the vz7 rebase to RHEL7.9 (i.e. on vz7.17x.y kernels).
    It looks like previous vz7 kernels are not affected, because the issue was
    masked there by the presence of backported upstream commit 4647706ebeee
    ("mm: always flush VMA ranges affected by zap_page_range"), which was
    reverted in RHEL7.9.
    https://jira.sw.ru/browse/PSBM-124581
    Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
---
 mm/memory.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f6017f3..099f6e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -295,8 +295,9 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 	struct mmu_gather_batch *batch, *next;
 
 	if (force) {
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
@@ -436,10 +437,15 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	bool force = mm_tlb_flush_nested(tlb->mm);
 

