[Devel] [PATCH RHEL10 COMMIT] x86/mm/pat: take cpa_lock around large-page collapse

Konstantin Khorenko khorenko at virtuozzo.com
Mon Jun 29 21:03:57 MSK 2026


The commit is pushed to "branch-rh10-6.12.0-211.16.1.12.x.vz10-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh10-6.12.0-211.16.1.12.7.vz10
------>
commit a890cd35ddaa102721edb08e074ec56d83c18384
Author: Denis V. Lunev <den at openvz.org>
Date:   Mon Jun 29 14:33:34 2026 +0200

    x86/mm/pat: take cpa_lock around large-page collapse
    
    Concurrently loading and unloading modules on several CPUs on a KASAN
    build, with a short delay injected at the CPA page-table lookup to
    widen the race window, triggers a use-after-free within minutes:
    
      BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
      Write of size 8 at addr ffff888181139718 by task modprobe
      ...
      The buggy address belongs to the physical page:
       pfn:0x181139 ... page_type: f2(table)
    
    cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
    frees the old PTE-table pages, while __change_page_attr() obtains a
    PTE pointer from a lockless lookup_address_in_pgd_attr() walk and
    writes through it with set_pte_atomic() only later. When module text
    is served from a shared large ROX mapping, the two can operate on the
    same PMD:
    
      CPU A (module load)              CPU B (module finalize)
      -------------------              -----------------------
      execmem_make_temp_rw
       set_memory_nx
        __change_page_attr
         split 2M -> 4K table P
         kpte = &P[i]  (lockless)
                                       execmem_restore_rox
                                        set_memory_rox (CPA_COLLAPSE)
                                         cpa_collapse_large_pages
                                          rebuild leaf PMD
                                          flush_tlb_all
                                          __free_page(P)
         set_pte_atomic(kpte, ...)
           -> writes into freed P
    
    P is a page-table page (page_type: table) that can be reused at once,
    so the write corrupts whatever got the page next: a bad-pte or bad-page
    splat, or a fatal fault once P has been turned into read-only text.
    
    The flush_tlb_all() before the free does not close this: its IPI only
    serializes against page-table walkers that run with interrupts off
    (e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
    on, so nothing stops it from holding a stale pointer into P.
    
    Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
    free - under cpa_lock, the lock __change_page_attr() takes for the
    split path, so a concurrent walker can no longer hold a pointer into
    a table the collapse is about to free.
    
    debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
    map is 4K then, with no large pages to serialize), so the lock cannot
    order the two there. Skip the collapse in that config: it is only an
    optimization, and not freeing the tables leaves the unserialized walk
    nothing to race.
    
    With the fix, the same stress test runs cleanly for a prolonged period.
    
    Link: https://lkml.org/lkml/2026/6/26/1576
    Link: https://sashiko.dev/#/patchset/20260626163213.2284080-1-den%40openvz.org
    
    Feature: fix ms/mm
    https://virtuozzo.atlassian.net/browse/VSTOR-136104
    Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
    Signed-off-by: Denis V. Lunev <den at openvz.org>
---
 arch/x86/mm/pat/set_memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index fa853a03a40f0..2b481a6a3699e 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -418,6 +418,16 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 	int collapsed = 0;
 	int i;
 
+	/*
+	 * debug_pagealloc bypasses cpa_lock, so __change_page_attr() walks
+	 * unserialized and freeing collapsed PTE-tables could race it; skip
+	 * the optional merge there.
+	 */
+	if (debug_pagealloc_enabled())
+		return;
+
+	spin_lock(&cpa_lock);
+
 	if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
 		for (i = 0; i < cpa->numpages; i++)
 			collapsed += collapse_large_pages(__cpa_addr(cpa, i),
@@ -431,8 +441,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 			collapsed += collapse_large_pages(addr, &pgtables);
 	}
 
-	if (!collapsed)
+	if (!collapsed) {
+		spin_unlock(&cpa_lock);
 		return;
+	}
 
 	flush_tlb_all();
 
@@ -440,6 +452,8 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 		list_del(&ptdesc->pt_list);
 		__free_page(ptdesc_page(ptdesc));
 	}
+
+	spin_unlock(&cpa_lock);
 }
 
 static void cpa_flush(struct cpa_data *cpa, int cache)
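
The locking rule the patch enforces (look up the table pointer and write
through it under the same lock that the collapse holds while freeing the
table) can be sketched in userspace C. This is only an illustrative model,
not kernel code: struct pmd_slot, slot_split(), slot_set_entry() and
slot_collapse() are hypothetical names, and a pthread mutex stands in for
cpa_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define ENTRIES 512	/* entries per table in this model */

/* Hypothetical stand-in for one PMD slot: a leaf when table == NULL,
 * otherwise it points at a table of 4K-style entries. */
struct pmd_slot {
	pthread_mutex_t lock;	/* plays the role of cpa_lock */
	unsigned long *table;	/* NULL once collapsed back to a leaf */
};

static void slot_init(struct pmd_slot *s)
{
	int rc = pthread_mutex_init(&s->lock, NULL);

	assert(rc == 0);
	(void)rc;
	s->table = NULL;
}

/* Split path: install a fresh table if the slot is still a leaf. */
static int slot_split(struct pmd_slot *s)
{
	int ret = 0;

	pthread_mutex_lock(&s->lock);
	if (!s->table) {
		s->table = calloc(ENTRIES, sizeof(*s->table));
		if (!s->table)
			ret = -1;
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}

/* Walker: the lookup and the write happen under one lock hold, so the
 * pointer cannot go stale between them. Returns -1 if there is no table. */
static int slot_set_entry(struct pmd_slot *s, unsigned int idx,
			  unsigned long val)
{
	int ret = -1;

	if (idx >= ENTRIES)
		return -1;

	pthread_mutex_lock(&s->lock);
	if (s->table) {
		s->table[idx] = val;
		ret = 0;
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}

/* Collapse: detach and free the table under the same lock the walker
 * takes. Returns 1 if a table was freed, 0 if the slot was a leaf. */
static int slot_collapse(struct pmd_slot *s)
{
	int collapsed = 0;

	pthread_mutex_lock(&s->lock);
	if (s->table) {
		free(s->table);
		s->table = NULL;
		collapsed = 1;
	}
	pthread_mutex_unlock(&s->lock);
	return collapsed;
}
```

In this model a walker that arrives after the collapse sees table == NULL
and backs off instead of writing into freed memory. If slot_set_entry()
fetched the pointer before taking the lock, it would reproduce the stale
kpte write shown in the CPU A / CPU B diagram.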

