[Devel] [PATCH 1/1] x86/mm/pat: take cpa_lock around large-page collapse #VSTOR-136104

Denis V. Lunev den at openvz.org
Mon Jun 29 15:33:34 MSK 2026


Loading and unloading modules concurrently on several CPUs on a KASAN
build, with a short delay injected at the CPA page-table lookup to
widen the window, faults within minutes:

  BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
  Write of size 8 at addr ffff888181139718 by task modprobe
  ...
  The buggy address belongs to the physical page:
   pfn:0x181139 ... page_type: f2(table)

cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
frees the old PTE-table pages. Meanwhile, __change_page_attr() obtains
a PTE pointer from a lockless lookup_address_in_pgd_attr() call and
only writes through it later, with set_pte_atomic(). When module text
is served from a shared large ROX mapping, the two can run on the same
PMD:

  CPU A (module load)              CPU B (module finalize)
  -------------------              -----------------------
  execmem_make_temp_rw
   set_memory_nx
    __change_page_attr
     split 2M -> 4K table P
     kpte = &P[i]  (lockless)
                                   execmem_restore_rox
                                    set_memory_rox (CPA_COLLAPSE)
                                     cpa_collapse_large_pages
                                      rebuild leaf PMD
                                      flush_tlb_all
                                      __free_page(P)
     set_pte_atomic(kpte, ...)
       -> writes into freed P
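The interleaving above can be replayed deterministically in userspace.
The sketch below is a model, not kernel code. struct pte_table,
struct pmd, split_and_lookup(), collapse(), late_write() and replay()
are hypothetical stand-ins. Freeing only sets a flag, so the late write
through the stale pointer can be detected instead of being undefined
behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE 512

/* Hypothetical stand-ins for illustration only, not kernel types. */
struct pte_table {
	uint64_t pte[PTRS_PER_PTE];
	bool freed;	/* marked instead of freed, so stale writes are detectable */
};

struct pmd {
	struct pte_table *table;	/* NULL: 2M leaf mapping */
};

static struct pte_table table_p;	/* plays the role of page P */

/* CPU A: split 2M -> 4K table P, then take a lockless pointer into it. */
static uint64_t *split_and_lookup(struct pmd *pmd, size_t i)
{
	table_p.freed = false;
	pmd->table = &table_p;
	return &pmd->table->pte[i];
}

/* CPU B: rebuild the leaf PMD, then "free" the old PTE table. */
static void collapse(struct pmd *pmd)
{
	struct pte_table *old = pmd->table;

	pmd->table = NULL;	/* rebuild leaf PMD */
	/* a flush_tlb_all() here would not wait for CPU A's irqs-on walk */
	old->freed = true;	/* __free_page(P) stand-in */
}

/* CPU A: the deferred set_pte_atomic(kpte, ...). kpte points into
 * table_p, so report whether that table had already been freed. */
static bool late_write(uint64_t *kpte, uint64_t val)
{
	*kpte = val;
	return table_p.freed;
}

/* Replays one interleaving; returns true if the write hit a freed table. */
static bool replay(bool collapse_before_write)
{
	struct pmd pmd = { .table = NULL };
	uint64_t *kpte = split_and_lookup(&pmd, 7);
	bool stale;

	if (collapse_before_write) {
		collapse(&pmd);			/* the order in the diagram */
		stale = late_write(kpte, 0x1234);
	} else {
		stale = late_write(kpte, 0x1234);
		collapse(&pmd);
	}
	return stale;
}
```

replay(true) follows the diagram and reports a write into the freed
table. replay(false) writes before the collapse and reports none.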

P is a page-table page (page_type: table) that can be reused
immediately. The write therefore corrupts whatever gets the page next.
The result is a bad-pte or bad-page splat, or a fatal fault once P has
been turned into read-only text.

The flush_tlb_all() before the free does not close this: its IPI only
serializes against page-table walkers that run with interrupts off
(e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
on, so nothing stops it from holding a stale pointer into P.

Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
free - under cpa_lock, the lock __change_page_attr() takes for the
split path, so a concurrent walker can no longer hold a pointer into
a table the collapse is about to free.
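A rough userspace model of the fix follows. It assumes the walker's
lookup and its later write fall inside the same critical section. A
pthread mutex stands in for cpa_lock. model_cpa_lock, walker(),
collapser() and run_model() are hypothetical names. Two recycled table
slots mimic a freed page being reused at once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define PTRS_PER_PTE 512

/* Hypothetical model types and state, not kernel API. */
struct pte_table {
	uint64_t pte[PTRS_PER_PTE];
	bool freed;	/* marked instead of freed, to detect stale writes */
};

static pthread_mutex_t model_cpa_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pte_table tables[2];	/* recycled, like a reused page */
static struct pte_table *pmd_table;	/* NULL: 2M leaf */
static int next_slot;
static long stale_writes;

/* Walker (CPU A): split if needed, look up, then write, all under the lock. */
static void *walker(void *arg)
{
	long iters = (long)(intptr_t)arg;

	for (long n = 0; n < iters; n++) {
		pthread_mutex_lock(&model_cpa_lock);
		if (!pmd_table) {			/* split 2M -> 4K */
			next_slot ^= 1;
			tables[next_slot].freed = false;
			pmd_table = &tables[next_slot];
		}
		struct pte_table *t = pmd_table;		/* lookup */
		uint64_t *kpte = &t->pte[n % PTRS_PER_PTE];
		*kpte = (uint64_t)n;			/* set_pte_atomic() stand-in */
		if (t->freed)
			stale_writes++;
		pthread_mutex_unlock(&model_cpa_lock);
	}
	return NULL;
}

/* Collapser (CPU B): rebuild the leaf and free the table under the same lock. */
static void *collapser(void *arg)
{
	long iters = (long)(intptr_t)arg;

	for (long n = 0; n < iters; n++) {
		pthread_mutex_lock(&model_cpa_lock);
		if (pmd_table) {
			struct pte_table *old = pmd_table;

			pmd_table = NULL;	/* rebuild leaf PMD */
			old->freed = true;	/* __free_page() stand-in */
		}
		pthread_mutex_unlock(&model_cpa_lock);
	}
	return NULL;
}

/* Runs both sides concurrently; returns the number of stale writes seen. */
static long run_model(long iters)
{
	pthread_t a, b;

	pmd_table = NULL;
	stale_writes = 0;
	pthread_create(&a, NULL, walker, (void *)(intptr_t)iters);
	pthread_create(&b, NULL, collapser, (void *)(intptr_t)iters);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return stale_writes;
}
```

The model deliberately omits an unlocked variant. Without the mutex,
the walker's lookup and write could straddle the collapse, and that
would also be a data race in C itself. With the lock, the table the
walker writes to cannot be freed until the write is done.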

debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
map is 4K then, with no large pages to serialize), so the lock cannot
order the two there. Skip the collapse in that config: it is only an
optimization, and not freeing the tables leaves the unserialized walk
nothing to race.
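The control flow of the patched collapse, with the early debug_pagealloc
bailout and the lock dropped on both exit paths, can be sketched in
userspace. This is only an illustration. model_collapse(),
struct collapse_result and lock_released() are hypothetical. The
sketch assumes one table is freed per collapsed entry.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the patched flow, not kernel API. */
static pthread_mutex_t model_cpa_lock = PTHREAD_MUTEX_INITIALIZER;

struct collapse_result {
	bool took_lock;
	int freed_tables;
};

static struct collapse_result model_collapse(bool debug_pagealloc,
					     int collapsible)
{
	struct collapse_result r = { false, 0 };

	/* walkers skip the lock under debug_pagealloc, so never free here */
	if (debug_pagealloc)
		return r;

	pthread_mutex_lock(&model_cpa_lock);
	r.took_lock = true;

	if (collapsible == 0) {
		/* nothing merged: the early exit must still drop the lock */
		pthread_mutex_unlock(&model_cpa_lock);
		return r;
	}

	/* rebuild and flush_tlb_all() stand-ins would go here, under the lock */
	r.freed_tables = collapsible;	/* assumed: one table per collapsed entry */

	pthread_mutex_unlock(&model_cpa_lock);
	return r;
}

/* True if no exit path leaked the lock. */
static bool lock_released(void)
{
	if (pthread_mutex_trylock(&model_cpa_lock) != 0)
		return false;
	pthread_mutex_unlock(&model_cpa_lock);
	return true;
}
```

Calling model_collapse(true, 3) frees nothing and never touches the
lock. Calling model_collapse(false, 0) and model_collapse(false, 3)
both leave the lock released afterwards.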

With the fix applied, the same stress test runs cleanly for a
prolonged period.

Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
Signed-off-by: Denis V. Lunev <den at openvz.org>
---
 arch/x86/mm/pat/set_memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index fa853a03a40f..2b481a6a3699 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -418,6 +418,16 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 	int collapsed = 0;
 	int i;
 
+	/*
+	 * debug_pagealloc bypasses cpa_lock, so __change_page_attr() walks
+	 * unserialized and freeing collapsed PTE-tables could race it; skip
+	 * the optional merge there.
+	 */
+	if (debug_pagealloc_enabled())
+		return;
+
+	spin_lock(&cpa_lock);
+
 	if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
 		for (i = 0; i < cpa->numpages; i++)
 			collapsed += collapse_large_pages(__cpa_addr(cpa, i),
@@ -431,8 +441,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 			collapsed += collapse_large_pages(addr, &pgtables);
 	}
 
-	if (!collapsed)
+	if (!collapsed) {
+		spin_unlock(&cpa_lock);
 		return;
+	}
 
 	flush_tlb_all();
 
@@ -440,6 +452,8 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 		list_del(&ptdesc->pt_list);
 		__free_page(ptdesc_page(ptdesc));
 	}
+
+	spin_unlock(&cpa_lock);
 }
 
 static void cpa_flush(struct cpa_data *cpa, int cache)
-- 
2.53.0


