[Devel] [PATCH RHEL9 COMMIT] oracle/mm: avoid early cow when copying ptes for MADV_DOEXEC

Konstantin Khorenko khorenko at virtuozzo.com
Thu Jan 23 23:35:47 MSK 2025


The commit is pushed to "branch-rh9-5.14.0-427.44.1.vz9.80.x-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh9-5.14.0-427.44.1.vz9.80.5
------>
commit a8cc8c6ac35ccc81b8c16425596cac77df058dee
Author: Anthony Yznaga <anthony.yznaga at oracle.com>
Date:   Thu Jan 26 15:41:44 2023 -0800

    oracle/mm: avoid early cow when copying ptes for MADV_DOEXEC
    
    When a VMA preserved via MADV_DOEXEC is copied to the new mm during
    exec, copy_page_range() is called to copy the pagetable entries.
    Commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
    for ptes") changed how pinned pages encountered by copy_page_range()
    are handled. A copy of the page is made immediately rather than
    write-protecting it for later COW. This breaks MADV_DOEXEC when the
    memory to preserve is pinned (e.g. the guest memory of a VFIO-enabled
    guest). Ensure that this page copying will not be done when copying
    pagetable entries for preservation by adding a check for VM_EXEC_KEEP.
    
    Orabug: 35054621
    Signed-off-by: Anthony Yznaga <anthony.yznaga at oracle.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett at oracle.com>
    
    https://virtuozzo.atlassian.net/browse/VSTOR-96305
    
    Porting notes:
    
    Red Hat has applied
    rh commit: d8f21270d397 ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()")
    ms commit: fb3d824d1a46 ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()")
    
    and
    rh commit: 85f85f728ec6 ("mm/memory: slightly simplify copy_present_pte()")
    ms commit: b51ad4f8679e ("mm/memory: slightly simplify copy_present_pte()")
    
    So the check from copy_present_page() has been moved to
    copy_present_pte().
    
    (cherry picked from Oracle commit a904d4d4c24126a64b6d8aa0658425f4964ce674)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    Feature: oracle/mm: MADV_DOEXEC madvise() flag
---
 mm/memory.c |  6 +++++-
 mm/mmap.c   | 10 +++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 88b1aead060f..ebd08a1f2c9a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -915,9 +915,12 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	bool is_exec_keep;
 
 	page = vm_normal_page(src_vma, addr, pte);
 	if (page && PageAnon(page)) {
+		is_exec_keep = dst_vma->vm_flags & VM_EXEC_KEEP ? true : false;
+
 		/*
 		 * If this page may have been pinned by the parent process,
 		 * copy the page immediately for the child so that we'll always
@@ -925,7 +928,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		 * future.
 		 */
 		get_page(page);
-		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
+		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma)) &&
+		    !is_exec_keep) {
 			/* Page maybe pinned, we have to copy. */
 			put_page(page);
 			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
diff --git a/mm/mmap.c b/mm/mmap.c
index f87d284bd17b..9bb2382d9101 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3300,10 +3300,11 @@ int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
 
 	/*
 	 * Clear functionality that should not carry over to the new
-	 * process.any memory locking, userfaultfd, and preservation over
-	 * exec flags.
+	 * process. Note that VM_EXEC_KEEP is cleared later to allow
+	 * code called by copy_page_range to infer that the copying is
+	 * for preserving over exec and not for process forking.
 	 */
-	vma->vm_flags &= ~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
+	vma->vm_flags &= ~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP);
 	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 
 	__insert_vm_struct(mm, vma);
@@ -3318,6 +3319,9 @@ int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
 	old_vma->vm_flags &= ~VM_ACCOUNT;
 
 	ret = copy_page_range(vma, old_vma);
+
+	vma->vm_flags &= ~VM_EXEC_KEEP;
+
 	return ret;
 
 fail_nomem_anon_vma_fork:
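
For readers who want to see the decision in isolation, here is a minimal
user-space sketch of the logic this patch introduces. It is not kernel code:
the sketch_* names and the flag values are hypothetical stand-ins, and the
"maybe pinned" input stands in for a non-zero return from
page_try_dup_anon_rmap(). It models only the branch taken for an anonymous
page and the order in which vma_dup() clears VM_EXEC_KEEP.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; the real bits are defined
 * in the kernel headers and differ from these. */
#define SKETCH_VM_LOCKED    0x1UL
#define SKETCH_VM_EXEC_KEEP 0x2UL

struct sketch_vma {
	unsigned long vm_flags;
};

enum sketch_action {
	SKETCH_MAP_SAME_PAGE,	/* destination pte refers to the source page */
	SKETCH_COPY_NOW		/* early copy, as done for fork() */
};

/* Models the condition added to copy_present_pte(): a possibly pinned
 * anonymous page is copied early only when the destination VMA is not
 * being preserved across exec. */
static enum sketch_action
sketch_copy_anon_pte(const struct sketch_vma *dst_vma, bool maybe_pinned)
{
	bool is_exec_keep = (dst_vma->vm_flags & SKETCH_VM_EXEC_KEEP) != 0;

	if (maybe_pinned && !is_exec_keep)
		return SKETCH_COPY_NOW;
	return SKETCH_MAP_SAME_PAGE;
}

/* Models the ordering change in vma_dup(): the keep flag must survive
 * until the page table copy has run, and is cleared only afterwards. */
static enum sketch_action
sketch_vma_dup(struct sketch_vma *new_vma, unsigned long old_flags,
	       bool maybe_pinned)
{
	enum sketch_action action;

	/* Clear flags that must not carry over, but keep EXEC_KEEP for now. */
	new_vma->vm_flags = old_flags & ~SKETCH_VM_LOCKED;

	action = sketch_copy_anon_pte(new_vma, maybe_pinned);

	/* Only now drop the preservation marker from the new VMA. */
	new_vma->vm_flags &= ~SKETCH_VM_EXEC_KEEP;
	return action;
}
```

In this model, clearing SKETCH_VM_EXEC_KEEP before the call to
sketch_copy_anon_pte() would make a preserved VMA indistinguishable from a
forked one, which is the situation the mm/mmap.c hunk avoids.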

