[Devel] [PATCH RHEL7 COMMIT] ms/shmem: fix shm fallocate() list corruption

Fri Mar 31 07:27:52 PDT 2017

The commit is pushed to "branch-rh7-3.10.0-514.10.2.vz7.29.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.10.2.vz7.29.9
------>
commit 8dcde8690897e05676bbe0870a9873105a3bbd36
Author: Linus Torvalds <torvalds at linux-foundation.org>
Date:   Fri Mar 31 18:27:52 2017 +0400

    ms/shmem: fix shm fallocate() list corruption
    
    ms commit: 10d20bd ("shmem: fix shm fallocate() list corruption")
    
    The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
    want to race with generating new pages by faulting them in.
    
    However, the wait-queue used to delay the page faulting has a serious
    problem: the wait queue head (in shmem_fallocate()) is allocated on the
    stack, and the code expects that "wake_up_all()" will make sure that all
    the queue entries are gone before the stack frame is de-allocated.
    
    And that is not at all necessarily the case.
    
    Yes, a normal wake-up sequence will remove the wait-queue entry that
    caused the wakeup (see "autoremove_wake_function()"), but the key
    wording there is "that caused the wakeup".  When there are multiple
    possible wakeup sources, the wait queue entry may well stay around.
    
    And _particularly_ in a page fault path, we may be faulting in new pages
    from user space while we also have other things going on, and there may
    well be other pending wakeups.
    
    So despite the "wake_up_all()", it's not at all guaranteed that all list
    entries are removed from the wait queue head on the stack.
    
    Fix this by introducing a new wakeup function that removes the list
    entry unconditionally, even if the target process had already woken up
    for other reasons.  Use that "synchronous" function to set up the
    waiters in shmem_fault().
    
    This problem has never been seen in the wild afaik, but Dave Jones has
    reported it on and off while running trinity.  We thought we fixed the
    stack corruption with the blk-mq rq_list locking fix (commit
    7fe311302f7d: "blk-mq: update hardware and software queues for sleeping
    alloc"), but it turns out there was _another_ stack corruptor hiding
    in the trinity runs.
    
    Vegard Nossum (also running trinity) was able to trigger this one fairly
    consistently, and made us look once again at the shmem code due to the
    faults often being in that area.
    
    Reported-and-tested-by: Vegard Nossum <vegard.nossum at oracle.com>.
    Reported-by: Dave Jones <davej at codemonkey.org.uk>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    
    https://jira.sw.ru/browse/PSBM-56255
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
---
 mm/shmem.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d2bda92..db31c7a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1399,6 +1399,18 @@ unlock:
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1432,7 +1444,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		    vmf->pgoff >= shmem_falloc->start &&
 		    vmf->pgoff < shmem_falloc->next) {
 			wait_queue_head_t *shmem_falloc_waitq;
-			DEFINE_WAIT(shmem_fault_wait);
+			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2254,6 +2266,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		spin_lock(&inode->i_lock);
 		inode->i_private = NULL;
 		wake_up_all(&shmem_falloc_waitq);
+		WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 		spin_unlock(&inode->i_lock);
 		error = 0;
 		goto out;