[Devel] [PATCH RHEL7 COMMIT] ms/fs: Fix race when checking i_size on direct i/o read

Mon Apr 10 02:34:18 PDT 2017

The commit is pushed to "branch-rh7-3.10.0-514.10.2.vz7.29.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.10.2.vz7.29.14
------>
commit e739c50cf51fd9ee3112356eb427f3ee5b59aebc
Author: Steven Whitehouse <swhiteho at redhat.com>
Date:   Fri Apr 7 16:36:56 2017 +0400

    ms/fs: Fix race when checking i_size on direct i/o read
    
    Backport 9fe55eea from ml. It's important because:
    
    > The problem is that everyone reads until EOF indication.
    > So, the last read always reads beyond EOF and
    > generic_file_read_iter() in this case falls back to cached read
    > and hangs there waiting for background request.
    
    ===
    khorenko@: imagine a situation: pStorage mount gets stuck, aio requests
    are not processed => all possible requests are occupied =>
    no more requests are possible, cached reads block because of
    too big dirty memory, even status utilities don't work - they
    cannot read status files. Getting status, investigation, debugging
    is difficult under such circumstances.
    
    The tweak which significantly helps debugging is to use direct read
    which does not block, but this weirdness with reading beyond EOF
    which results in cached read - which again blocks - makes the tweak
    non-working. The patch makes direct read really non-blocking.
    ===
    
    Original patch description:
    
    commit 9fe55eea7e4b444bafc42fa0000cc2d1d2847275
    Author: Steven Whitehouse <swhiteho at redhat.com>
    Date:   Fri Jan 24 14:42:22 2014 +0000
    
        Fix race when checking i_size on direct i/o read
    
        So far I've had one ACK for this, and no other comments. So I think it
        is probably time to send this via some suitable tree. I'm guessing that
        the vfs tree would be the most appropriate route, but not sure that
        there is one at the moment (don't see anything recent at kernel.org)
        so in that case I think -mm is the "back up plan". Al, please let me
        know if you will take this?
    
        Steve.
    
        ---------------------
    
        Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
        reads with append dio writes" thread on linux-fsdevel, this patch is my
        current version of the fix proposed as option (b) in that thread.
    
        Removing the i_size test from the direct i/o read path at vfs level
        means that filesystems now have to deal with requests which are beyond
        i_size themselves. These I've divided into three sets:
    
         a) Those with "no op" ->direct_IO (9p, cifs, ceph)
        These are obviously not going to be an issue
    
         b) Those with "home brew" ->direct_IO (nfs, fuse)
        I've been told that NFS should not have any problem with the larger
        i_size, however I've added an extra test to FUSE to duplicate the
        original behaviour just to be on the safe side.
    
         c) Those using __blockdev_direct_IO()
        These call through to ->get_block() which should deal with the EOF
        condition correctly. I've verified that with GFS2 and I believe that
        Zheng has verified it for ext4. I've also run the test on XFS and it
        passes both before and after this change.
    
        The part of the patch in filemap.c looks a lot larger than it really is
        - there are only two lines of real change. The rest is just indentation
        of the contained code.
    
        There remains a test of i_size though, which was added for btrfs. It
        doesn't cause the other filesystems a problem as the test is performed
        after ->direct_IO has been called. It is possible that there is a race
        that does matter to btrfs, however this patch doesn't change that, so
        its still an overall improvement.
    
        Signed-off-by: Steven Whitehouse <swhiteho at redhat.com>
        Reported-by: Zheng Liu <gnehzuil.liu at gmail.com>
        Cc: Jan Kara <jack at suse.cz>
    
    Signed-off-by: Maxim Patlasov <mpatlasov at virtuozzo.com>
---
 fs/fuse/file.c |  3 +++
 mm/filemap.c   | 44 +++++++++++++++++++++-----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 41ed6f0..37384e2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3395,6 +3395,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
+	if ((rw == READ) && (offset > i_size))
+		return 0;
+
 	/* optimization for short read */
 	if (async_dio && rw != WRITE && offset + count > i_size) {
 		loff_t new_count;
diff --git a/mm/filemap.c b/mm/filemap.c
index 16517c6..1eebc4a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1922,30 +1922,28 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
-		if (pos < size) {
-			retval = filemap_write_and_wait_range(mapping, pos,
-					pos + count - 1);
-			if (!retval) {
-				retval = mapping_direct_IO(mapping, READ,
-							   iocb, iter, pos);
-			}
-			if (retval > 0) {
-				*ppos = pos + retval;
-				count -= retval;
-			}
+		retval = filemap_write_and_wait_range(mapping, pos,
+				pos + count - 1);
+		if (!retval) {
+			retval = mapping_direct_IO(mapping, READ,
+						   iocb, iter, pos);
+		}
+		if (retval > 0) {
+			*ppos = pos + retval;
+			count -= retval;
+		}
 
-			/*
-			 * Btrfs can have a short DIO read if we encounter
-			 * compressed extents, so if there was an error, or if
-			 * we've already read everything we wanted to, or if
-			 * there was a short read because we hit EOF, go ahead
-			 * and return.  Otherwise fallthrough to buffered io for
-			 * the rest of the read.
-			 */
-			if (retval < 0 || !count || *ppos >= size) {
-				file_accessed(filp);
-				goto out;
-			}
+		/*
+		 * Btrfs can have a short DIO read if we encounter
+		 * compressed extents, so if there was an error, or if
+		 * we've already read everything we wanted to, or if
+		 * there was a short read because we hit EOF, go ahead
+		 * and return.  Otherwise fallthrough to buffered io for
+		 * the rest of the read.
+		 */
+		if (retval < 0 || !count || *ppos >= size) {
+			file_accessed(filp);
+			goto out;
 		}
 
 		/*