[Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

Thu Dec 1 00:06:36 PST 2016

Maxim Patlasov <mpatlasov at virtuozzo.com> writes:

> Alexey,
>
>
> You're right. And while composing the patch I well understood that it's 
> possible to rework fuse_sync_writes() using a counter instead of 
> negative bias. But the problem with flush_mtime still exists anyway. 
> Think about it: we firstly acquire local mtime from local inode, then 
> fill and submit mtime-update-request. Since then, we don't know when 
> exactly fuse daemon will apply that new mtime to its metadata 
> structures. If another mtime-update is generated in-between (e.g. "touch 
> -d <date> file", or even simplier -- just a single direct write 
> implicitly updating mtime), we wouldn't know which of those two 
> mtime-update-requests are processed by fused first. That comes from a 
> general FUSE protocol limitation: when kernel fuse queues request A, 
> then request B, it cannot be sure if they will be processed by userspace 
> as <A, then B> or <B, then A>.
>
>
> The big advantage of the patch I sent is that it's very simple, 
> straightforward and presumably will remove 99% of contention between 
> fsync and io_submit (assuming we spend most of time waiting for 
> userspace ACK for FUSE_FSYNC request. There are actually three questions 
> to answer:

>
>
> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a 
> lot of io_submits? The goal of fsync is to ensure that some state is 
> actually went to platters. An app who races io_submit-s with fsync-s 
> actually doesn't care which state will come to platters. I'm not sure 
> that it's reasonable to work very hard to achieve the best possible 
> performance for such a marginal app.
Obiously any filesystem behave like this.
Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s
All that io may happens in parallel, fs guarantee only that metadata
will be serialized. So all that concurent IO flowa to blockdevice which
does no have i_mutex so all IO indeed happen concurrently.
But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
on file. For general filesystem (xfs/ext4) we grab i_mutex only on write
path, fsync is lockless. But int case of fuse we artificially introduce
i_mutex inside fsync which basically kill concurrency for upper FS.
As result we have SMP scalability as we have in Linux-v2.2 with single
mutex in VFS.

BTW: I'm wondering why do we care about mtime at all. for fs-in-file
we can relax that, for example flush mtime only on fsync, and not for fdatasync.

>
>
> 2) Will the patch (in the form I sent it) break something? I think no. 
> If you know some usecase that can be broken, let's discuss it in more 
> details.
>
>
> 3) Should we expect some noticeable (or significant) improvement in 
> performance comparing fuse_fsync with no locking at all vs. the locking 
> we have with that patch applied? I tend to think that the answer is "no" 
> because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
> you disagree, let's firstly measure that difference in performance 
> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
> start to think if it's really worthy to fully re-work locking scheme to 
> preserve flush_mtime correctness w/o i_mutex.
>
>
> Thanks,
>
> Maxim
>
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>> can be done only under i_mutex.
>>
>> IMHO it is only due to bad implementation.
>> If fuse_set_nowrite would be done with separate
>> count instead of adding negative bias, it would
>> be possible.
>>
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet at virtuozzo.com> wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity,
>>> it is about stopping data feed while fsync is executed
>>> to prevent livelock.
>>>
>>> I cannot tell anything about mtime update, it is just some voodoo
>>> magic for me.
>>>
>>> What's about fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove useless write_inode_now. Its work is done
>>>      by filemap_write_and_wait_range(), there is no need to repeat it
>>> under mutex.
>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>      fuse continuation forfilemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around fsync call.
>>>
>>> B.
>>> 1. Remove  write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>      and add release after call to FSYNC.
>>>
>>> Both prevent livelock. B is obviosly optimal.
>>>
>>> But A preserves historic fuse protocol semantics.
>>> F.e. I have no idea would user space survive truncate
>>> racing with fsync. pstorage should survice, though this
>>> path was never tested.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov at virtuozzo.com> wrote:
>>>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>>>> fuse_flush_mtime(). But when those operations are done, it's actually
>>>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>>>> or not: we ensured that all relevant data were already seen by
>>>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>>>> anyway; if the user screws up by leaking new data updates in-between, it
>>>> is up to the user and doesn't violate fsync(2) semantics.
>>>>
>>>> https://jira.sw.ru/browse/PSBM-55919
>>>>
>>>> Signed-off-by: Maxim Patlasov <mpatlasov at virtuozzo.com>
>>>> ---
>>>>   fs/fuse/file.c |    3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>> index 464b2f5..559dfd9 100644
>>>> --- a/fs/fuse/file.c
>>>> +++ b/fs/fuse/file.c
>>>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>                  goto out;
>>>>          }
>>>>
>>>> +       mutex_unlock(&inode->i_mutex);
>>>> +
>>>>          memset(&inarg, 0, sizeof(inarg));
>>>>          inarg.fh = ff->fh;
>>>>          inarg.fsync_flags = datasync ? 1 : 0;
>>>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>                          fc->no_fsync = 1;
>>>>                  err = 0;
>>>>          }
>>>> +       return err;
>>>>   out:
>>>>          mutex_unlock(&inode->i_mutex);
>>>>          return err;
>>>>