[Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

Thu Dec 1 11:09:44 PST 2016

On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:

> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>
>> Alexey,
>>
>>
>> You're right. And while composing the patch I well understood that it's
>> possible to rework fuse_sync_writes() using a counter instead of
>> negative bias. But the problem with flush_mtime still exists anyway.
>> Think about it: we firstly acquire local mtime from local inode, then
>> fill and submit mtime-update-request. Since then, we don't know when
>> exactly fuse daemon will apply that new mtime to its metadata
>> structures. If another mtime-update is generated in-between (e.g. "touch
>> -d <date> file", or even simplier -- just a single direct write
>> implicitly updating mtime), we wouldn't know which of those two
>> mtime-update-requests are processed by fused first. That comes from a
>> general FUSE protocol limitation: when kernel fuse queues request A,
>> then request B, it cannot be sure if they will be processed by userspace
>> as <A, then B> or <B, then A>.
>>
>>
>> The big advantage of the patch I sent is that it's very simple,
>> straightforward and presumably will remove 99% of contention between
>> fsync and io_submit (assuming we spend most of time waiting for
>> userspace ACK for FUSE_FSYNC request. There are actually three questions
>> to answer:
>>
>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
>> lot of io_submits? The goal of fsync is to ensure that some state is
>> actually went to platters. An app who races io_submit-s with fsync-s
>> actually doesn't care which state will come to platters. I'm not sure
>> that it's reasonable to work very hard to achieve the best possible
>> performance for such a marginal app.
> Obiously any filesystem behave like this.
> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s
> All that io may happens in parallel, fs guarantee only that metadata
> will be serialized. So all that concurent IO flowa to blockdevice which
> does no have i_mutex so all IO indeed happen concurrently.

Looks as you're comparing an app doing POSIX open/read/write/fsync/close 
with fs doing submit_bio. This is a stretch. But OK, there is a 
similarity. But I don't think this rather vague similarity proves something.

> But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
> on file.

This really makes sense. If an app inside a VM loops over ordinary 
direct writes, while another app (in the same VM) does fsync, it's not 
fair to suspend the first app for long while just because fuse holds 
i_mutex for long somewhere deep in fuse_fsync.

> For general filesystem (xfs/ext4)

What makes xfs/ext4 general? They are *local* fs having almost direct 
access to underlying HDD. Have you explored how CEPH implements fsync?

> we grab i_mutex only on write
> path, fsync is lockless.

What about the following part of ext4_sync_file:

>     if (!journal) {
>         ret = generic_file_fsync(file, start, end, datasync);
? Is it lockless?

> But int case of fuse we artificially introduce
> i_mutex inside fsync

exactly the same as generic_file_fsync()

> which basically kill concurrency for upper FS.
> As result we have SMP scalability as we have in Linux-v2.2 with single
> mutex in VFS.

You can't derive "kill concurrency" from "introduce i_mutex" as simple 
as that. It really does matter for how long the mutex is held and how 
often it contends with writes.

>
> BTW: I'm wondering why do we care about mtime at all. for fs-in-file
> we can relax that, for example flush mtime only on fsync, and not for fdatasync.

Good catch, makes sense!

>
>>
>> 2) Will the patch (in the form I sent it) break something? I think no.
>> If you know some usecase that can be broken, let's discuss it in more
>> details.
>>
>>
>> 3) Should we expect some noticeable (or significant) improvement in
>> performance comparing fuse_fsync with no locking at all vs. the locking
>> we have with that patch applied? I tend to think that the answer is "no"
>> because handling FUSE_FSYNC is notoriously heavy-weight operation. If
>> you disagree, let's firstly measure that difference in performance
>> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then
>> start to think if it's really worthy to fully re-work locking scheme to
>> preserve flush_mtime correctness w/o i_mutex.
>>
>>
>> Thanks,
>>
>> Maxim
>>
>>
>> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>>> can be done only under i_mutex.
>>>
>>> IMHO it is only due to bad implementation.
>>> If fuse_set_nowrite would be done with separate
>>> count instead of adding negative bias, it would
>>> be possible.
>>>
>>>
>>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet at virtuozzo.com> wrote:
>>>> Hello!
>>>>
>>>> I do not think you got it right.
>>>>
>>>> i_mutex in fsync is not about some atomicity,
>>>> it is about stopping data feed while fsync is executed
>>>> to prevent livelock.
>>>>
>>>> I cannot tell anything about mtime update, it is just some voodoo
>>>> magic for me.
>>>>
>>>> What's about fsync semantics, I see two different ways:
>>>>
>>>> A.
>>>>
>>>> 1. Remove useless write_inode_now. Its work is done
>>>>       by filemap_write_and_wait_range(), there is no need to repeat it
>>>> under mutex.
>>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>>       fuse continuation forfilemap_write_and_wait_range().
>>>> 3. i_mutex is preserved only around fsync call.
>>>>
>>>> B.
>>>> 1. Remove  write_inode_now as well.
>>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>>       and add release after call to FSYNC.
>>>>
>>>> Both prevent livelock. B is obviosly optimal.
>>>>
>>>> But A preserves historic fuse protocol semantics.
>>>> F.e. I have no idea would user space survive truncate
>>>> racing with fsync. pstorage should survice, though this
>>>> path was never tested.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov at virtuozzo.com> wrote:
>>>>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>>>>> fuse_flush_mtime(). But when those operations are done, it's actually
>>>>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>>>>> or not: we ensured that all relevant data were already seen by
>>>>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>>>>> anyway; if the user screws up by leaking new data updates in-between, it
>>>>> is up to the user and doesn't violate fsync(2) semantics.
>>>>>
>>>>> https://jira.sw.ru/browse/PSBM-55919
>>>>>
>>>>> Signed-off-by: Maxim Patlasov <mpatlasov at virtuozzo.com>
>>>>> ---
>>>>>    fs/fuse/file.c |    3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>> index 464b2f5..559dfd9 100644
>>>>> --- a/fs/fuse/file.c
>>>>> +++ b/fs/fuse/file.c
>>>>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>>                   goto out;
>>>>>           }
>>>>>
>>>>> +       mutex_unlock(&inode->i_mutex);
>>>>> +
>>>>>           memset(&inarg, 0, sizeof(inarg));
>>>>>           inarg.fh = ff->fh;
>>>>>           inarg.fsync_flags = datasync ? 1 : 0;
>>>>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>>                           fc->no_fsync = 1;
>>>>>                   err = 0;
>>>>>           }
>>>>> +       return err;
>>>>>    out:
>>>>>           mutex_unlock(&inode->i_mutex);
>>>>>           return err;
>>>>>