[Devel] [PATCH VZ10 v3] fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop
Konstantin Khorenko
khorenko at virtuozzo.com
Fri May 15 16:12:15 MSK 2026
On 4/20/26 05:58, Liu Kui wrote:
> On large files, fuse_invalidate_files() can take very long time to complete.
> This is caused by two slow operations that cannot be optimized:
> - filemap_write_and_wait() when the file is under heavy write load, and
> - invalidate_inode_pages2() when the page cache is heavily populated.
>
> These long delays block the userspace evloop (which must not be blocked) and
> can trigger a shaman reboot in the worst case.
>
> To fix this, the following changes are made:
>
> 1. Move the slow cache invalidation work into a dedicated kernel workqueue
> item and replace filemap_write_and_wait() + invalidate_inode_pages2() with
> truncate_pagecache_range() to simplify cache invalidation.
>
> 2. In fuse_invalidate_files(), only set the FUSE_I_INVAL_FILES bit in fi->state
> and schedule the invalidation work for the fuse_inode.
>
> 3. Block new opens of the file while the FUSE_I_INVAL_FILES bit is set.
> The bit is cleared only after the file has been fully invalidated.
> This is necessary because userspace views the file as fully invalidated
> as soon as fuse_invalidate_files() returns.
>
> Additionally, make the fuse trace function available in fuse module so
> that fuse_invalidate_files events can be traced and logged.
>
> Related to
> https://virtuozzo.atlassian.net/browse/VSTOR-124254
>
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
...
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0860996c19ad..11fb3996a2ac 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -252,10 +252,11 @@ static void fuse_link_rw_file(struct file *file)
> struct fuse_file *ff = file->private_data;
>
> spin_lock(&fi->lock);
> - if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> + if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> spin_lock(&ff->lock);
> set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> spin_unlock(&ff->lock);
> + fuse_ktrace(ff->fm->fc, "fuse_file[%llu] --> invalidate_file on [%llu] pending", ff->fh, ff->nodeid);
In fuse_inval_files_work() you print fi->inode - FUSE_ROOT_ID
and said this is intentional, to match the values userspace uses.
Why don't we subtract FUSE_ROOT_ID here as well?
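One way to keep the printed IDs consistent would be to route every trace call through a single helper. A minimal userspace sketch, assuming the offset is always FUSE_ROOT_ID; fuse_trace_id() is a hypothetical name and not part of the patch, and FUSE_ROOT_ID is redefined locally only so the snippet compiles outside the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel definition in include/uapi/linux/fuse.h. */
#define FUSE_ROOT_ID 1

/*
 * Hypothetical helper (not in the patch): convert a kernel nodeid into the
 * value printed in trace messages, so every call site applies the same offset.
 */
static uint64_t fuse_trace_id(uint64_t nodeid)
{
	assert(nodeid >= FUSE_ROOT_ID);
	return nodeid - FUSE_ROOT_ID;
}
```

With such a helper, the trace calls in fuse_link_rw_file(), fuse_open() and fuse_inval_files_work() would all print the same number for the same inode.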
> }
> if (list_empty(&ff->rw_entry))
> list_add(&ff->rw_entry, &fi->rw_files);
> @@ -319,6 +320,13 @@ static int fuse_open(struct inode *inode, struct file *file)
> if ((file->f_flags & O_DIRECT) && !fc->direct_enable)
> return -EINVAL;
>
> + if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> + fuse_ktrace(fc, "waiting for invalidate_file on [%llu] to complete", fi->nodeid);
No subtraction of FUSE_ROOT_ID here either.
Is that intentional? If so, why?
> + err = wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> + if (err)
> + return err;
> + }
> +
> err = generic_file_open(inode, file);
> if (err)
> return err;
...
> static void fuse_umount_begin(struct super_block *sb)
> @@ -1308,6 +1350,9 @@ int fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> fuse_backing_files_init(fc);
>
> + INIT_LIST_HEAD(&fc->inval_files_list);
> + INIT_DELAYED_WORK(&fc->inval_files_work, fuse_inval_files_work);
> +
> INIT_LIST_HEAD(&fc->mounts);
> list_add(&fm->fc_entry, &fc->mounts);
> fm->fc = fc;
> @@ -2454,15 +2499,18 @@ static void fuse_inode_init_once(void *foo)
>
> static int __init fuse_fs_init(void)
> {
> - int err;
> + int err = -ENOMEM;
> +
> + fuse_inval_files_wq = alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM, 1);
The workqueue is created with max_active=1:
fuse_inval_files_wq = alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM, 1);
Each fuse_conn has its own delayed_work (fc->inval_files_work), but the global workqueue runs only
one work item at a time across all connections.
The bottleneck is truncate_pagecache_range(): for each inode it walks the page cache xarray, locks
each page, removes it, and frees it. For a 10 GB cached file that's ~2.6M pages and several seconds of
work. For 100 GB+ files it can take tens of seconds.
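The page count above is plain arithmetic. A small sketch, assuming 4 KiB pages (the x86-64 default; other page sizes give different counts); pages_for_bytes() is an illustrative name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of page-cache pages needed to cover `bytes` of file data,
 * rounded up to whole pages. The page size is passed in explicitly;
 * 4096 is an assumption, not something the patch fixes.
 */
static uint64_t pages_for_bytes(uint64_t bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return (bytes + page_size - 1) / page_size;
}
```

pages_for_bytes(10ULL << 30, 4096) gives 2621440, i.e. the ~2.6M pages quoted above, and a 100 GB file gives ten times that.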
This matters during a storage node failover: all fuse connections using that node receive
invalidation notifications at roughly the same time. With max_active=1 they are processed strictly
sequentially. While connection A is truncating a large file, connection B's inodes sit in the queue
with FUSE_I_INVAL_FILES set. New fuse_open() calls on those inodes block in wait_on_bit(), so
applications in connection B's VM see hung opens caused by connection A's workload.
Consider increasing max_active to allow some parallelism (e.g. 4 or 8), or pass max_active = 0,
which selects the default (256).
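To put rough numbers on the effect of max_active, here is a deliberately simple model. It assumes all work items are queued at once, each takes the same time, and nothing else contends for the workers; worst_case_finish_secs() and the example figures are illustrative, not measurements:

```c
#include <assert.h>

/*
 * Rough bound, in seconds, on when the last of `n_items` queued work items
 * finishes if at most `max_active` items run concurrently and each one takes
 * `secs_per_item` seconds. Hypothetical model: equal durations, all items
 * queued simultaneously, no other contention.
 */
static unsigned int worst_case_finish_secs(unsigned int n_items,
					   unsigned int max_active,
					   unsigned int secs_per_item)
{
	unsigned int waves;

	assert(max_active > 0);
	/* Items complete in ceil(n_items / max_active) back-to-back waves. */
	waves = (n_items + max_active - 1) / max_active;
	return waves * secs_per_item;
}
```

Under these assumptions, 16 connections with a 10-second invalidation each finish after 160 seconds with max_active=1, and after 20 seconds with max_active=8. An open() blocked behind the last item waits correspondingly long.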
Also, without WQ_UNBOUND the workqueue is "bound" - each work item is pinned to the CPU that queued
it. For a short task this is good (cache locality), but truncate_pagecache_range can run for seconds,
tying up that CPU's worker thread while the scheduler cannot migrate it. Adding WQ_UNBOUND lets the
scheduler run the work on any available CPU.
alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM | WQ_UNBOUND, 8);
Note that with WQ_UNBOUND, max_active is enforced per NUMA node rather than globally, so
max_active=1 on a two-node machine would allow two work items to run in parallel (one per node).
> + if (!fuse_inval_files_wq)
> + goto out;
>
> fuse_inode_cachep = kmem_cache_create("fuse_inode",
> sizeof(struct fuse_inode), 0,
> SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT,
> fuse_inode_init_once);
> - err = -ENOMEM;
> if (!fuse_inode_cachep)
> - goto out;
> + goto out1;
>
> err = register_fuseblk();
> if (err)