[Devel] [PATCH VZ10 v3] fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop

Fri May 15 16:12:15 MSK 2026

On 4/20/26 05:58, Liu Kui wrote:
> On large files, fuse_invalidate_files() can take very long time to complete.
> This is caused by two slow operations that cannot be optimized:
>   - filemap_write_and_wait() when the file is under heavy write load, and
>   - invalidate_inode_pages2() when the page cache is heavily populated.
> 
> These long delays block the userspace evloop (which must not be blocked) and
> can trigger a shaman reboot in the worst case.
> 
> To fix this, the following changes are made:
> 
> 1. Move the slow cache invalidation work into a dedicated kernel workqueue
>     item and replace filemap_write_and_wait() + invalidate_inode_pages2() with
>     truncate_pagecache_range() to simplify cache invalidation.
> 
> 2. In fuse_invalidate_files(), only set the FUSE_I_INVAL_FILES bit in fi->state
>     and schedule the invalidation work for the fuse_inode.
> 
> 3. Block new opens of the file while the FUSE_I_INVAL_FILES bit is set.
>     The bit is cleared only after the file has been fully invalidated.
>     This is necessary because userspace views the file as fully invalidated
>     as soon as fuse_invalidate_files() returns.
> 
> Additionally, make the fuse trace function available in fuse module so
> that fuse_invalidate_files events can be traced and logged.
> 
> Related to
> https://virtuozzo.atlassian.net/browse/VSTOR-124254
> 
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
...

> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0860996c19ad..11fb3996a2ac 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -252,10 +252,11 @@ static void fuse_link_rw_file(struct file *file)
>   	struct fuse_file *ff = file->private_data;
>   
>   	spin_lock(&fi->lock);
> -	if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> +	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
>   		spin_lock(&ff->lock);
>   		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
>   		spin_unlock(&ff->lock);
> +		fuse_ktrace(ff->fm->fc, "fuse_file[%llu] --> invalidate_file on [%llu] pending", ff->fh, ff->nodeid);

In fuse_inval_files_work() you print fi->inode - FUSE_ROOT_ID
and said this is intentional, to match the userspace values.

Why don't we subtract FUSE_ROOT_ID here as well?

>   	}
>   	if (list_empty(&ff->rw_entry))
>   		list_add(&ff->rw_entry, &fi->rw_files);
> @@ -319,6 +320,13 @@ static int fuse_open(struct inode *inode, struct file *file)
>   	if ((file->f_flags & O_DIRECT) && !fc->direct_enable)
>   		return -EINVAL;
>   
> +	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> +		fuse_ktrace(fc, "waiting for invalidate_file on [%llu] to complete", fi->nodeid);

FUSE_ROOT_ID is not subtracted here either.
Is that intentional? If so, why?

> +		err = wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> +		if (err)
> +			return err;
> +	}
> +
>   	err = generic_file_open(inode, file);
>   	if (err)
>   		return err;
...

>   static void fuse_umount_begin(struct super_block *sb)
> @@ -1308,6 +1350,9 @@ int fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>   	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>   		fuse_backing_files_init(fc);
>   
> +	INIT_LIST_HEAD(&fc->inval_files_list);
> +	INIT_DELAYED_WORK(&fc->inval_files_work, fuse_inval_files_work);
> +
>   	INIT_LIST_HEAD(&fc->mounts);
>   	list_add(&fm->fc_entry, &fc->mounts);
>   	fm->fc = fc;
> @@ -2454,15 +2499,18 @@ static void fuse_inode_init_once(void *foo)
>   
>   static int __init fuse_fs_init(void)
>   {
> -	int err;
> +	int err = -ENOMEM;
> +
> +	fuse_inval_files_wq = alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM, 1);

   The workqueue is created with max_active=1:

   fuse_inval_files_wq = alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM, 1);

   Each fuse_conn has its own delayed_work (fc->inval_files_work), but the global workqueue runs only 
one work item at a time across all connections.

   The bottleneck is truncate_pagecache_range(): for each inode it walks the page cache xarray, locks 
each page, removes it, and frees it. For a 10 GB cached file that's ~2.6M pages and several seconds of 
work. For 100 GB+ files it can take tens of seconds.
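
   To make the page-count estimate above reproducible, here is a small userspace sketch of the
arithmetic. It assumes 4 KiB pages and no large folios (both are my assumptions, not something
the patch states); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, no large folios in the page cache. */
#define PAGE_SIZE_BYTES 4096ULL

/* Number of page-cache pages needed to hold `bytes` of file data, rounded up. */
uint64_t pages_for_bytes(uint64_t bytes)
{
	return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Convenience wrapper for sizes given in GiB. */
uint64_t pages_for_gib(uint64_t gib)
{
	/* Guard against overflow of the shift below. */
	assert(gib <= (UINT64_MAX >> 30));
	return pages_for_bytes(gib << 30);
}
```

   pages_for_gib(10) gives 2621440, which is the ~2.6M figure quoted above; a 100 GiB file is ten
times that. With large folios the number of xarray entries to walk would be smaller.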

   This matters during a storage node failover: all fuse connections using that node receive 
invalidation notifications at roughly the same time. With max_active=1 they are processed strictly 
sequentially. While connection A is truncating a large file, connection B's inodes sit in the queue 
with FUSE_I_INVAL_FILES set. New fuse_open() calls on those inodes block in wait_on_bit(), so 
applications in connection B's VM see hung opens caused by connection A's workload.
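
   The head-of-line blocking described above can be illustrated with a toy userspace model. This
is only a sketch: it assumes all work items are queued at t=0 and dispatched in FIFO order to
whichever of max_active slots frees up first, which is a simplification of real workqueue
behaviour. The function name, MAX_SLOTS and the durations used below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 64	/* arbitrary cap for this model only */

/*
 * FIFO dispatch model: all n items are queued at t=0 and started in order,
 * each on whichever of `max_active` slots becomes free first.
 * Stores each item's completion time in done[] and returns the time at
 * which the last item finishes.
 */
double simulate_fifo(const double *dur, size_t n, unsigned max_active,
		     double *done)
{
	double slot_free[MAX_SLOTS] = { 0 };
	double makespan = 0.0;

	assert(max_active >= 1 && max_active <= MAX_SLOTS);

	for (size_t i = 0; i < n; i++) {
		unsigned best = 0;

		/* Pick the slot that becomes free earliest. */
		for (unsigned s = 1; s < max_active; s++)
			if (slot_free[s] < slot_free[best])
				best = s;

		slot_free[best] += dur[i];
		done[i] = slot_free[best];
		if (done[i] > makespan)
			makespan = done[i];
	}
	return makespan;
}
```

   With one 30-second item followed by three 1-second items, max_active=1 makes the short items
finish at 31, 32 and 33 seconds, while max_active=4 lets each of them finish after 1 second. In
this model the completion time of a short item corresponds to how long an open on that inode
would stay blocked in wait_on_bit().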

   Consider increasing max_active to allow some parallelism (e.g. 4 or 8), or use max_active = 0, 
which selects the default (256).

   Also, without WQ_UNBOUND the workqueue is "bound" - each work item is pinned to the CPU that queued 
it. For a short task this is good (cache locality), but truncate_pagecache_range can run for seconds, 
tying up that CPU's worker thread while the scheduler cannot migrate it. Adding WQ_UNBOUND lets the 
scheduler run the work on any available CPU.

   alloc_workqueue("fuse_inval_files_wq", WQ_MEM_RECLAIM | WQ_UNBOUND, 8);

   Note that with WQ_UNBOUND, max_active is enforced per NUMA node rather than globally, so 
max_active=1 on a two-node machine would allow two work items to run in parallel (one per node).

> +	if (!fuse_inval_files_wq)
> +		goto out;
>   
>   	fuse_inode_cachep = kmem_cache_create("fuse_inode",
>   			sizeof(struct fuse_inode), 0,
>   			SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT,
>   			fuse_inode_init_once);
> -	err = -ENOMEM;
>   	if (!fuse_inode_cachep)
> -		goto out;
> +		goto out1;
>   
>   	err = register_fuseblk();
>   	if (err)