[Devel] [PATCH VZ10] fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop

Konstantin Khorenko khorenko at virtuozzo.com
Tue Mar 24 20:41:38 MSK 2026


Lesha, please review the code and send an ack.

+ AI review:

   Review: fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop



   CRITICAL BUGS

   1. `list_add_tail` arguments are swapped (2 places)
   In fuse_inval_files_work(), both error paths have the arguments reversed:

   list_add_tail(&failed_list, &fi->inval_files_entry);  // WRONG

   list_add_tail(new, head) — first arg is the new element, second is the list head. Must be:

   list_add_tail(&fi->inval_files_entry, &failed_list);  // CORRECT

   This will corrupt the linked list, leading to crashes. Appears on lines 233 and 247 of the patch.
   2. `iput()` called under spinlock — can sleep
   In the new early-return path in fuse_invalidate_files():

   spin_lock(&fi->lock);
   if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
       iput(inode);            // <-- BUG: can sleep under spinlock!
       spin_unlock(&fi->lock);
       return 0;
   }

   iput() can call iput_final() which sleeps (writeback, eviction). The spin_unlock must come before iput:

   spin_lock(&fi->lock);
   if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
       spin_unlock(&fi->lock);
       iput(inode);
       return 0;
   }



   MODERATE ISSUES

   3. Incorrect nodeid in trace output

   nodeid = get_node_id(&fi->inode) - FUSE_ROOT_ID;

   FUSE_ROOT_ID is 1. This subtracts 1 from every nodeid in trace messages, producing wrong values. 
Should just be
   get_node_id(&fi->inode).
   4. `INIT_LIST_HEAD(&fi->inval_files_entry)` missing in `fuse_alloc_inode()`
   The new inval_files_entry list_head is never initialized at inode allocation. With 
CONFIG_DEBUG_LIST, list_add_tail
   after a list_del on a poisoned node will trigger warnings. Should add 
INIT_LIST_HEAD(&fi->inval_files_entry) alongside
    existing INIT_LIST_HEAD(&fi->rw_files).
   5. `create_workqueue()` is deprecated

   fuse_inval_files_wq = create_workqueue("fuse_inval_files_wq");

   Should use alloc_workqueue():

   fuse_inval_files_wq = alloc_workqueue("fuse_inval_files", WQ_UNBOUND, 0);

   6. Hot retry loop without delay
   When filemap_write_and_wait() or invalidate_inode_pages2() fails, the work is immediately 
re-queued. If the error
   persists, this will spin the CPU hot. Should use queue_delayed_work() with a delay.
   7. Comment in `fuse_writepages_fill` is now stale

   /* More than optimization: writeback pages to /dev/null; fused would
    * drop our FUSE_WRITE requests anyway, but it will be blocked while
    * sending NOTIFY_INVAL_FILES until we return!
    */

   This comment says "it will be blocked while sending NOTIFY_INVAL_FILES until we return" — but after 
this patch,
   fuse_invalidate_files() no longer blocks. The return value change from 0 to -EIO also changes the 
semantics: -EIO will
    propagate errors to the writeback machinery rather than silently skipping.


   STYLE / MINOR

   8. Missing space:

   fuse_inval_files_wq= create_workqueue(...)
   //                 ^ missing space

   9. Missing space:

   while(!list_empty(&inval_files_list)) {
   //   ^ missing space

   10. `fuse_ktrace` macro:
   • Uses __FUNCTION__ — should be __func__ (C99 standard)
   • Missing space: __fc,"%s: " → __fc, "%s: "
   • Very long single-line macro, hard to read

   11. Missing space in trace call:

   fuse_ktrace(ff->fm->fc,"fuse_file[...]  // missing space after comma


   ────────────────────────────────────────



   Verdict

   Patch needs rework. The swapped list_add_tail arguments are an instant crash, and iput under 
spinlock is a
   sleeping-in-atomic bug. The nodeid offset is also clearly wrong.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 3/19/26 17:14, Liu Kui wrote:
> On large files, fuse_invalidate_files() can take a very long time to complete.
> This is caused by two slow operations that cannot be optimized:
>   - filemap_write_and_wait() when the file is under heavy write load, and
>   - invalidate_inode_pages2() when the page cache is heavily populated.
> 
> These long delays block the userspace evloop (which must not be blocked) and
> can trigger a shaman reboot in the worst case.
> 
> To fix this, the following changes are made:
> 
> 1. Move the execution of filemap_write_and_wait() and invalidate_inode_pages2()
>     into a dedicated kernel workqueue item.
> 
> 2. In fuse_invalidate_files(), only set the FUSE_I_INVAL_FILES bit in fi->state
>     and schedule the invalidation work for the fuse_inode.
> 
> 3. Block new opens of the file while the FUSE_I_INVAL_FILES bit is set.
>     The bit is cleared only after the file has been fully invalidated.
>     This is necessary because userspace views the file as fully invalidated
>     as soon as fuse_invalidate_files() returns.
> 
> Additionally, make the fuse trace function available in fuse module so
> that fuse_invalidate_files events can be traced and logged.
> 
> Related to
> https://virtuozzo.atlassian.net/browse/VSTOR-124254
> 
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
>   fs/fuse/dev.c                      |   2 +-
>   fs/fuse/file.c                     |  26 +++++--
>   fs/fuse/fuse_i.h                   |  16 ++++-
>   fs/fuse/inode.c                    | 112 ++++++++++++++++++++++++-----
>   fs/fuse/kio/pcs/pcs_fuse_kdirect.c |  32 +++++++--
>   5 files changed, 157 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index c1102069d032..4fcfd644dcf6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -110,7 +110,7 @@ static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
>   	return !fc->initialized || (for_background && fc->blocked);
>   }
>   
> -static void fuse_drop_waiting(struct fuse_conn *fc)
> +void fuse_drop_waiting(struct fuse_conn *fc)
>   {
>   	/*
>   	 * lockess check of fc->connected is okay, because atomic_dec_and_test()
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0860996c19ad..4037b4c53df4 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -252,10 +252,11 @@ static void fuse_link_rw_file(struct file *file)
>   	struct fuse_file *ff = file->private_data;
>   
>   	spin_lock(&fi->lock);
> -	if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> +	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
>   		spin_lock(&ff->lock);
>   		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
>   		spin_unlock(&ff->lock);
> +		fuse_ktrace(ff->fm->fc,"fuse_file[%llu] --> invalidate_file on [%llu] pending", ff->fh, ff->nodeid);
>   	}
>   	if (list_empty(&ff->rw_entry))
>   		list_add(&ff->rw_entry, &fi->rw_files);
> @@ -319,6 +320,13 @@ static int fuse_open(struct inode *inode, struct file *file)
>   	if ((file->f_flags & O_DIRECT) && !fc->direct_enable)
>   		return -EINVAL;
>   
> +	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> +		fuse_ktrace(fc, "waiting for invalidate_file on [%llu] to complete", fi->nodeid);
> +		err = wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> +		if (err)
> +			return err;
> +	}
> +
>   	err = generic_file_open(inode, file);
>   	if (err)
>   		return err;
> @@ -361,8 +369,6 @@ static int fuse_open(struct inode *inode, struct file *file)
>   		inode_unlock(inode);
>   
>   	if (!err && fc->close_wait) {
> -		struct fuse_inode *fi = get_fuse_inode(inode);
> -
>   		inode_lock(inode);
>   		spin_lock(&fi->lock);
>   
> @@ -1409,6 +1415,12 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   			return err;
>   	}
>   
> +	/*
> +	 * Block read if the file had been invalidated.
> +	 */
> +	 if (fuse_file_fail_immediately(iocb->ki_filp->private_data))
> +		return -EIO;
> +
>   	return generic_file_read_iter(iocb, to);
>   }
>   
> @@ -1794,6 +1806,12 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   			goto writethrough;
>   		}
>   
> +		/*
> +		 * Block write if the file had been invalidated.
> +		 */
> +		if (fuse_file_fail_immediately(file->private_data))
> +			return -EIO;
> +
>   		return generic_file_write_iter(iocb, from);
>   	}
>   
> @@ -2710,7 +2728,7 @@ static int fuse_writepages_fill(struct folio *folio,
>   	 */
>   	if (!wpa && test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
>   		unlock_page(&folio->page);
> -		return 0;
> +		return -EIO;
>   	}
>   
>   	if (wpa && fuse_writepage_need_send(fc, &folio->page, ap, data)) {
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 853bf12e282d..f8e4a4d2119d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -215,6 +215,9 @@ struct fuse_inode {
>   		atomic_t read_count;
>   		atomic_t write_count;
>   	} dio;
> +
> +	/** Entry on fc->inval_files_list list */
> +	struct list_head inval_files_entry;
>   };
>   
>   /** FUSE inode state bits */
> @@ -1110,7 +1113,13 @@ struct fuse_conn {
>   	} kio;
>   
>   	int ktrace_level;
> -	struct fuse_ktrace * ktrace;
> +	struct fuse_ktrace *ktrace;
> +	void (*fuse_ktrace_fn)(struct fuse_conn *fc, const char *fmt, ...);
> +
> +	/* List of fuse_inodes to be invalidated by userspace */
> +	struct list_head inval_files_list;
> +	struct work_struct inval_files_work;
> +
>   	struct dentry *conn_ctl;
>   
>   	/* New writepages go into this bucket */
> @@ -1122,6 +1131,8 @@ struct fuse_conn {
>   #endif
>   };
>   
> +#define fuse_ktrace(fc, fmt, args...) do { struct fuse_conn *__fc = (fc); if (__fc->fuse_ktrace_fn) __fc->fuse_ktrace_fn(__fc,"%s: " fmt, __FUNCTION__,  ## args); } while (0)
> +
>   /*
>    * Represents a mounted filesystem, potentially a submount.
>    *
> @@ -1552,7 +1563,7 @@ static inline void fuse_dio_wait(struct fuse_inode *fi)
>   
>   static inline bool fuse_file_fail_immediately(struct fuse_file *ff)
>   {
> -	return ff && test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> +	return unlikely(ff && test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state));
>   }
>   
>   /**
> @@ -1717,6 +1728,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
>   
>   struct fuse_kio_ops *fuse_kio_get(struct fuse_conn *fc, char *name);
>   void fuse_kio_put(struct fuse_kio_ops *ops);
> +void fuse_drop_waiting(struct fuse_conn *fc);
>   
>   /* passthrough.c */
>   static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index f167d275885b..4b60b9ed91b2 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -35,6 +35,8 @@ struct list_head fuse_conn_list;
>   DEFINE_MUTEX(fuse_mutex);
>   EXPORT_SYMBOL_GPL(fuse_mutex);
>   
> +struct workqueue_struct *fuse_inval_files_wq;
> +
>   static int fuse_ve_odirect;
>   
>   static int set_global_limit(const char *val, const struct kernel_param *kp);
> @@ -603,12 +605,81 @@ void fuse_unlock_inode(struct inode *inode, bool locked)
>   		mutex_unlock(&get_fuse_inode(inode)->mutex);
>   }
>   
> +static void fuse_inval_files_work(struct work_struct *w)
> +{
> +	struct fuse_conn *fc = container_of(w, struct fuse_conn, inval_files_work);
> +	struct list_head inval_files_list;
> +	struct list_head failed_list;
> +	struct fuse_file *ff;
> +	struct fuse_inode *fi;
> +	bool to_retry;
> +	int err;
> +
> +	INIT_LIST_HEAD(&inval_files_list);
> +	INIT_LIST_HEAD(&failed_list);
> +
> +	spin_lock(&fc->lock);
> +	list_splice_init(&fc->inval_files_list, &inval_files_list);
> +	to_retry = fc->connected;
> +	spin_unlock(&fc->lock);
> +
> +	while(!list_empty(&inval_files_list)) {
> +		u64 nodeid;
> +
> +		fi = list_first_entry(&inval_files_list, struct fuse_inode, inval_files_entry);
> +		list_del(&fi->inval_files_entry);
> +		nodeid = get_node_id(&fi->inode) - FUSE_ROOT_ID;
> +		fuse_ktrace(fc, "invalidate_file on [%llu] starts", nodeid);
> +
> +		err = filemap_write_and_wait(fi->inode.i_mapping);
> +		if (err && err != -EIO && to_retry) {
> +			fuse_ktrace(fc, "filemap_write_and_wait() on [%llu] returns err=%d", nodeid, err);
> +			list_add_tail(&failed_list, &fi->inval_files_entry);
> +			continue;
> +		}
> +
> +		spin_lock(&fi->lock);
> +		list_for_each_entry(ff, &fi->rw_files, rw_entry)
> +			fuse_revoke_readpages(ff);
> +		spin_unlock(&fi->lock);
> +
> +		wake_up(&fi->page_waitq); /* readpage[s] can wait on fuse wb */
> +
> +		err = invalidate_inode_pages2(fi->inode.i_mapping);
> +		if (err && to_retry) {
> +			fuse_ktrace(fc, "invalidate_inode_pages2() on [%llu] returns err=%d", nodeid, err);
> +			list_add_tail(&failed_list, &fi->inval_files_entry);
> +			continue;
> +		}
> +
> +		fuse_invalidate_attr(&fi->inode);
> +
> +		spin_lock(&fi->lock);
> +		clear_bit(FUSE_I_INVAL_FILES, &fi->state);
> +		wake_up_bit(&fi->state, FUSE_I_INVAL_FILES);
> +		spin_unlock(&fi->lock);
> +
> +		fuse_ktrace(fc, "invalidate_file on [%llu] ends", nodeid);
> +		iput(&fi->inode);
> +	}
> +
> +	if (!list_empty(&failed_list)) {
> +		spin_lock(&fc->lock);
> +		list_splice_init(&failed_list, &fc->inval_files_list);
> +		spin_unlock(&fc->lock);
> +		if (queue_work(fuse_inval_files_wq, &fc->inval_files_work))
> +			return;
> +	}
> +
> +	fuse_drop_waiting(fc);
> +}
> +
>   int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
>   {
>   	struct inode *inode;
>   	struct fuse_inode *fi;
>   	struct fuse_file *ff;
> -	int err, i;
> +	int i;
>   
>   	if (!fc->async_read) {
>   		printk(KERN_ERR "Turn async_read ON to use "
> @@ -624,6 +695,11 @@ int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
>   
>   	/* Mark that invalidate files is in progress */
>   	spin_lock(&fi->lock);
> +	if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> +		iput(inode);
> +		spin_unlock(&fi->lock);
> +		return 0;
> +	}
>   	set_bit(FUSE_I_INVAL_FILES, &fi->state);
>   	list_for_each_entry(ff, &fi->rw_files, rw_entry) {
>   		spin_lock(&ff->lock);
> @@ -638,23 +714,14 @@ int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
>   	for (i = 0; i < FUSE_QHASH_SIZE; i++)
>   		wake_up_all(&fc->qhash[i].waitq);
>   
> -	err = filemap_write_and_wait(inode->i_mapping);
> -	if (!err || err == -EIO) { /* AS_EIO might trigger -EIO */
> -		spin_lock(&fi->lock);
> -		list_for_each_entry(ff, &fi->rw_files, rw_entry)
> -			fuse_revoke_readpages(ff);
> -		spin_unlock(&fi->lock);
> -
> -		wake_up(&fi->page_waitq); /* readpage[s] can wait on fuse wb */
> -		err = invalidate_inode_pages2(inode->i_mapping);
> -	}
> -
> -	if (!err)
> -		fuse_invalidate_attr(inode);
> +	atomic_inc(&fc->num_waiting);
> +	spin_lock(&fc->lock);
> +	list_add_tail(&fi->inval_files_entry, &fc->inval_files_list);
> +	spin_unlock(&fc->lock);
> +	if (!queue_work(fuse_inval_files_wq, &fc->inval_files_work))
> +		fuse_drop_waiting(fc);
>   
> -	clear_bit(FUSE_I_INVAL_FILES, &fi->state);
> -	iput(inode);
> -	return err;
> +	return 0;
>   }
>   
>   static void fuse_umount_begin(struct super_block *sb)
> @@ -1308,6 +1375,8 @@ int fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>   	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>   		fuse_backing_files_init(fc);
>   
> +	INIT_LIST_HEAD(&fc->inval_files_list);
> +	INIT_WORK(&fc->inval_files_work, fuse_inval_files_work);
>   	INIT_LIST_HEAD(&fc->mounts);
>   	list_add(&fm->fc_entry, &fc->mounts);
>   	fm->fc = fc;
> @@ -2456,13 +2525,17 @@ static int __init fuse_fs_init(void)
>   {
>   	int err;
>   
> +	fuse_inval_files_wq= create_workqueue("fuse_inval_files_wq");
> +	if (!fuse_inval_files_wq)
> +		goto out;
> +
>   	fuse_inode_cachep = kmem_cache_create("fuse_inode",
>   			sizeof(struct fuse_inode), 0,
>   			SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT,
>   			fuse_inode_init_once);
>   	err = -ENOMEM;
>   	if (!fuse_inode_cachep)
> -		goto out;
> +		goto out1;
>   
>   	err = register_fuseblk();
>   	if (err)
> @@ -2478,6 +2551,8 @@ static int __init fuse_fs_init(void)
>   	unregister_fuseblk();
>    out2:
>   	kmem_cache_destroy(fuse_inode_cachep);
> + out1:
> +	destroy_workqueue(fuse_inval_files_wq);
>    out:
>   	return err;
>   }
> @@ -2493,6 +2568,7 @@ static void fuse_fs_cleanup(void)
>   	 */
>   	rcu_barrier();
>   	kmem_cache_destroy(fuse_inode_cachep);
> +	destroy_workqueue(fuse_inval_files_wq);
>   }
>   
>   static struct kobject *fuse_kobj;
> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> index eafe2ee2313b..a4c41dc7f60f 100644
> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> @@ -158,6 +158,7 @@ MODULE_PARM_DESC(rdmaio_io_failing, "Enable/Disbale RDMA io failing");
>   
>   static int fuse_ktrace_setup(struct fuse_conn * fc);
>   static int fuse_ktrace_remove(struct fuse_conn *fc);
> +static void kfuse_trace(struct fuse_conn *fc, const char *fmt, ...);
>   
>   static struct kmem_cache *pcs_fuse_req_cachep;
>   static struct kmem_cache *pcs_ireq_cachep;
> @@ -1672,6 +1673,8 @@ static int fuse_ktrace_setup(struct fuse_conn * fc)
>   		goto err;
>   	}
>   
> +	fc->fuse_ktrace_fn = kfuse_trace;
> +
>   	return 0;
>   
>   err:
> @@ -1680,22 +1683,19 @@ static int fuse_ktrace_setup(struct fuse_conn * fc)
>   	return ret;
>   }
>   
> -void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ...)
> +static void kfuse_tracer(struct fuse_conn * fc, unsigned long ip, const char *fmt, va_list va)
>   {
> -	struct fuse_ktrace * tr;
> -        va_list va;
> +	struct fuse_ktrace *tr;
>   	int cpu;
>   
>   	cpu = get_cpu();
>   	tr = fc->ktrace;
>   	if (tr) {
>   		u8 * buf = per_cpu_ptr(tr->buf, cpu);
> -		struct fuse_trace_hdr * t;
> +		struct fuse_trace_hdr *t;
>   		int len;
>   
> -		va_start(va, fmt);
>   		len = vsnprintf(buf, KTRACE_LOG_BUF_SIZE, fmt, va);
> -		va_end(va);
>   		t = fuse_trace_prepare(tr, FUSE_KTRACE_STRING, len + 1);
>   		if (t)
>   			memcpy(t + 1, buf, len + 1);
> @@ -1710,6 +1710,26 @@ void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ..
>   	put_cpu();
>   }
>   
> +void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ...)
> +{
> +	va_list va;
> +
> +	va_start(va, fmt);
> +	kfuse_tracer(fc, ip, fmt, va);
> +	va_end(va);
> +}
> +
> +static void kfuse_trace(struct fuse_conn *fc, const char *fmt, ...)
> +{
> +	va_list va;
> +
> +	if (fc->ktrace_level >= LOG_TRACE) {
> +		va_start(va, fmt);
> +		kfuse_tracer(fc, 0, fmt, va);
> +		va_end(va);
> +	}
> +}
> +
>   void pcs_kio_file_list(struct fuse_conn *fc, kio_file_itr kfile_cb, void *ctx)
>   {
>   	struct fuse_file *ff;



More information about the Devel mailing list