[Devel] [PATCH VZ10] fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop
Konstantin Khorenko
khorenko at virtuozzo.com
Tue Mar 24 20:41:38 MSK 2026
Lesha, please review the code and send an ack.
+ AI review:
Review: fs/fuse: revamp fuse_invalidate_files() to avoid blocking the userspace evloop
CRITICAL BUGS
1. `list_add_tail` arguments are swapped (2 places)
In fuse_inval_files_work(), both error paths have the arguments reversed:
list_add_tail(&failed_list, &fi->inval_files_entry); // WRONG
list_add_tail(new, head) — first arg is the new element, second is the list head. Must be:
list_add_tail(&fi->inval_files_entry, &failed_list); // CORRECT
This will corrupt the linked list, leading to crashes. Appears on lines 233 and 247 of the patch.
2. `iput()` called under spinlock — can sleep
In the new early-return path in fuse_invalidate_files():
spin_lock(&fi->lock);
if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
iput(inode); // <-- BUG: can sleep under spinlock!
spin_unlock(&fi->lock);
return 0;
}
iput() can call iput_final() which sleeps (writeback, eviction). The spin_unlock must come before iput:
spin_lock(&fi->lock);
if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
spin_unlock(&fi->lock);
iput(inode);
return 0;
}
MODERATE ISSUES
3. Incorrect nodeid in trace output
nodeid = get_node_id(&fi->inode) - FUSE_ROOT_ID;
FUSE_ROOT_ID is 1. This subtracts 1 from every nodeid in trace messages, producing wrong values.
Should just be
get_node_id(&fi->inode).
4. `INIT_LIST_HEAD(&fi->inval_files_entry)` missing in `fuse_alloc_inode()`
The new inval_files_entry list_head is never initialized at inode allocation. With
CONFIG_DEBUG_LIST, list_add_tail
after a list_del on a poisoned node will trigger warnings. Should add
INIT_LIST_HEAD(&fi->inval_files_entry) alongside
existing INIT_LIST_HEAD(&fi->rw_files).
5. `create_workqueue()` is deprecated
fuse_inval_files_wq = create_workqueue("fuse_inval_files_wq");
Should use alloc_workqueue():
fuse_inval_files_wq = alloc_workqueue("fuse_inval_files", WQ_UNBOUND, 0);
6. Hot retry loop without delay
When filemap_write_and_wait() or invalidate_inode_pages2() fails, the work is immediately
re-queued. If the error
persists, this will spin the CPU hot. Should use queue_delayed_work() with a delay.
7. Comment in `fuse_writepages_fill` is now stale
/* More than optimization: writeback pages to /dev/null; fused would
* drop our FUSE_WRITE requests anyway, but it will be blocked while
* sending NOTIFY_INVAL_FILES until we return!
*/
This comment says "it will be blocked while sending NOTIFY_INVAL_FILES until we return" — but after
this patch,
fuse_invalidate_files() no longer blocks. The return value change from 0 to -EIO also changes the
semantics: -EIO will
propagate errors to the writeback machinery rather than silently skipping.
STYLE / MINOR
8. Missing space:
fuse_inval_files_wq= create_workqueue(...)
// ^ missing space
9. Missing space:
while(!list_empty(&inval_files_list)) {
// ^ missing space
10. `fuse_ktrace` macro:
• Uses __FUNCTION__ — should be __func__ (C99 standard)
• Missing space: __fc,"%s: " → __fc, "%s: "
• Very long single-line macro, hard to read
11. Missing space in trace call:
fuse_ktrace(ff->fm->fc,"fuse_file[...] // missing space after comma
────────────────────────────────────────
Verdict
Patch needs rework. The swapped list_add_tail arguments are an instant crash, and iput under
spinlock is a
sleeping-in-atomic bug. The nodeid offset is also clearly wrong.
--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team
On 3/19/26 17:14, Liu Kui wrote:
> On large files, fuse_invalidate_files() can take a very long time to complete.
> This is caused by two slow operations that cannot be optimized:
> - filemap_write_and_wait() when the file is under heavy write load, and
> - invalidate_inode_pages2() when the page cache is heavily populated.
>
> These long delays block the userspace evloop (which must not be blocked) and
> can trigger a shaman reboot in the worst case.
>
> To fix this, the following changes are made:
>
> 1. Move the execution of filemap_write_and_wait() and invalidate_inode_pages2()
> into a dedicated kernel workqueue item.
>
> 2. In fuse_invalidate_files(), only set the FUSE_I_INVAL_FILES bit in fi->state
> and schedule the invalidation work for the fuse_inode.
>
> 3. Block new opens of the file while the FUSE_I_INVAL_FILES bit is set.
> The bit is cleared only after the file has been fully invalidated.
> This is necessary because userspace views the file as fully invalidated
> as soon as fuse_invalidate_files() returns.
>
> Additionally, make the fuse trace function available in fuse module so
> that fuse_invalidate_files events can be traced and logged.
>
> Related to
> https://virtuozzo.atlassian.net/browse/VSTOR-124254
>
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
> fs/fuse/dev.c | 2 +-
> fs/fuse/file.c | 26 +++++--
> fs/fuse/fuse_i.h | 16 ++++-
> fs/fuse/inode.c | 112 ++++++++++++++++++++++++-----
> fs/fuse/kio/pcs/pcs_fuse_kdirect.c | 32 +++++++--
> 5 files changed, 157 insertions(+), 31 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index c1102069d032..4fcfd644dcf6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -110,7 +110,7 @@ static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> return !fc->initialized || (for_background && fc->blocked);
> }
>
> -static void fuse_drop_waiting(struct fuse_conn *fc)
> +void fuse_drop_waiting(struct fuse_conn *fc)
> {
> /*
> * lockess check of fc->connected is okay, because atomic_dec_and_test()
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0860996c19ad..4037b4c53df4 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -252,10 +252,11 @@ static void fuse_link_rw_file(struct file *file)
> struct fuse_file *ff = file->private_data;
>
> spin_lock(&fi->lock);
> - if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> + if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> spin_lock(&ff->lock);
> set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> spin_unlock(&ff->lock);
> + fuse_ktrace(ff->fm->fc,"fuse_file[%llu] --> invalidate_file on [%llu] pending", ff->fh, ff->nodeid);
> }
> if (list_empty(&ff->rw_entry))
> list_add(&ff->rw_entry, &fi->rw_files);
> @@ -319,6 +320,13 @@ static int fuse_open(struct inode *inode, struct file *file)
> if ((file->f_flags & O_DIRECT) && !fc->direct_enable)
> return -EINVAL;
>
> + if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> + fuse_ktrace(fc, "waiting for invalidate_file on [%llu] to complete", fi->nodeid);
> + err = wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> + if (err)
> + return err;
> + }
> +
> err = generic_file_open(inode, file);
> if (err)
> return err;
> @@ -361,8 +369,6 @@ static int fuse_open(struct inode *inode, struct file *file)
> inode_unlock(inode);
>
> if (!err && fc->close_wait) {
> - struct fuse_inode *fi = get_fuse_inode(inode);
> -
> inode_lock(inode);
> spin_lock(&fi->lock);
>
> @@ -1409,6 +1415,12 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
> return err;
> }
>
> + /*
> + * Block read if the file had been invalidated.
> + */
> + if (fuse_file_fail_immediately(iocb->ki_filp->private_data))
> + return -EIO;
> +
> return generic_file_read_iter(iocb, to);
> }
>
> @@ -1794,6 +1806,12 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
> goto writethrough;
> }
>
> + /*
> + * Block write if the file had been invalidated.
> + */
> + if (fuse_file_fail_immediately(file->private_data))
> + return -EIO;
> +
> return generic_file_write_iter(iocb, from);
> }
>
> @@ -2710,7 +2728,7 @@ static int fuse_writepages_fill(struct folio *folio,
> */
> if (!wpa && test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> unlock_page(&folio->page);
> - return 0;
> + return -EIO;
> }
>
> if (wpa && fuse_writepage_need_send(fc, &folio->page, ap, data)) {
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 853bf12e282d..f8e4a4d2119d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -215,6 +215,9 @@ struct fuse_inode {
> atomic_t read_count;
> atomic_t write_count;
> } dio;
> +
> + /** Entry on fc->inval_files_list list */
> + struct list_head inval_files_entry;
> };
>
> /** FUSE inode state bits */
> @@ -1110,7 +1113,13 @@ struct fuse_conn {
> } kio;
>
> int ktrace_level;
> - struct fuse_ktrace * ktrace;
> + struct fuse_ktrace *ktrace;
> + void (*fuse_ktrace_fn)(struct fuse_conn *fc, const char *fmt, ...);
> +
> + /* List of fuse_inodes to be invalidated by userspace */
> + struct list_head inval_files_list;
> + struct work_struct inval_files_work;
> +
> struct dentry *conn_ctl;
>
> /* New writepages go into this bucket */
> @@ -1122,6 +1131,8 @@ struct fuse_conn {
> #endif
> };
>
> +#define fuse_ktrace(fc, fmt, args...) do { struct fuse_conn *__fc = (fc); if (__fc->fuse_ktrace_fn) __fc->fuse_ktrace_fn(__fc,"%s: " fmt, __FUNCTION__, ## args); } while (0)
> +
> /*
> * Represents a mounted filesystem, potentially a submount.
> *
> @@ -1552,7 +1563,7 @@ static inline void fuse_dio_wait(struct fuse_inode *fi)
>
> static inline bool fuse_file_fail_immediately(struct fuse_file *ff)
> {
> - return ff && test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> + return unlikely(ff && test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state));
> }
>
> /**
> @@ -1717,6 +1728,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
>
> struct fuse_kio_ops *fuse_kio_get(struct fuse_conn *fc, char *name);
> void fuse_kio_put(struct fuse_kio_ops *ops);
> +void fuse_drop_waiting(struct fuse_conn *fc);
>
> /* passthrough.c */
> static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index f167d275885b..4b60b9ed91b2 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -35,6 +35,8 @@ struct list_head fuse_conn_list;
> DEFINE_MUTEX(fuse_mutex);
> EXPORT_SYMBOL_GPL(fuse_mutex);
>
> +struct workqueue_struct *fuse_inval_files_wq;
> +
> static int fuse_ve_odirect;
>
> static int set_global_limit(const char *val, const struct kernel_param *kp);
> @@ -603,12 +605,81 @@ void fuse_unlock_inode(struct inode *inode, bool locked)
> mutex_unlock(&get_fuse_inode(inode)->mutex);
> }
>
> +static void fuse_inval_files_work(struct work_struct *w)
> +{
> + struct fuse_conn *fc = container_of(w, struct fuse_conn, inval_files_work);
> + struct list_head inval_files_list;
> + struct list_head failed_list;
> + struct fuse_file *ff;
> + struct fuse_inode *fi;
> + bool to_retry;
> + int err;
> +
> + INIT_LIST_HEAD(&inval_files_list);
> + INIT_LIST_HEAD(&failed_list);
> +
> + spin_lock(&fc->lock);
> + list_splice_init(&fc->inval_files_list, &inval_files_list);
> + to_retry = fc->connected;
> + spin_unlock(&fc->lock);
> +
> + while(!list_empty(&inval_files_list)) {
> + u64 nodeid;
> +
> + fi = list_first_entry(&inval_files_list, struct fuse_inode, inval_files_entry);
> + list_del(&fi->inval_files_entry);
> + nodeid = get_node_id(&fi->inode) - FUSE_ROOT_ID;
> + fuse_ktrace(fc, "invalidate_file on [%llu] starts", nodeid);
> +
> + err = filemap_write_and_wait(fi->inode.i_mapping);
> + if (err && err != -EIO && to_retry) {
> + fuse_ktrace(fc, "filemap_write_and_wait() on [%llu] returns err=%d", nodeid, err);
> + list_add_tail(&failed_list, &fi->inval_files_entry);
> + continue;
> + }
> +
> + spin_lock(&fi->lock);
> + list_for_each_entry(ff, &fi->rw_files, rw_entry)
> + fuse_revoke_readpages(ff);
> + spin_unlock(&fi->lock);
> +
> + wake_up(&fi->page_waitq); /* readpage[s] can wait on fuse wb */
> +
> + err = invalidate_inode_pages2(fi->inode.i_mapping);
> + if (err && to_retry) {
> + fuse_ktrace(fc, "invalidate_inode_pages2() on [%llu] returns err=%d", nodeid, err);
> + list_add_tail(&failed_list, &fi->inval_files_entry);
> + continue;
> + }
> +
> + fuse_invalidate_attr(&fi->inode);
> +
> + spin_lock(&fi->lock);
> + clear_bit(FUSE_I_INVAL_FILES, &fi->state);
> + wake_up_bit(&fi->state, FUSE_I_INVAL_FILES);
> + spin_unlock(&fi->lock);
> +
> + fuse_ktrace(fc, "invalidate_file on [%llu] ends", nodeid);
> + iput(&fi->inode);
> + }
> +
> + if (!list_empty(&failed_list)) {
> + spin_lock(&fc->lock);
> + list_splice_init(&failed_list, &fc->inval_files_list);
> + spin_unlock(&fc->lock);
> + if (queue_work(fuse_inval_files_wq, &fc->inval_files_work))
> + return;
> + }
> +
> + fuse_drop_waiting(fc);
> +}
> +
> int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
> {
> struct inode *inode;
> struct fuse_inode *fi;
> struct fuse_file *ff;
> - int err, i;
> + int i;
>
> if (!fc->async_read) {
> printk(KERN_ERR "Turn async_read ON to use "
> @@ -624,6 +695,11 @@ int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
>
> /* Mark that invalidate files is in progress */
> spin_lock(&fi->lock);
> + if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
> + iput(inode);
> + spin_unlock(&fi->lock);
> + return 0;
> + }
> set_bit(FUSE_I_INVAL_FILES, &fi->state);
> list_for_each_entry(ff, &fi->rw_files, rw_entry) {
> spin_lock(&ff->lock);
> @@ -638,23 +714,14 @@ int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
> for (i = 0; i < FUSE_QHASH_SIZE; i++)
> wake_up_all(&fc->qhash[i].waitq);
>
> - err = filemap_write_and_wait(inode->i_mapping);
> - if (!err || err == -EIO) { /* AS_EIO might trigger -EIO */
> - spin_lock(&fi->lock);
> - list_for_each_entry(ff, &fi->rw_files, rw_entry)
> - fuse_revoke_readpages(ff);
> - spin_unlock(&fi->lock);
> -
> - wake_up(&fi->page_waitq); /* readpage[s] can wait on fuse wb */
> - err = invalidate_inode_pages2(inode->i_mapping);
> - }
> -
> - if (!err)
> - fuse_invalidate_attr(inode);
> + atomic_inc(&fc->num_waiting);
> + spin_lock(&fc->lock);
> + list_add_tail(&fi->inval_files_entry, &fc->inval_files_list);
> + spin_unlock(&fc->lock);
> + if (!queue_work(fuse_inval_files_wq, &fc->inval_files_work))
> + fuse_drop_waiting(fc);
>
> - clear_bit(FUSE_I_INVAL_FILES, &fi->state);
> - iput(inode);
> - return err;
> + return 0;
> }
>
> static void fuse_umount_begin(struct super_block *sb)
> @@ -1308,6 +1375,8 @@ int fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> fuse_backing_files_init(fc);
>
> + INIT_LIST_HEAD(&fc->inval_files_list);
> + INIT_WORK(&fc->inval_files_work, fuse_inval_files_work);
> INIT_LIST_HEAD(&fc->mounts);
> list_add(&fm->fc_entry, &fc->mounts);
> fm->fc = fc;
> @@ -2456,13 +2525,17 @@ static int __init fuse_fs_init(void)
> {
> int err;
>
> + fuse_inval_files_wq= create_workqueue("fuse_inval_files_wq");
> + if (!fuse_inval_files_wq)
> + goto out;
> +
> fuse_inode_cachep = kmem_cache_create("fuse_inode",
> sizeof(struct fuse_inode), 0,
> SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT,
> fuse_inode_init_once);
> err = -ENOMEM;
> if (!fuse_inode_cachep)
> - goto out;
> + goto out1;
>
> err = register_fuseblk();
> if (err)
> @@ -2478,6 +2551,8 @@ static int __init fuse_fs_init(void)
> unregister_fuseblk();
> out2:
> kmem_cache_destroy(fuse_inode_cachep);
> + out1:
> + destroy_workqueue(fuse_inval_files_wq);
> out:
> return err;
> }
> @@ -2493,6 +2568,7 @@ static void fuse_fs_cleanup(void)
> */
> rcu_barrier();
> kmem_cache_destroy(fuse_inode_cachep);
> + destroy_workqueue(fuse_inval_files_wq);
> }
>
> static struct kobject *fuse_kobj;
> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> index eafe2ee2313b..a4c41dc7f60f 100644
> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> @@ -158,6 +158,7 @@ MODULE_PARM_DESC(rdmaio_io_failing, "Enable/Disbale RDMA io failing");
>
> static int fuse_ktrace_setup(struct fuse_conn * fc);
> static int fuse_ktrace_remove(struct fuse_conn *fc);
> +static void kfuse_trace(struct fuse_conn *fc, const char *fmt, ...);
>
> static struct kmem_cache *pcs_fuse_req_cachep;
> static struct kmem_cache *pcs_ireq_cachep;
> @@ -1672,6 +1673,8 @@ static int fuse_ktrace_setup(struct fuse_conn * fc)
> goto err;
> }
>
> + fc->fuse_ktrace_fn = kfuse_trace;
> +
> return 0;
>
> err:
> @@ -1680,22 +1683,19 @@ static int fuse_ktrace_setup(struct fuse_conn * fc)
> return ret;
> }
>
> -void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ...)
> +static void kfuse_tracer(struct fuse_conn * fc, unsigned long ip, const char *fmt, va_list va)
> {
> - struct fuse_ktrace * tr;
> - va_list va;
> + struct fuse_ktrace *tr;
> int cpu;
>
> cpu = get_cpu();
> tr = fc->ktrace;
> if (tr) {
> u8 * buf = per_cpu_ptr(tr->buf, cpu);
> - struct fuse_trace_hdr * t;
> + struct fuse_trace_hdr *t;
> int len;
>
> - va_start(va, fmt);
> len = vsnprintf(buf, KTRACE_LOG_BUF_SIZE, fmt, va);
> - va_end(va);
> t = fuse_trace_prepare(tr, FUSE_KTRACE_STRING, len + 1);
> if (t)
> memcpy(t + 1, buf, len + 1);
> @@ -1710,6 +1710,26 @@ void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ..
> put_cpu();
> }
>
> +void __kfuse_trace(struct fuse_conn * fc, unsigned long ip, const char * fmt, ...)
> +{
> + va_list va;
> +
> + va_start(va, fmt);
> + kfuse_tracer(fc, ip, fmt, va);
> + va_end(va);
> +}
> +
> +static void kfuse_trace(struct fuse_conn *fc, const char *fmt, ...)
> +{
> + va_list va;
> +
> + if (fc->ktrace_level >= LOG_TRACE) {
> + va_start(va, fmt);
> + kfuse_tracer(fc, 0, fmt, va);
> + va_end(va);
> + }
> +}
> +
> void pcs_kio_file_list(struct fuse_conn *fc, kio_file_itr kfile_cb, void *ctx)
> {
> struct fuse_file *ff;
More information about the Devel
mailing list