[Devel] [PATCH VZ10] fs/fuse: order FUSE_OPEN reply against FUSE_NOTIFY_INVAL_FILES

Konstantin Khorenko khorenko at virtuozzo.com
Mon Jun 15 19:16:37 MSK 2026


  1. (Blocking, needs confirmation) The counter is per-fuse_dev, but ordering must be per-fuse_conn

  write_counter lives in struct fuse_dev and starts at 0 for each device. FUSE supports FUSE_DEV_IOC_CLONE
  (fs/fuse/dev.c:2980, fuse_device_clone() at 2825), so a multithreaded daemon typically has many fuse_devs per connection,
  each with an independent counter sequence.

  The comparison in fuse_link_rw_file():

  if (ff->open_ts < fi->inval_files_ts)
      set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);

  mixes a timestamp from the device that delivered the OPEN reply with one from the device that delivered the notification.
  In the common libfuse-style layout, notifications go out on the session's primary fd while replies go out on the
  per-request channel (often a clone) - so these two timestamps come from unrelated counters and the < comparison is
  meaningless. The patch even threads fud all the way into fuse_invalidate_files() specifically to read that device's
  counter, which makes the cross-device mismatch concrete.

  This only works if the vStorage daemon guarantees that the OPEN reply and the INVAL_FILES notification for a given inode
  always traverse the same /dev/fuse instance. That needs to be confirmed explicitly. If it can't be guaranteed, the counter
  should move to struct fuse_conn as an atomic64_t, bumped with atomic64_inc_return() at the top of fuse_dev_do_write().
  That also makes the whole scheme robust regardless of the daemon's threading model.

  2. (Medium) The atomic create+open path never sets ff->open_ts

  Only fuse_send_open() propagates args.reply_ts into ff->open_ts. The FUSE_CREATE path in fuse_create_open()
  (fs/fuse/dir.c:620+) allocates ff via fuse_file_alloc() (kzalloc, so open_ts == 0), sends its own request via
  fuse_simple_idmap_request(), and then reaches fuse_finish_open() → fuse_link_rw_file() with open_ts still 0.

  For a brand-new inode (inval_files_ts == 0) 0 < 0 is false, so no harm. But fuse_iget() can return a reused inode whose
  inval_files_ts > 0 from an earlier invalidation; then a perfectly fresh create+open is 0 < inval_files_ts → spuriously
  flagged FUSE_S_FAIL_IMMEDIATELY, and the caller gets immediate IO errors. The create path should also copy the CREATE
  reply's reply_ts into ff->open_ts.

  3. (Low) write_counter++ is non-atomic

  The increment in fuse_dev_do_write() is a plain ++. It's safe only if each fuse_dev is written by exactly one daemon
  thread. If any deployment shares one fd across threads, the increment races and reply_ts can be torn / reordered relative
  to its own request. An atomic64_inc_return() (per the fix in #1) removes this entirely.

  4. (Low / verify) fuse_finish_open() can now sleep under inode_lock + nowrite

  fuse_finish_open() previously always returned 0; it now calls fuse_wait_on_inval_files() (a wait_on_bit(TASK_KILLABLE))
  for every caller. In fuse_open() this runs while holding inode_lock and after fuse_set_nowrite() for the wb/dax-truncate
  case (fs/fuse/file.c:337-343). fuse_inval_files_work() only revokes readpages and truncates the page cache (no
  nowrite/inode_lock dependency), so I don't see a deadlock - but this is a new blocking point under those locks and is
  worth a deliberate sign-off. The error return is handled correctly: fuse_open() calls fuse_sync_release() on non-zero
  (line 344-345), and err is 0 from the earlier successful fuse_file_io_open() in the FAIL-immediately branch, so no
  uninitialized return.

  5. (Nit) Comment wording

  dev.c: "Monotonic timestamp matching the daemon's reply/notification send order" - it's a write counter, not a timestamp;
  and it matches the kernel's processing order, which equals the daemon's send order only on a single fd. Minor, but the
  wording currently papers over exactly the assumption in issue #1.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 6/5/26 11:18, Liu Kui wrote:
> The userspace daemon's send order of a FUSE_OPEN reply and a
> FUSE_NOTIFY_INVAL_FILES notification can be reversed by the time the
> kernel processes them. The kernel must distinguish the two cases:
> 
>   - reply sent before the notification: the open is stale and must
>     be flagged FUSE_S_FAIL_IMMEDIATELY;
>   - reply sent after the notification: the open is valid but must
>     wait for fuse_inval_files_work() to complete before returning,
>     or the caller hits IO errors on an inode still being invalidated.
> 
> The FUSE_I_INVAL_FILES bit carries no ordering and cannot tell these
> apart. Add a per-fuse_dev write_counter bumped on every reply/
> notification write, and saved to ff->open_ts on OPEN replies and
> fi->inval_file_ts on FUSE_NOTIFY_INVAL_FILES, and compare them in
> fuse_link_rw_file() to flag stale opens. fuse_finish_open() then
> waits on FUSE_I_INVAL_FILES for the remaining case.
> 
> https://virtuozzo.atlassian.net/browse/VSTOR-133093
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
>  fs/fuse/dev.c    | 17 ++++++++++----
>  fs/fuse/file.c   | 60 +++++++++++++++++++++++++++++++++++-------------
>  fs/fuse/fuse_i.h | 14 ++++++++++-
>  fs/fuse/inode.c  | 25 ++++++++++++--------
>  4 files changed, 85 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 4fcfd644dcf6..873d01163ead 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2003,9 +2003,10 @@ static int fuse_notify_resend(struct fuse_conn *fc)
>  	return 0;
>  }
>  
> -static int fuse_notify_inval_files(struct fuse_conn *fc, unsigned int size,
> +static int fuse_notify_inval_files(struct fuse_dev *fud, unsigned int size,
>  				   struct fuse_copy_state *cs)
>  {
> +	struct fuse_conn *fc = fud->fc;
>  	struct fuse_notify_inval_files_out outarg;
>  	int err = -EINVAL;
>  
> @@ -2019,7 +2020,7 @@ static int fuse_notify_inval_files(struct fuse_conn *fc, unsigned int size,
>  
>  	down_read(&fc->killsb);
>  
> -	err = fuse_invalidate_files(fc, outarg.ino);
> +	err = fuse_invalidate_files(fud, outarg.ino);
>  
>  	up_read(&fc->killsb);
>  	return err;
> @@ -2029,9 +2030,11 @@ static int fuse_notify_inval_files(struct fuse_conn *fc, unsigned int size,
>  	return err;
>  }
>  
> -static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
> +static int fuse_notify(struct fuse_dev *fud, enum fuse_notify_code code,
>  		       unsigned int size, struct fuse_copy_state *cs)
>  {
> +	struct fuse_conn *fc = fud->fc;
> +
>  	/* Don't try to move pages (yet) */
>  	cs->move_pages = 0;
>  
> @@ -2062,7 +2065,7 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
>  		return fuse_notify_resend(fc);
>  
>  	case FUSE_NOTIFY_INVAL_FILES:
> -		return fuse_notify_inval_files(fc, size, cs);
> +		return fuse_notify_inval_files(fud, size, cs);
>  
>  	default:
>  		fuse_copy_finish(cs);
> @@ -2375,6 +2378,9 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
>  	struct fuse_req *req;
>  	struct fuse_out_header oh;
>  
> +	/* Monotonic timestamp matching the daemon's reply/notification send order. */
> +	fud->write_counter++;
> +
>  	err = -EINVAL;
>  	if (nbytes < sizeof(struct fuse_out_header))
>  		goto out;
> @@ -2392,7 +2398,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
>  	 * and error contains notification code.
>  	 */
>  	if (!oh.unique) {
> -		err = fuse_notify(fc, oh.error, nbytes - sizeof(oh), cs);
> +		err = fuse_notify(fud, oh.error, nbytes - sizeof(oh), cs);
>  		goto out;
>  	}
>  
> @@ -2462,6 +2468,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
>  		list_del_init(&req->list);
>  	spin_unlock(&fpq->lock);
>  
> +	req->args->reply_ts = fud->write_counter;
>  	fuse_request_end(req);
>  out:
>  	return err ? err : nbytes;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index edea59ee4839..206e24a5dbce 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -45,10 +45,11 @@ static DECLARE_WORK(fuse_fput_work, fuse_fput_routine);
>  
>  static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
>  			  unsigned int open_flags, int opcode,
> -			  struct fuse_open_out *outargp)
> +			  struct fuse_open_out *outargp, uint64_t *reply_ts)
>  {
>  	struct fuse_open_in inarg;
>  	FUSE_ARGS(args);
> +	int err;
>  
>  	memset(&inarg, 0, sizeof(inarg));
>  	inarg.flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
> @@ -69,7 +70,14 @@ static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
>  	args.out_args[0].size = sizeof(*outargp);
>  	args.out_args[0].value = outargp;
>  
> -	return fuse_simple_request(fm, &args);
> +	err = fuse_simple_request(fm, &args);
> +	/*
> +	 * Propagate the dev write timestamp so fuse_link_rw_file() can order
> +	 * this open against any concurrent FUSE_NOTIFY_INVAL_FILES.
> +	 */
> +	*reply_ts = args.reply_ts;
> +
> +	return err;
>  }
>  
>  struct fuse_file *fuse_file_alloc(struct fuse_mount *fm, bool release)
> @@ -192,7 +200,7 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
>  		struct fuse_open_out *outargp = &ff->args->open_outarg;
>  		int err;
>  
> -		err = fuse_send_open(fm, nodeid, open_flags, opcode, outargp);
> +		err = fuse_send_open(fm, nodeid, open_flags, opcode, outargp, &ff->open_ts);
>  		if (!err) {
>  			ff->fh = outargp->fh;
>  			ff->open_flags = outargp->open_flags;
> @@ -245,6 +253,18 @@ static void fuse_link_write_file(struct file *file)
>  	spin_unlock(&fi->lock);
>  }
>  
> +static int fuse_wait_on_inval_files(struct inode *inode)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_mount(inode)->fc;
> +
> +	if (likely(!test_bit(FUSE_I_INVAL_FILES, &fi->state)))
> +		return 0;
> +
> +	fuse_ktrace(fc, "waiting for invalidate_files on [%llu] to complete", fi->nodeid);
> +	return wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> +}
> +
>  static void fuse_link_rw_file(struct file *file)
>  {
>  	struct inode *inode = file_inode(file);
> @@ -252,14 +272,17 @@ static void fuse_link_rw_file(struct file *file)
>  	struct fuse_file *ff = file->private_data;
>  
>  	spin_lock(&fi->lock);
> -	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> -		spin_lock(&ff->lock);
> -		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> -		spin_unlock(&ff->lock);
> -		fuse_ktrace(ff->fm->fc, "fuse_file[%llu] --> invalidate_file on [%llu] pending", ff->fh, ff->nodeid);
> -	}
>  	if (list_empty(&ff->rw_entry))
>  		list_add(&ff->rw_entry, &fi->rw_files);
> +
> +	/*
> +	 * If this open reply was received before the last
> +	 * FUSE_NOTIFY_INVAL_FILES on this inode, the open is
> +	 * stale, fail it immediately.
> +	 */
> +	if (ff->open_ts < fi->inval_files_ts)
> +		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> +
>  	spin_unlock(&fi->lock);
>  }
>  
> @@ -287,7 +310,15 @@ int fuse_finish_open(struct inode *inode, struct file *file)
>  
>  	fuse_link_rw_file(file);
>  
> -	return 0;
> +	/*
> +	 * Wait for fuse_inval_files_work() on this inode to complete
> +	 * before returning, otherwise the caller can hit IO errors
> +	 * on an inode still being invalidated.
> +	 */
> +	if (!test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state))
> +		err = fuse_wait_on_inval_files(inode);
> +
> +	return err;
>  }
>  
>  static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> @@ -320,12 +351,9 @@ static int fuse_open(struct inode *inode, struct file *file)
>  	if ((file->f_flags & O_DIRECT) && !fc->direct_enable)
>  		return -EINVAL;
>  
> -	if (unlikely(test_bit(FUSE_I_INVAL_FILES, &fi->state))) {
> -		fuse_ktrace(fc, "waiting for invalidate_file on [%llu] to complete", fi->nodeid);
> -		err = wait_on_bit(&fi->state, FUSE_I_INVAL_FILES, TASK_KILLABLE);
> -		if (err)
> -			return err;
> -	}
> +	err = fuse_wait_on_inval_files(inode);
> +	if (err)
> +		return err;
>  
>  	err = generic_file_open(inode, file);
>  	if (err)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 966c8e6c2ab7..925233427339 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -218,6 +218,9 @@ struct fuse_inode {
>  
>  	/** Entry on fc->inval_files_list list */
>  	struct list_head inval_files_entry;
> +
> +	/** fud->write_counter when the last FUSE_NOTIFY_INVAL_FILES notification was received */
> +	uint64_t inval_files_ts;
>  };
>  
>  /** FUSE inode state bits */
> @@ -315,6 +318,9 @@ struct fuse_file {
>  
>  	/** List of requests that may be killed **/
>  	struct list_head revoke_list;
> +
> +	/** fud->write_counter when the FUSE_OPEN reply was received */
> +	uint64_t open_ts;
>  };
>  
>  /** FUSE file states (ff_state) */
> @@ -379,6 +385,9 @@ struct fuse_args {
>  
>  	/** Fuse file used in the request or NULL*/
>  	struct fuse_file *ff;
> +
> +	/** fud->write_counter when the reply was received */
> +	uint64_t reply_ts;
>  };
>  
>  struct fuse_args_pages {
> @@ -625,6 +634,9 @@ struct fuse_dev {
>  
>  	/** list entry on fc->devices */
>  	struct list_head entry;
> +
> +	/** Monotonic timestamp, bumped on every reply/notification write */
> +	uint64_t write_counter;
>  };
>  
>  enum fuse_dax_mode {
> @@ -1604,7 +1616,7 @@ int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid,
>   * File-system tells the kernel to invalidate all fuse-files (and cache)
>   * for the given node id.
>   */
> -int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid);
> +int fuse_invalidate_files(struct fuse_dev *fud, u64 nodeid);
>  
>  int fuse_do_open(struct fuse_mount *fm, u64 nodeid, struct file *file,
>  		 bool isdir);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 1e0c86f4b37a..dc603c649eef 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -118,6 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>  	fi->submount_lookup = NULL;
>  	fi->i_size_unstable = 0;
>  	fi->private = NULL;
> +	fi->inval_files_ts = 0;
>  	INIT_LIST_HEAD(&fi->rw_files);
>  	INIT_LIST_HEAD(&fi->inval_files_entry);
>  	mutex_init(&fi->mutex);
> @@ -623,7 +624,7 @@ static void fuse_inval_files_work(struct work_struct *w)
>  
>  		fi = list_first_entry(&inval_files_list, struct fuse_inode, inval_files_entry);
>  		list_del(&fi->inval_files_entry);
> -		fuse_ktrace(fc, "invalidate_file on [%llu] starts", fi->nodeid);
> +		fuse_ktrace(fc, "invalidate_files on [%llu] starts", fi->nodeid);
>  
>  		spin_lock(&fi->lock);
>  		list_for_each_entry(ff, &fi->rw_files, rw_entry)
> @@ -640,15 +641,16 @@ static void fuse_inval_files_work(struct work_struct *w)
>  		wake_up_bit(&fi->state, FUSE_I_INVAL_FILES);
>  		spin_unlock(&fi->lock);
>  
> -		fuse_ktrace(fc, "invalidate_file on [%llu] ends", fi->nodeid);
> +		fuse_ktrace(fc, "invalidate_files on [%llu] ends", fi->nodeid);
>  		iput(&fi->inode);
>  	}
>  
>  	fuse_drop_waiting(fc);
>  }
>  
> -int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
> +int fuse_invalidate_files(struct fuse_dev *fud, u64 nodeid)
>  {
> +	struct fuse_conn *fc = fud->fc;
>  	struct inode *inode;
>  	struct fuse_inode *fi;
>  	struct fuse_file *ff;
> @@ -666,19 +668,23 @@ int fuse_invalidate_files(struct fuse_conn *fc, u64 nodeid)
>  
>  	fi = get_fuse_inode(inode);
>  
> -	/* Mark that invalidate files is in progress */
> +	/*
> +	 * Save this notification's timestamp and fail every fuse_file
> +	 * currently linked on the inode.
> +	 */
>  	spin_lock(&fi->lock);
> +	fi->inval_files_ts = fud->write_counter;
> +	list_for_each_entry(ff, &fi->rw_files, rw_entry) {
> +		spin_lock(&ff->lock);
> +		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> +		spin_unlock(&ff->lock);
> +	}
>  	if (test_bit(FUSE_I_INVAL_FILES, &fi->state)) {
>  		spin_unlock(&fi->lock);
>  		iput(inode);
>  		return 0;
>  	}
>  	set_bit(FUSE_I_INVAL_FILES, &fi->state);
> -	list_for_each_entry(ff, &fi->rw_files, rw_entry) {
> -		spin_lock(&ff->lock);
> -		set_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state);
> -		spin_unlock(&ff->lock);
> -	}
>  	fi->i_size_unstable = 1;
>  	spin_unlock(&fi->lock);
>  
> @@ -1921,6 +1927,7 @@ struct fuse_dev *fuse_dev_alloc(void)
>  
>  	fud->pq.processing = pq;
>  	fuse_pqueue_init(&fud->pq);
> +	fud->write_counter = 0;
>  
>  	return fud;
>  }



More information about the Devel mailing list