[CRIU] [RFC PATCH v2 8/9] namespace: add sb_revalidate_bindmounts helper

Tue Apr 4 00:14:23 MSK 2023

On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> Useful if for some reason bindmounts root dentries get invalidated
> but it's needed to revalidate existing bindmounts without remounting.
> 
> Cc: Miklos Szeredi <mszeredi at redhat.com>
> Cc: Al Viro <viro at zeniv.linux.org.uk>
> Cc: Amir Goldstein <amir73il at gmail.com>
> Cc: Stéphane Graber <stgraber at ubuntu.com>
> Cc: Seth Forshee <sforshee at kernel.org>
> Cc: Christian Brauner <brauner at kernel.org>
> Cc: Andrei Vagin <avagin at gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> Cc: Bernd Schubert <bschubert at ddn.com>
> Cc: linux-fsdevel at vger.kernel.org
> Cc: linux-kernel at vger.kernel.org
> Cc: criu at openvz.org
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn at canonical.com>
> ---
>  fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
>  include/linux/mnt_namespace.h |  3 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index bc0f15257b49..b74d00d6abb0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
>  	return ret;
>  }
>  
> +struct bind_mount_list_item {
> +	struct list_head list;
> +	struct vfsmount *mnt;
> +};
> +
> +/*
> + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> + *
> + * Useful if for some reason bindmount root dentries get invalidated
> + * but it's needed to revalidate existing bindmounts without remounting.
> + */
> +int sb_revalidate_bindmounts(struct super_block *sb)

It's difficult to not strongly dislike this patchset on the basis of the
need for a function like this alone. And I just have to say it even if I
normally try not to do this: This whole function is bonkers in my opinion.

But leaving that aside for a second. This really needs detailed
explanations on locking, assumptions, and an explanation what you're
doing here. This looks crazy to me and definitely isn't fit to be a
generic helper in this form.

> +{
> +	struct mount *mnt;
> +	struct bind_mount_list_item *bmnt, *next;
> +	int err = 0;
> +	struct vfsmount *root_mnt = NULL;
> +	LIST_HEAD(mnt_to_update);
> +	char *buf;
> +
> +	buf = (char *) __get_free_page(GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	lock_mount_hash();
> +	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> +		/* we only want to touch bindmounts */
> +		if (mnt->mnt.mnt_root == sb->s_root) {
> +			if (!root_mnt)
> +				root_mnt = mntget(&mnt->mnt);
> +
> +			continue;
> +		}
> +
> +		bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);

Allocating under lock_mount_hash() even if doable with this flag
combination should be avoided at all costs imho. That just seems hacky.

> +		if (!bmnt) {
> +			err = -ENOMEM;
> +			goto exit;

You're exiting the function with lock_mount_hash() held...

> +		}
> +
> +		bmnt->mnt = mntget(&mnt->mnt);
> +		list_add_tail(&bmnt->list, &mnt_to_update);
> +	}
> +	unlock_mount_hash();

You've dropped unlock_mount_hash() and the function doesn't hold
namespace_lock() and isn't documented as requiring the caller to hold
it. And the later patch that uses this doesn't afaict (although I
haven't looked at any of the fuse specific stuff). So this is open to
all sorts of races with mount and unmount now. This needs an explanation
why that doesn't matter.

> +
> +	/* TODO: get rid of this limitation */

Confused about this comment.

> +	if (!root_mnt) {
> +		err = -ENOENT;
> +		goto exit;
> +	}
> +
> +	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {

No one else can access that list so list_for_each_entry_safe() seems
pointless.

> +		struct vfsmount *cur_mnt = bmnt->mnt;
> +		struct path path;
> +		struct dentry *old_root;
> +		char *p;
> +
> +		p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> +		if (IS_ERR(p))
> +			goto exit;
> +
> +		/* TODO: are these lookup flags fully safe and correct? */
> +		err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> +				p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> +		if (err)
> +			goto exit;
> +
> +		/* replace bindmount root dentry */
> +		lock_mount_hash();
> +		old_root = cur_mnt->mnt_root;
> +		cur_mnt->mnt_root = dget(path.dentry);

mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
it has been set and a lot of code just assumes that it's stable.

Top of my hat, since you're not holding namespace_lock() mount
propagation can be going on concurrently so propagate_one() can check if
(!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
updating it. A lot of code could actually be touching mnt->mnt_root
while you're updating it.

There's probably a lot more issues with this I'm just not seeing without
spending more time on this. But NAK on this as it stands. Sorry.

> +		dput(old_root);
> +		unlock_mount_hash();
> +
> +		path_put(&path);
> +	}
> +
> +exit:
> +	free_page((unsigned long) buf);
> +	mntput(root_mnt);
> +	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> +		list_del(&bmnt->list);
> +		mntput(bmnt->mnt);
> +		kfree(bmnt);
> +	}
> +
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);