[Devel] [PATCH] inotify: Extend ioctl to allow to request id of new watch descriptor
Kirill Tkhai
ktkhai at virtuozzo.com
Wed Feb 14 14:34:53 MSK 2018
https://jira.sw.ru/browse/PSBM-81411
On 14.02.2018 14:34, Kirill Tkhai wrote:
> [This patch has been accepted in mainstream and is going to the main
> tree. As there is no idr_set_cursor() in 3.10, I've used data->idr.cur
> directly in the port.]
>
> A watch descriptor is the id of a watch created by inotify_add_watch().
> It is allocated in inotify_add_to_idr(), and the numbering starts
> from 1. Every new inotify watch obtains the next available number
> (usually, old + 1), as served by idr_alloc_cyclic().
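>
> For example, on a fresh instance the ids are handed out sequentially
> (a minimal sketch, assuming <sys/inotify.h> and hypothetical paths):
>
>         int fd = inotify_init1(IN_CLOEXEC);
>         int wd1 = inotify_add_watch(fd, "/tmp/a", IN_CREATE); /* usually 1 */
>         int wd2 = inotify_add_watch(fd, "/tmp/b", IN_CREATE); /* usually 2 */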
>
> The CRIU (Checkpoint/Restore In Userspace) project supports inotify
> files and restores watch descriptors with the same numbers they had
> before dump. Since there was no kernel support, we had to use a loop
> to add a watch with a specific descriptor id:
>
> while (1) {
>         int wd;
>
>         wd = inotify_add_watch(inotify_fd, path, mask);
>         if (wd < 0) {
>                 break;
>         } else if (wd == desired_wd_id) {
>                 ret = 0;
>                 break;
>         }
>
>         inotify_rm_watch(inotify_fd, wd);
> }
>
> (You may find the actual code at the link below:
> https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
>
> The loop is suboptimal and very expensive, but since there was no better
> kernel support, it was the only way to restore the descriptor. Happily,
> we had mostly met descriptors with small ids, and this approach worked
> somehow.
>
> But recently, containers with big inotify watch descriptors have begun
> to appear, and this approach stopped working at all. When the descriptor
> id is something like 0x34d71d6, the restoring process spins in a busy
> loop for a long time, the restore hangs, and the resulting delay of
> migration from node to node can easily be observed.
>
> This patch aims to solve this problem. It introduces a new ioctl,
> INOTIFY_IOC_SETNEXTWD, which allows userspace to request the id of the
> next created watch descriptor. It simply calls the idr_set_cursor()
> primitive to populate idr::idr_next, so that the next idr_alloc_cyclic()
> allocation will return this id, if it is not occupied. This is the way
> some other resources are already restored from userspace. For example,
> /proc/sys/kernel/ns_last_pid works the same way for task pids.
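>
> With the ioctl, the restore takes a constant number of syscalls. A
> sketch (not the exact CRIU code; error handling omitted, variables as
> in the hypothetical loop above):
>
>         if (ioctl(inotify_fd, INOTIFY_IOC_SETNEXTWD, desired_wd_id) == 0) {
>                 wd = inotify_add_watch(inotify_fd, path, mask);
>                 /* wd == desired_wd_id, unless that id is already occupied */
>                 ret = (wd == desired_wd_id) ? 0 : -1;
>         }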
>
> The new code is under the CONFIG_CHECKPOINT_RESTORE #define, so small
> systems may exclude it.
>
> Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
> Reviewed-by: Cyrill Gorcunov <gorcunov at openvz.org>
> Reviewed-by: Andrew Morton <akpm at linux-foundation.org>
> ---
> fs/notify/inotify/inotify_user.c | 14 ++++++++++++++
> include/uapi/linux/inotify.h | 8 ++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index cc39a5d84a0e..fe93507e4d98 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -309,6 +309,20 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> spin_unlock(&group->notification_lock);
> ret = put_user(send_len, (int __user *) p);
> break;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> + case INOTIFY_IOC_SETNEXTWD:
> + ret = -EINVAL;
> + if (arg >= 1 && arg <= INT_MAX) {
> + struct inotify_group_private_data *data;
> +
> + data = &group->inotify_data;
> + spin_lock(&data->idr_lock);
> + data->idr.cur = (unsigned int)arg;
> + spin_unlock(&data->idr_lock);
> + ret = 0;
> + }
> + break;
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
> }
>
> return ret;
> diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h
> index e6bf35b2dd34..ce8ac99480fa 100644
> --- a/include/uapi/linux/inotify.h
> +++ b/include/uapi/linux/inotify.h
> @@ -70,5 +70,13 @@ struct inotify_event {
> #define IN_CLOEXEC O_CLOEXEC
> #define IN_NONBLOCK O_NONBLOCK
>
> +/*
> + * ioctl numbers: inotify uses 'I' prefix for all ioctls,
> + * except historical FIONREAD, which is based on 'T'.
> + *
> + * INOTIFY_IOC_SETNEXTWD: set desired number of next created
> + * watch descriptor.
> + */
> +#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, __s32)
>
> #endif /* _UAPI_LINUX_INOTIFY_H */
>