[CRIU] [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items

Yonghong Song yhs at fb.com
Thu Nov 18 20:50:29 MSK 2021



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This patch adds eBPF iterator for epoll items (epitems) registered in an
> epoll instance. It gives access to the eventpoll ctx, and the registered
> epoll item (struct epitem). This allows the iterator to inspect the
> registered file and be able to use others iterators to associate it with
> a task's fdtable.
> 
> The primary usecase this is enabling is expediting existing eventpoll
> checkpoint/restore support in the CRIU project. This iterator allows us
> to switch from a worst case O(n^2) algorithm to a single O(n) pass over
> task and epoll registered descriptors.
> 
> We also make sure we're iterating over a live file, one that is not
> going away. The case we're concerned about is a file that has its
> f_count as zero, but is waiting for iterator bpf_seq_read to release
> ep->mtx, so that it can remove its epitem. Since such a file will
> disappear once iteration is done, and it is being destructed, we use
> get_file_rcu to ensure it is alive when invoking the BPF program.
> 
> Getting access to a file that is going to disappear after iteration
> is not useful anyway. This does have a performance overhead however
> (since file reference will be raised and dropped for each file).
> 
> The rcu_read_lock around get_file_rcu isn't strictly required for
> lifetime management since fput path is serialized on ep->mtx to call
> ep_remove, hence the epi->ffd.file pointer remains stable during our
> seq_start/seq_stop bracketing.
> 
> To be able to continue back from the position we were iterating, we
> store the epi->ffi.fd and use ep_find_tfd to find the target file again.
> It would be more appropriate to use both struct file pointer and fd
> number to find the last file, but see below for why that cannot be done.
> 
> Taking reference to struct file and walking RB-Tree to find it again
> will lead to reference cycle issue if the iterator after partial read
> takes reference to socket which later is used in creating a descriptor
> cycle using SCM_RIGHTS. An example that was encountered when working on
> this is mentioned below.
> 
>    Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
>    ITER.
>    Let SK1 be registered in EP, then on a partial read it is possible
>    that ITER returns from read and takes reference to SK1 to be able to
>    find it later in RB-Tree and continue the iteration.  If SK1 sends
>    ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using
>    SCM_RIGHTS, and both fds are not consumed on the corresponding receive
>    ends, a cycle is created.  When all of SK1, SK2, EP, and ITER are
>    closed, SK1's receive queue holds reference to SK2, and SK2's receive
>    queue holds reference to ITER, which holds a reference to SK1.
>    All file descriptors except EP leak.
> 
> To resolve it, we would need to hook into the Unix Socket GC mechanism,
> but the alternative of using ep_find_tfd is much more simpler. The
> finding of the last position in face of concurrent modification of the
> epoll set is at best an approximation anyway. For the case of CRIU, the
> epoll set remains stable.
> 
> Cc: Alexander Viro <viro at zeniv.linux.org.uk>
> Cc: linux-fsdevel at vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor at gmail.com>
> ---
>   fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
>   include/linux/bpf.h            |   5 +-
>   include/uapi/linux/bpf.h       |   3 +
>   tools/include/uapi/linux/bpf.h |   3 +
>   4 files changed, 205 insertions(+), 2 deletions(-)
> 
[...]
> +
> +static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
> +	.seq_ops          = &bpf_epoll_seq_ops,
> +	.init_seq_private = bpf_epoll_init_seq,
> +	.seq_priv_size    = sizeof(struct bpf_epoll_iter_seq_info),
> +};
> +
> +static struct bpf_iter_reg epoll_reg_info = {
> +	.target            = "epoll",
> +	.feature           = BPF_ITER_RESCHED,
> +	.attach_target     = bpf_epoll_iter_attach,
> +	.detach_target     = bpf_epoll_iter_detach,

implement show_fdinfo and fill_link_info?

There are some bpftool work needed as well to show the information
in user space. An example is e60495eafdba ("bpftool: Implement 
link_query for bpf iterators").


> +	.ctx_arg_info_size = 2,
> +	.ctx_arg_info = {
> +		{ offsetof(struct bpf_iter__epoll, ep),
> +		  PTR_TO_BTF_ID },
> +		{ offsetof(struct bpf_iter__epoll, epi),
> +		  PTR_TO_BTF_ID_OR_NULL },
> +	},
> +	.seq_info	   = &bpf_epoll_seq_info,
> +};
> +
> +static int __init epoll_iter_init(void)
> +{
> +	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
> +	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
> +	return bpf_iter_reg_target(&epoll_reg_info);
> +}
> +late_initcall(epoll_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fe7b499da781..eb1c9acdc40b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   struct io_ring_ctx;
>   struct bpf_iter_aux_info {
>   	struct bpf_map *map;
> -	struct io_ring_ctx *ctx;
> +	union {
> +		struct io_ring_ctx *ctx;
> +		struct file *ep;
> +	};

You changed to union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info".
This should make code simpler and easy to understand.

>   };
>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> 


More information about the CRIU mailing list