[CRIU] [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items
Yonghong Song
yhs at fb.com
Thu Nov 18 20:50:29 MSK 2021
On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This patch adds an eBPF iterator for epoll items (epitems) registered in an
> epoll instance. It gives access to the eventpoll ctx and the registered
> epoll item (struct epitem). This allows the iterator to inspect the
> registered file and to use other iterators to associate it with
> a task's fdtable.
>
> The primary use case this enables is expediting the existing eventpoll
> checkpoint/restore support in the CRIU project. This iterator allows us
> to switch from a worst case O(n^2) algorithm to a single O(n) pass over
> task and epoll registered descriptors.
>
> We also make sure we're iterating over a live file, one that is not
> going away. The case we're concerned about is a file that has its
> f_count as zero, but is waiting for iterator bpf_seq_read to release
> ep->mtx, so that it can remove its epitem. Since such a file will
> disappear once iteration is done, and it is being destructed, we use
> get_file_rcu to ensure it is alive when invoking the BPF program.
>
> Getting access to a file that is going to disappear after iteration
> is not useful anyway. This does have a performance overhead, however,
> since a file reference will be raised and dropped for each file.
>
> The rcu_read_lock around get_file_rcu isn't strictly required for
> lifetime management since fput path is serialized on ep->mtx to call
> ep_remove, hence the epi->ffd.file pointer remains stable during our
> seq_start/seq_stop bracketing.
>
> To be able to continue from the position we were iterating, we
> store epi->ffd.fd and use ep_find_tfd to find the target file again.
> It would be more appropriate to use both the struct file pointer and the
> fd number to find the last file, but see below for why that cannot be done.
>
> Taking a reference to the struct file and walking the RB-tree to find
> it again would lead to a reference cycle if the iterator, after a
> partial read, takes a reference to a socket which is later used in
> creating a descriptor cycle using SCM_RIGHTS. An example encountered
> while working on this is described below.
>
> Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
> ITER.
> Let SK1 be registered in EP. On a partial read it is possible
> that ITER returns from read and takes a reference to SK1 to be able to
> find it later in the RB-tree and continue the iteration. If ITER is sent
> from SK1 to SK2 using SCM_RIGHTS, SK2 is sent from SK2 to SK1 using
> SCM_RIGHTS, and neither fd is consumed on the corresponding receive
> end, a cycle is created. When all of SK1, SK2, EP, and ITER are
> closed, SK1's receive queue holds a reference to SK2, SK2's receive
> queue holds a reference to ITER, and ITER holds a reference to SK1.
> All file descriptors except EP leak.
>
> To resolve this, we would need to hook into the Unix socket GC
> mechanism; the alternative of using ep_find_tfd is much simpler.
> Finding the last position in the face of concurrent modification of the
> epoll set is at best an approximation anyway. For the case of CRIU, the
> epoll set remains stable.
>
> Cc: Alexander Viro <viro at zeniv.linux.org.uk>
> Cc: linux-fsdevel at vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor at gmail.com>
> ---
> fs/eventpoll.c | 196 ++++++++++++++++++++++++++++++++-
> include/linux/bpf.h | 5 +-
> include/uapi/linux/bpf.h | 3 +
> tools/include/uapi/linux/bpf.h | 3 +
> 4 files changed, 205 insertions(+), 2 deletions(-)
>
[...]
> +
> +static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
> + .seq_ops = &bpf_epoll_seq_ops,
> + .init_seq_private = bpf_epoll_init_seq,
> + .seq_priv_size = sizeof(struct bpf_epoll_iter_seq_info),
> +};
> +
> +static struct bpf_iter_reg epoll_reg_info = {
> + .target = "epoll",
> + .feature = BPF_ITER_RESCHED,
> + .attach_target = bpf_epoll_iter_attach,
> + .detach_target = bpf_epoll_iter_detach,
Could you also implement show_fdinfo and fill_link_info?
There is also some bpftool work needed to show the information
in user space. An example is e60495eafdba ("bpftool: Implement
link_query for bpf iterators").
> + .ctx_arg_info_size = 2,
> + .ctx_arg_info = {
> + { offsetof(struct bpf_iter__epoll, ep),
> + PTR_TO_BTF_ID },
> + { offsetof(struct bpf_iter__epoll, epi),
> + PTR_TO_BTF_ID_OR_NULL },
> + },
> + .seq_info = &bpf_epoll_seq_info,
> +};
> +
> +static int __init epoll_iter_init(void)
> +{
> + epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
> + epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
> + return bpf_iter_reg_target(&epoll_reg_info);
> +}
> +late_initcall(epoll_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fe7b499da781..eb1c9acdc40b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> struct io_ring_ctx;
> struct bpf_iter_aux_info {
> struct bpf_map *map;
> - struct io_ring_ctx *ctx;
> + union {
> + struct io_ring_ctx *ctx;
> + struct file *ep;
> + };
You changed this to a union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info".
That should make the code simpler and easier to understand.
> };
>
> typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
> struct {
> __u32 io_uring_fd;
> } io_uring;
> + struct {
> + __u32 epoll_fd;
> + } epoll;
> };
>
> /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
> struct {
> __u32 io_uring_fd;
> } io_uring;
> + struct {
> + __u32 epoll_fd;
> + } epoll;
> };
>
> /* BPF syscall commands, see bpf(2) man-page for more details. */
>