[CRIU] [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers

Yonghong Song yhs at fb.com
Thu Nov 18 22:13:02 MSK 2021



On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This change adds eBPF iterator for buffers registered in io_uring ctx.
>>> It gives access to the ctx, the index of the registered buffer, and a
>>> pointer to the io_uring_ubuf itself. This allows the iterator to save
>>> info related to buffers added to an io_uring instance that isn't easy
>>> to export using the fdinfo interface (such as the exact struct pages
>>> composing the registered buffer).
>>>
>>> The primary use case this enables is checkpoint/restore support.
>>>
>>> Note that we need to use mutex_trylock when the file is read from, in
>>> the seq_start functions, as the lock ordering is the opposite of what
>>> it would be when an io_uring operation reads the same file.  We take
>>> seq_file->lock, then ctx->uring_lock, while io_uring would first take
>>> ctx->uring_lock and then seq_file->lock for the same ctx.
>>>
>>> This can lead to a deadlock scenario described below:
>>>
>>>         CPU 0				CPU 1
>>>
>>>         vfs_read
>>>         mutex_lock(&seq_file->lock)	io_read
>>> 					mutex_lock(&ctx->uring_lock)
>>>         mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>>> 					mutex_lock(&seq_file->lock)
>>
>> It is not clear which mutex_lock switched to mutex_trylock.
> 
> The one in vfs_read.
> 
>> From the example below, it looks like &ctx->uring_lock. But if this is
>> the case, we could have a deadlock, right?
>>
> 
> Sorry, I will make the commit message clearer in the next version.
> 
> The sequence on CPU 0 is for a normal read(2) on the iterator.
> For CPU 1, it is an io_uring instance trying to do the same on an iterator
> attached to itself.
> 
> So CPU 0 does
> 
> sys_read
> vfs_read
>   bpf_seq_read
>   mutex_lock(&seq_file->lock)    # A
>    io_uring_buf_seq_start
>    mutex_lock(&ctx->uring_lock)  # B
> 
> and CPU 1 does
> 
> io_uring_enter
> mutex_lock(&ctx->uring_lock)    # B
>   io_read
>    bpf_seq_read
>    mutex_lock(&seq_file->lock)   # A
>    ...
> 
> Since the order of locks is opposite, it can deadlock. So I switched the
> mutex_lock in io_uring_buf_seq_start to mutex_trylock, so it can return an
> error in this case; it then releases seq_file->lock, and CPU 1 can make
> progress.
> 
> The second problem without the use of trylock is described below (for the same
> case of io_uring reading from an iterator attached to itself).
> 
> Let me know if I missed something.

Thanks for the explanation. The above diagram is much better.

> 
>>>
>>> The trylock also protects the case where io_uring tries to read from an
>>> iterator attached to itself (same ctx), where the order of locks would
>>> be:
>>>    io_uring_enter
>>>     mutex_lock(&ctx->uring_lock) <-----------.
>>>     io_read				    \
>>>      seq_read				     \
>>>       mutex_lock(&seq_file->lock)		     /
>>>       mutex_lock(&ctx->uring_lock) # deadlock-`
>>>
>>> In both these cases (recursive read and contended uring_lock), -EDEADLK
>>> is returned to userspace.
>>>
>>> In the future, this iterator will be extended to directly support
>>> iteration of the bvec flexible array member, so that when there is no
>>> corresponding VMA that maps to the registered buffer (e.g. if the VMA is
>>> destroyed after pinning the pages), we are able to reconstruct the
>>> registration on restore by dumping the page contents and then replaying
>>> them into a temporary mapping used for registration later. All of this is
>>> out of scope for the current series, but builds upon this iterator.
>>>
>>> Cc: Jens Axboe <axboe at kernel.dk>
>>> Cc: Pavel Begunkov <asml.silence at gmail.com>
>>> Cc: io-uring at vger.kernel.org
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor at gmail.com>
>>> ---
>>>    fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>>>    include/linux/bpf.h            |   2 +
>>>    include/uapi/linux/bpf.h       |   3 +
>>>    tools/include/uapi/linux/bpf.h |   3 +
>>>    4 files changed, 187 insertions(+)
>>>
[...]
> 
>>> [...]
>>> +static struct bpf_iter_reg io_uring_buf_reg_info = {
>>> +	.target            = "io_uring_buf",
>>> +	.feature	   = BPF_ITER_RESCHED,
>>> +	.attach_target     = bpf_io_uring_iter_attach,
>>> +	.detach_target     = bpf_io_uring_iter_detach,
>>
>> Since you have this extra `io_uring_fd` for the iterator, you may want
>> to implement show_fdinfo and fill_link_info callback functions here.
>>
> 
> Ack, but some questions:
> 
> What should it have? E.g. it is easy to go from the map_id to a map fd if one
> wants access to the map attached to the iterator, but I am not sure how one
> can obtain more information about the target fd from the io_uring or epoll
> iterators.

Just to be clear, I am talking about the uapi struct bpf_link_info.
I agree that the fd is not really useful. So I guess it is up to you
whether you want to show the fd to the user or not. We can always
add it later if needed.
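As a sketch of what that could look like (purely illustrative, not part of the posted patch; struct bpf_iter_reg has carried show_fdinfo/fill_link_info hooks since the map element iterator, but any io_uring member of bpf_link_info would be a new uapi extension):

```c
/* Hypothetical callbacks for io_uring_buf_reg_info; nothing here is in
 * the posted patch. */
static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
					  struct seq_file *seq)
{
	/* The raw ctx pointer is not meaningful to userspace, so only a
	 * stable identifier (if one existed) would be worth printing. */
	seq_printf(seq, "target:\tio_uring_buf\n");
}

static int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
					    struct bpf_link_info *info)
{
	/* bpf_link_info currently has no io_uring member; exposing e.g.
	 * a ring identifier here would require extending the uapi. */
	return 0;
}
```

These would be wired up via the .show_fdinfo and .fill_link_info members of io_uring_buf_reg_info.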

> 
> Should I delegate to their show_fdinfo and dump using that?
> 
> The type/target is already available in link_info; I am not sure what other
> useful information could be added there that would allow obtaining the
> io_uring/epoll fd.
> 
>>> +	.ctx_arg_info_size = 2,
>>> +	.ctx_arg_info = {
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
>>> +		  PTR_TO_BTF_ID },
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
>>> +		  PTR_TO_BTF_ID_OR_NULL },
>>> +	},
>>> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
>>> +};
>>> +
>>> +static int __init io_uring_iter_init(void)
>>> +{
>>> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>>> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
>>> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
>>> +}
>>> +late_initcall(io_uring_iter_init);
>>> +
>>> +#endif
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 56098c866704..ddb9d4520a3f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>    	extern int bpf_iter_ ## target(args);			\
>>>    	int __init bpf_iter_ ## target(args) { return 0; }
>>> +struct io_ring_ctx;
>>>    struct bpf_iter_aux_info {
>>>    	struct bpf_map *map;
>>> +	struct io_ring_ctx *ctx;
>>>    };
>>
>> Can we use a union here? Note that in bpf_iter_link_info below, in
>> uapi/linux/bpf.h, map_fd and io_uring_fd are also in a union.
>>
> 
> So the reason I didn't use a union was the link->aux.map check in
> bpf_iter.c::__get_seq_info. Even if we switch to a union in bpf_iter_aux_info,
> we need some way to determine whether the link is for a map type, so maybe a
> string comparison there? Leaving it out of the union felt cleaner; also, I
> move both io_ring_ctx and the eventpoll file into a union in a later patch.

I see. The seq_ops for the map element iterator is different from the others:
its seq_ops comes not from the main iter registration but from map_ops.

I think your change is okay. But maybe add a comment in
struct bpf_iter_aux_info to explain why map is different from the others.

> 
>>>    typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>
> 
> --
> Kartikeya
> 
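For context, attaching such an iterator from userspace would presumably look like the following libbpf sketch. Note that this will not build against mainline headers, since the io_uring member of union bpf_iter_link_info exists only in this proposed series; prog_fd and ring_fd are assumed to come from earlier setup.

```c
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Attach an io_uring_buf iterator program to a given io_uring instance
 * and return a link fd (or a negative error). */
int attach_io_uring_buf_iter(int prog_fd, int ring_fd)
{
	union bpf_iter_link_info linfo = {};
	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts);

	linfo.io_uring.io_uring_fd = ring_fd;	/* field proposed in this patch */
	opts.iter_info = &linfo;
	opts.iter_info_len = sizeof(linfo);

	return bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
}
```

The returned link fd can then be passed to bpf_iter_create() to obtain the readable iterator fd that read(2) consumes.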

