[CRIU] [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
Alexei Starovoitov
alexei.starovoitov at gmail.com
Fri Nov 19 01:02:26 MSK 2021
On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for buffers registered in io_uring ctx.
> It gives access to the ctx, the index of the registered buffer, and a
> pointer to the io_uring_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like exact struct page composing
> the registered buffer).
>
> The primary usecase this is enabling is checkpoint/restore support.
>
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file. We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
>
> This can lead to a deadlock scenario described below:
>
>   CPU 0                               CPU 1
>
>   vfs_read
>   mutex_lock(&seq_file->lock)         io_read
>                                       mutex_lock(&ctx->uring_lock)
>   mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>                                       mutex_lock(&seq_file->lock)
>
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>   io_uring_enter
>   mutex_lock(&ctx->uring_lock) <-----------.
>   io_read                                   \
>    seq_read                                  \
>     mutex_lock(&seq_file->lock)              /
>     mutex_lock(&ctx->uring_lock) # deadlock-`
>
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
>
> In the future, this iterator will be extended to directly support
> iteration of bvec Flexible Array Member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if VMA is
> destroyed after pinning pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All this is
> out of scope for the current series however, but builds upon this
> iterator.
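
For reference, the lock ordering described in the quoted commit log can be
sketched in userspace. This is only an illustration, not code from the patch:
the two pthread mutexes and the function names iter_read() and
uring_enter_then_read() are hypothetical stand-ins for seq_file->lock,
ctx->uring_lock, and the kernel paths named above. It assumes the default
(non-recursive) mutex type, for which pthread_mutex_trylock() returns EBUSY
when the lock is already held.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-ins for seq_file->lock and ctx->uring_lock. */
static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t uring_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models the iterator read path: seq_lock is taken first, then uring_lock.
 * Because io_uring takes the two locks in the opposite order, the second
 * lock is only tried; on contention the read fails with -EDEADLK instead
 * of blocking.
 */
static int iter_read(void)
{
	int ret = 0;

	pthread_mutex_lock(&seq_lock);
	if (pthread_mutex_trylock(&uring_lock) != 0) {
		ret = -EDEADLK;
		goto out;
	}
	/* ... the registered buffers would be walked here ... */
	pthread_mutex_unlock(&uring_lock);
out:
	pthread_mutex_unlock(&seq_lock);
	return ret;
}

/*
 * Models the recursive case: the caller already holds uring_lock (as
 * io_uring_enter would) and then reads the iterator attached to itself.
 */
static int uring_enter_then_read(void)
{
	int ret;

	pthread_mutex_lock(&uring_lock);
	ret = iter_read();
	pthread_mutex_unlock(&uring_lock);
	return ret;
}
```

In this sketch an uncontended iter_read() returns 0, while
uring_enter_then_read() returns -EDEADLK and releases both locks on the way
out, so a later read can succeed.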
From the BPF infra perspective these new iterators fit very well, and
I don't see any issues maintaining this interface while the kernel keeps
changing, but this commit log and the shallowness of the selftests
make me question the feasibility of this approach, in particular with io_uring.
Is it even possible to scan all the internal bits of io_uring and reconstruct
it later? The bpf iter is only the read part. Don't you need the write part
for CRIU? Even for reads only... io_uring has complex inner state,
like bpf itself, which cannot realistically be CRIU-ed.
I don't think we can merge this in pieces. We need to wait until there is
a full working CRIU framework that uses these new iterators.