[CRIU] Proposal to add CRIU support to DRM render nodes
Felix Kuehling
felix.kuehling at amd.com
Thu Dec 7 00:23:21 MSK 2023
Executive Summary: We need to add CRIU support to DRM render nodes in
order to maintain CRIU support for ROCm applications once they start
relying on render nodes for more GPU memory management. In this email
I'm providing some background on why we are doing this, and outlining some
of the problems we need to solve to checkpoint and restore render node
state and shared memory (DMABuf) state. I have some thoughts on the API
design, leaning on what we did for KFD, but would like to get feedback
from the DRI community regarding that API and to what extent there is
interest in making that generic.
We are working on using DRM render nodes for virtual address mappings in
ROCm applications to implement the CUDA11-style VM API and improve
interoperability between graphics and compute. This uses DMABufs for
sharing buffer objects between KFD and multiple render node devices, as
well as between processes. In the long run this also provides a path to
moving all or most memory management from the KFD ioctl API to libdrm.
Once ROCm user mode starts using render nodes for virtual address
management, that creates a problem for checkpointing and restoring ROCm
applications with CRIU. Currently there is no support for checkpointing
and restoring render node state, other than CPU virtual address
mappings. Support will be needed for checkpointing GEM buffer objects
and handles, their GPU virtual address mappings and memory sharing
relationships between devices and processes.
Eventually, if full CRIU support for graphics applications is desired,
more state would need to be captured, including scheduler contexts and
BO lists. Most of this state is driver-specific.
After some internal discussions we decided to take our design process
public as this potentially touches DRM GEM and DMABuf APIs and may have
implications for other drivers in the future.
One basic question before going into any API details: Is there a desire
to have CRIU support for other DRM drivers?
With that out of the way, some considerations for a possible DRM CRIU
API (either generic or AMDGPU driver specific): The API goes through
several phases during checkpoint and restore:
Checkpoint:
1. Process-info (enumerates objects and sizes so user mode can allocate
memory for the checkpoint, stops execution on the GPU)
2. Checkpoint (store object metadata for BOs, queues, etc.)
3. Unpause (resumes execution after the checkpoint is complete)
Restore:
1. Restore (restore objects, VMAs are not in the right place at this time)
2. Resume (final fixups after the VMAs are sorted out, resume execution)
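To make the ordering of these phases concrete, here is a minimal sketch in C
of how a driver could track and validate them. The names (criu_op,
criu_state, criu_step) are hypothetical, loosely following the five phases
listed above, and rejecting out-of-order calls is an assumption made for
illustration, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operations, one per phase listed above. */
enum criu_op {
    CRIU_OP_PROCESS_INFO,
    CRIU_OP_CHECKPOINT,
    CRIU_OP_UNPAUSE,
    CRIU_OP_RESTORE,
    CRIU_OP_RESUME,
};

/* Hypothetical per-process state tracked between calls. */
enum criu_state {
    CRIU_IDLE,      /* normal execution */
    CRIU_PAUSED,    /* process-info has stopped execution on the GPU */
    CRIU_RESTORING, /* objects restored, VMAs not yet in their final place */
};

/* Apply one operation. Returns false and leaves *state unchanged if the
 * operation is not valid in the current state. */
bool criu_step(enum criu_state *state, enum criu_op op)
{
    assert(state != NULL);
    switch (op) {
    case CRIU_OP_PROCESS_INFO:
        if (*state != CRIU_IDLE)
            return false;
        *state = CRIU_PAUSED;
        return true;
    case CRIU_OP_CHECKPOINT:
        /* Metadata can only be stored while execution is stopped. */
        return *state == CRIU_PAUSED;
    case CRIU_OP_UNPAUSE:
        if (*state != CRIU_PAUSED)
            return false;
        *state = CRIU_IDLE;
        return true;
    case CRIU_OP_RESTORE:
        if (*state != CRIU_IDLE)
            return false;
        *state = CRIU_RESTORING;
        return true;
    case CRIU_OP_RESUME:
        if (*state != CRIU_RESTORING)
            return false;
        *state = CRIU_IDLE;
        return true;
    }
    return false;
}
```

In this sketch a checkpoint request without a preceding process-info call
fails, and resume is only accepted after restore.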
For some more background about our implementation in KFD, you can refer
to this whitepaper:
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
I see the following potential objections to a KFD-style CRIU API in DRM
render nodes, and will address each of them in more detail below:
* Opaque information in the checkpoint data that user mode can't
interpret or do anything with
* A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
* Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode
# Opaque information in the checkpoint
This comes out of ABI compatibility considerations. Adding any new
objects or attributes to the driver/HW state that needs to be
checkpointed could potentially break the ABI of the CRIU
checkpoint/restore ioctl if the plugin needs to parse that information.
Therefore, much of the information in our KFD CRIU ioctl API is opaque.
It is written by kernel mode in the checkpoint, it is consumed by kernel
mode when restoring the checkpoint, but user mode doesn't care about the
contents or binary layout, so there is no user mode ABI to break. This
is how we were able to maintain CRIU support when we added the SVM API
to KFD without changing the CRIU plugin and without breaking our ABI.
Opaque information may also lend itself to API abstraction, if this
becomes a generic DRM API with driver-specific callbacks that fill in
HW-specific opaque data.
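As a rough illustration of what "opaque" means for the user mode side, the
following C sketch copies one checkpoint record without interpreting its
payload. The record layout (struct criu_record_hdr with object_type and
priv_size) and the function name are hypothetical and are not the existing
KFD layout. The only assumption is that user mode can learn the payload size
from a small fixed header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed header: the only part user mode interprets. */
struct criu_record_hdr {
    uint32_t object_type; /* driver-defined, e.g. BO or queue */
    uint32_t priv_size;   /* number of opaque bytes following the header */
};

/* User mode side: duplicate one record (header plus opaque payload) so it
 * can be written to the checkpoint image. The payload is never parsed.
 * Returns a malloc'd copy (caller frees) or NULL on allocation failure;
 * *out_len receives the total record length. */
void *criu_copy_record(const void *record, size_t *out_len)
{
    struct criu_record_hdr hdr;
    size_t len;
    void *copy;

    assert(record != NULL && out_len != NULL);
    memcpy(&hdr, record, sizeof(hdr));
    len = sizeof(hdr) + hdr.priv_size;
    copy = malloc(len);
    if (copy)
        memcpy(copy, record, len);
    *out_len = len;
    return copy;
}
```

If kernel mode later grows the payload, only priv_size changes and this
user mode code keeps working unmodified.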
# Second API for creating objects
Creating BOs and other objects when restoring a checkpoint needs more
information than the usual BO alloc and similar APIs provide. For
example, we need to restore BOs with the same GEM handles so that user
mode can continue using those handles after resuming execution. If BOs
are shared through DMABufs without dynamic attachment, we need to
restore pinned BOs as pinned. Validation of virtual addresses and
handling MMU notifiers must be suspended until the virtual address space
is restored. For user mode queues we need to save and restore a lot of
queue execution state so that execution can resume cleanly.
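The GEM handle requirement can be illustrated with a small user-space model
of a handle table in C. This is only a sketch: struct handle_table,
handle_alloc and handle_restore are hypothetical names, and the fixed-size
array stands in for whatever ID allocator the kernel actually uses. The point
is that the regular path lets the table pick a handle, while the restore
path must be able to request a specific one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES 64

/* Toy handle table: index is the handle, 0 is reserved as invalid. */
struct handle_table {
    void *obj[MAX_HANDLES];
};

/* Regular allocation path: the table picks the lowest free handle.
 * Returns the handle, or 0 if the table is full. */
uint32_t handle_alloc(struct handle_table *t, void *obj)
{
    uint32_t h;

    assert(t != NULL && obj != NULL);
    for (h = 1; h < MAX_HANDLES; h++) {
        if (!t->obj[h]) {
            t->obj[h] = obj;
            return h;
        }
    }
    return 0;
}

/* Restore path: the caller dictates the handle recorded in the checkpoint.
 * Returns 0 on success, -1 if the handle is out of range or already used. */
int handle_restore(struct handle_table *t, uint32_t handle, void *obj)
{
    assert(t != NULL && obj != NULL);
    if (handle == 0 || handle >= MAX_HANDLES || t->obj[handle])
        return -1;
    t->obj[handle] = obj;
    return 0;
}
```

A regular allocation API has no way to express the second operation, which
is one reason a separate restore path is hard to avoid.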
# Restoring buffer sharing relationships
Different GEM handles in different render nodes and processes can refer
to the same underlying shared memory, either by directly pointing to the
same GEM object, or by creating an import attachment that may get its SG
tables invalidated and updated dynamically through dynamic attachment
callbacks. In the latter case it's obvious who is the exporter and who
is the importer. In the former case, either one could be the exporter,
and it's not clear who would need to create the BO and who would need to
import it when restoring the checkpoint. To further complicate things,
multiple processes in a checkpoint get restored concurrently. So there
is no guarantee that an exporter has restored a shared BO at the time an
importer is trying to restore its import.
A proposal to deal with these problems would be to treat importers and
exporters the same. Whoever restores first ends up creating the BO and
potentially attaching to it. The other process(es) can find a BO that
was already restored by another process by looking it up with a unique
ID that could be based on the DMABuf inode number. An alternative would
be to restore BOs in two passes:
1. Restore exported BOs
2. Restore imports
This would need some inter-process synchronization in CRIU itself between
the two passes, which may require changes in the core CRIU, outside our plugin.
Both approaches depend on identifying BOs with some unique ID that could
be based on the DMABuf inode number in the checkpoint. However, we would
need to identify the processes in the same restore session, possibly
based on parent/child process relationships, to create a scope where
those IDs are valid during restore.
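The first approach (whoever restores first creates the BO) could look
roughly like the following C sketch. All names (struct shared_bo, struct
restore_session, restore_shared_bo) are hypothetical, the fixed-size table is
a stand-in for a real lookup structure, and locking is omitted. A real
implementation would need to serialize this lookup, since processes are
restored concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHARED_BOS 32

struct shared_bo {
    uint64_t id;   /* unique ID from the checkpoint, e.g. DMABuf inode number */
    unsigned refs; /* number of restored handles referring to this BO */
    bool in_use;
};

/* Scope in which the IDs are valid: one restore session. */
struct restore_session {
    struct shared_bo bos[MAX_SHARED_BOS];
};

/* Find the BO with the given ID, or create it if this caller is the first
 * to restore it. *created tells the caller which case applied. Returns
 * NULL if the table is full. Not thread-safe as written. */
struct shared_bo *restore_shared_bo(struct restore_session *s, uint64_t id,
                                    bool *created)
{
    struct shared_bo *free_slot = NULL;
    size_t i;

    assert(s != NULL && created != NULL);
    for (i = 0; i < MAX_SHARED_BOS; i++) {
        struct shared_bo *bo = &s->bos[i];

        if (bo->in_use && bo->id == id) {
            bo->refs++;          /* already restored by someone else */
            *created = false;
            return bo;
        }
        if (!bo->in_use && !free_slot)
            free_slot = bo;
    }
    if (!free_slot)
        return NULL;
    free_slot->id = id;          /* first restorer creates the BO */
    free_slot->refs = 1;
    free_slot->in_use = true;
    *created = true;
    return free_slot;
}
```

Exporter and importer call the same function here, so it does not matter
which of them gets restored first.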
Finally, we would also need to checkpoint and restore DMABuf file
descriptors themselves. These are anonymous file descriptors. The CRIU
plugin could probably be taught to recreate them from the original
exported BO based on the inode number that could be queried with fstat
in the checkpoint. It would need help from the render node CRIU API to
find the right BO from the inode, which may be from a different process
in the same restore session.
Regards,
Felix