[CRIU] [RFC PATCH 00/20] CRIU support for ROCm

Felix Kuehling Felix.Kuehling at amd.com
Sat May 1 04:58:25 MSK 2021


A whitepaper describing our design can be found here:
https://github.com/RadeonOpenCompute/criu/blob/criu-dev/test/others/ext-kfd/README.md

Most of the patches are the implementation of our device file plugin
code. We are most interested in feedback on the few patches that modify
core CRIU code. I'm pretty sure we don't know what we're doing here, so
your insights will be appreciated:

01/20 - Treat some unsupported VMAs as regular
03/20 - Add offset and file path plugin
07/20 - Introduce restore late stage hook
20/20 - *RFC* Don't cache fd for amdgpu devices

The corresponding kernel patch series will be discussed on
amd-gfx at lists.freedesktop.org and dri-devel at lists.freedesktop.org. The
KFD patches are also available on GitHub:
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip

This patch series is also on GitHub:
https://github.com/RadeonOpenCompute/criu/commits/criu-dev

David Yat Sin (7):
  criu/plugin: Add support for dumping and restoring queues
  criu/plugin: Support larger memory footprints
  criu/plugin: Dump and restore events
  criu/plugin: Re-adjust doorbell offset for queues
  criu/plugin: Implement system topology parsing
  criu/plugin: Remap GPUs on checkpoint restore
  criu/plugin: Add parameters to override mapping

Rajneesh Bhardwaj (13):
  criu/parse: Treat some unsupported VMAs as regular
  criu/plugin: Initialize AMD KFD header
  criu/files-reg: Add offset and file path plugin
  criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
  criu/plugin: Optimize the proto image size
  criu/plugin: optimization for large bar read
  criu/restore: Introduce restore late stage hook
  criu/plugin: Implement restore late hook for kfd
  criu/plugin: dump debug logs selectively
  criu/plugin: Add initial documentation for ROCm support.
  criu/plugin: Pytorch container with criu
  criu/plugin: Dockerfile for AMD criu repo
  criu/files: *RFC* Don't cache fd for amdgpu devices

 Documentation/Makefile             |    1 +
 Documentation/kfd_plugin.txt       |   79 ++
 criu/cr-restore.c                  |   15 +
 criu/file-ids.c                    |   11 +-
 criu/files-reg.c                   |   18 +
 criu/include/criu-plugin.h         |   12 +
 criu/include/proc_parse.h          |    3 +
 criu/plugin.c                      |    2 +
 criu/proc_parse.c                  |   51 +-
 test/others/ext-kfd/Dockerfile     |   95 ++
 test/others/ext-kfd/Dockerfile.AMD |  114 ++
 test/others/ext-kfd/Makefile       |   13 +
 test/others/ext-kfd/criu-kfd.proto |  107 ++
 test/others/ext-kfd/kfd_ioctl.h    |  692 ++++++++++
 test/others/ext-kfd/kfd_plugin.c   | 1917 ++++++++++++++++++++++++++++
 15 files changed, 3124 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/kfd_plugin.txt
 create mode 100644 test/others/ext-kfd/Dockerfile
 create mode 100644 test/others/ext-kfd/Dockerfile.AMD
 create mode 100644 test/others/ext-kfd/Makefile
 create mode 100644 test/others/ext-kfd/criu-kfd.proto
 create mode 100644 test/others/ext-kfd/kfd_ioctl.h
 create mode 100644 test/others/ext-kfd/kfd_plugin.c

-- 
2.17.1


