[CRIU] Query on criu plugin for dumping device files

Bhardwaj, Rajneesh Rajneesh.Bhardwaj at amd.com
Fri Aug 21 04:33:58 MSK 2020


[AMD Public Use]

Hi CRIU team,

Further to the initial discussion on this thread (https://lists.openvz.org/pipermail/criu/2020-June/045030.html), I would like to ask the community for advice on some issues I am facing.
Here is a description of my scenario:
For my current, simplified use case there is a simple test application that opens the AMD KFD driver file descriptors using the ROCT library via https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167. For now I am not creating any user mode queues or allocating any memory on the GPU; the application just opens a device handle and closes it after some time. While my test app is running, I try to dump it using CRIU.

I have also implemented some skeleton code for a corresponding device file plugin in /criu/test/others/ext-kfd/Kfd_plugin.c (with the .so copied to /var/lib/criu), which for now implements a few dummy callbacks such as cr_plugin_init, cr_plugin_fini, cr_plugin_dump_file and cr_plugin_restore_file. From the cr_plugin_dump_file callback, I call into the KFD driver through a newly implemented ioctl, using the ptrace-attached file descriptor obtained in cr_plugin_dump_file. We intend to use this ioctl to dump internal GPU device state, mappings, memory, etc. and pass that data back to the plugin, which is then supposed to serialize it into img files.

Issues:
When I try to dump my test app I am running into two issues:

  1.  During task dumping, CRIU fails with the following errors even before calling into the plugin.
     *   (00.035302) Dumping path for -3 fd via self 12 [/dev/kfd]
         (00.035188) Error (criu/proc_parse.c:603): Can't handle non-regular mapping on 41272's map 7f3fcd084000
         (00.035325) Error (criu/proc_parse.c:680): Unsupported mapping found 00007f3fcd084000-00007f3fcd085000
         (00.035346) Error (criu/cr-dump.c:1248): Collect mappings (pid: 41272) failed with -1
     *   My understanding was that such device file mappings need to be handled by a plugin, but we hit this fatal error even before the plugin is called. I tried to skip the check but then ran into other issues elsewhere. Can you please advise how to handle this case, since https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603 doesn't seem to handle it well?
  2.  There is also one shared memory object, and dumping fails for it too.
     *   (00.035666) Dumping path for -3 fd via self 12 [/dev/shm/hsakmt_shared_mem]
         (00.035706) Only file size could be stored for validation for file /dev/shm/hsakmt_shared_mem
         (00.035838) Dumping path for -3 fd via self 12 [/dev/shm/0zONRb (deleted)]
         (00.035860) Error (criu/files-reg.c:978): Can't create link remap for /dev/shm/0zONRb (deleted). Use link-remap option.
         (00.035878) Error (criu/cr-dump.c:1248): Collect mappings (pid: 35917) failed with -1
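
For context, a mapping listed as "(deleted)" typically appears when a file is mapped and then unlinked while the mapping is still alive. The sketch below reproduces that state outside ROCT, assuming Linux with /proc mounted. The helper names map_unlinked_file() and count_deleted_mappings() are hypothetical, the caller supplies the mkstemp() template (for example under /tmp), and whether ROCT creates its mapping in exactly this way is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a temporary file from a mkstemp() template, map it shared,
 * then unlink it while the mapping stays alive.
 * Returns the mapping address, or NULL on failure. */
static void *map_unlinked_file(char *path_template, size_t len)
{
	void *addr;
	int fd = mkstemp(path_template);

	if (fd < 0)
		return NULL;

	if (ftruncate(fd, (off_t)len) < 0) {
		unlink(path_template);
		close(fd);
		return NULL;
	}

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* From here on the kernel reports the backing path as "(deleted)". */
	unlink(path_template);
	close(fd);

	return addr == MAP_FAILED ? NULL : addr;
}

/* Count the lines in /proc/self/maps whose path is marked "(deleted)".
 * Returns -1 if the maps file cannot be opened. */
static int count_deleted_mappings(void)
{
	char line[4352];
	int count = 0;
	FILE *maps = fopen("/proc/self/maps", "r");

	if (!maps)
		return -1;

	while (fgets(line, sizeof(line), maps))
		if (strstr(line, " (deleted)"))
			count++;

	fclose(maps);
	return count;
}
```

A checkpoint tool sees only the "(deleted)" path for such a mapping, so it has no name left on disk to reopen at restore time; the error message above points at the link-remap option for this situation.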

When I comment out https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217, which creates the shared memory mapping, and return NULL from https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050 before the actual mmap call, cr_plugin_dump_file gets called, it calls into the KFD driver via the new ioctl, and the entire dumping process finishes successfully.

I am not sure how to deal with VMAs associated with device files or shared memory mappings, so I am looking forward to your advice and suggestions on implementing whatever is missing, either in CRIU or elsewhere, based on my revised understanding of how plugins are supposed to work for device files.

Thanks in advance,
Rajneesh