[CRIU] Query on criu plugin for dumping device files
Alexander Mihalicyn
alexander at mihalicyn.com
Sun Aug 23 15:50:19 MSK 2020
Hi Rajneesh,
Please see my answers inline.
On Fri, Aug 21, 2020 at 5:46 AM Bhardwaj, Rajneesh
<Rajneesh.Bhardwaj at amd.com> wrote:
>
> [AMD Public Use]
>
>
> Hi CRIU team,
>
>
>
> Further to initial discussion that happened on this (https://lists.openvz.org/pipermail/criu/2020-June/045030.html) thread I would like to hear some advice from the community on some issues I am facing.
>
> Here is some description of my scenario:
>
> For my current simplified use case there is a simple test application that opens up amd KFD driver file descriptors using ROCT library via https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167. For now I am not creating any user mode queue or allocating any memory on the gpu. Just opening up a device handle and closing it after some time. While my test app is running, I try to dump it using criu. I have also implemented some skeleton code for a corresponding device file plugin in /criu/test/others/ext-kfd/Kfd_plugin.c (with .so copied to /var/lib/criu) which for now implements few (dummy for now) callbacks such as cr_plugin_init, cr_plugin_fini, cr_plugin_dump_file, cr_plugin_restore_file. From cr_plugin_dump_file callback, I call into KFD driver using the ptrace attached file descriptor that we obtained in cr_plugin_dump_file via a newly implemented ioctl which we intend to use for dumping internal gpu device state/mappings/memory etc and pass it on back to the plugin which is then supposed to save/serialize that data in img files.
>
>
>
> Issues:
>
> When I try to dump my test app I am running into two issues:
>
> During task dumping, criu fails with following errors even before calling into the plugin.
>
> (00.035302) Dumping path for -3 fd via self 12 [/dev/kfd]
>
> (00.035188) Error (criu/proc_parse.c:603): Can't handle non-regular mapping on 41272's map 7f3fcd084000
>
> (00.035325) Error (criu/proc_parse.c:680): Unsupported mapping found 00007f3fcd084000-00007f3fcd085000
>
> (00.035346) Error (criu/cr-dump.c:1248): Collect mappings (pid: 41272) failed with -1
>
> My understanding was that for such device file mappings, we need plugin to handle but even before the plugin is called, we see this fatal error. I tried to skip it but then ran into some other issues elsewhere. Can you please advise how to handle this case since https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603 doesn’t seem to handle this case well?
Of course. Here you just need to add an extra plugin hook that makes it
possible to support new file mapping types. That's not a problem.
As follows from the discussion at
https://lists.openvz.org/pipermail/criu/2020-June/045032.html
Pavel already said that you can add the extra hooks you need and then
prepare an appropriate patchset for CRIU.
Please also take a look at:
https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/proc_parse.c#L178
Here you will also need to manage the VMA flags properly. (I think you
just need to determine that a particular mapping
is really a /dev/kfd mapping, perhaps by major/minor?)
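To make the major/minor idea concrete, here is a minimal sketch (not CRIU code) of how one could check that a path refers to the same character device as a known device node. The helper name same_char_device is hypothetical, and the assumption is that the path being tested resolves to the file behind the mapping, for example an entry under /proc/<pid>/map_files/.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper, not part of CRIU: returns true when `path` and
 * `dev_path` are both character devices with the same major/minor. */
bool same_char_device(const char *path, const char *dev_path)
{
    struct stat a, b;

    assert(path != NULL && dev_path != NULL);

    /* stat() follows symlinks, so a link to a device node also works. */
    if (stat(path, &a) != 0 || stat(dev_path, &b) != 0)
        return false;

    /* st_rdev is only meaningful for device special files. */
    if (!S_ISCHR(a.st_mode) || !S_ISCHR(b.st_mode))
        return false;

    return major(a.st_rdev) == major(b.st_rdev) &&
           minor(a.st_rdev) == minor(b.st_rdev);
}
```

With something like this, a call such as same_char_device(mapped_file_path, "/dev/kfd") could decide whether the mapping should be handed to the plugin instead of failing as a non-regular mapping.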
>
> There is one shared mem object and we see failure for it too.
>
> (00.035666) Dumping path for -3 fd via self 12 [/dev/shm/hsakmt_shared_mem]
>
> (00.035706) Only file size could be stored for validation for file /dev/shm/hsakmt_shared_mem
>
> (00.035838) Dumping path for -3 fd via self 12 [/dev/shm/0zONRb (deleted)]
>
> (00.035860) Error (criu/files-reg.c:978): Can't create link remap for /dev/shm/0zONRb (deleted). Use link-remap option.
>
> (00.035878) Error (criu/cr-dump.c:1248): Collect mappings (pid: 35917) failed with -1
>
>
>
> When I comment out https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217 which creates the shared memory mapping and when I return NULL from https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050 before actual mmap call, I see cr_plugin_dump_file gets called which also calls into kfd driver via new ioctl and entire dumping process is finished successfully.
>
>
It looks like the problem is that you have a "ghost shmem file descriptor" :)
Let's see:
fd = shm_open(path, ...)
//...
shm_unlink(path);
// but you still hold the fd, and the shm VMA is NOT unmapped.
I'm not sure, but it looks like we don't support this case yet. It
would be great if you could prepare a ZDTM reproducer for this problem
and a fix. We are ready to help.
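A standalone reproducer for this state could look roughly like the sketch below. It is not a ZDTM test (the suite's own harness is not shown here), and the function names and the object name passed in are made up for illustration. It creates a POSIX shared memory object, maps it, unlinks the name, and keeps both the descriptor and the mapping alive.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: leaves the process with an open fd and a live
 * mapping of a shm object whose name has already been unlinked.
 * Returns the fd on success, -1 on failure. */
int make_ghost_shm(const char *name, size_t len, void **map_out)
{
    int fd;
    void *p;

    assert(name != NULL && map_out != NULL && len > 0);

    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) != 0)
        goto err;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto err;

    /* Remove the name; the fd and the mapping stay valid. */
    if (shm_unlink(name) != 0) {
        munmap(p, len);
        goto err;
    }

    *map_out = p;
    return fd;

err:
    shm_unlink(name);
    close(fd);
    return -1;
}

/* Returns 1 if /proc/self/fd/<fd> points at a "(deleted)" path,
 * 0 if it does not, -1 if the link cannot be read. */
int fd_looks_deleted(int fd)
{
    char link[64], target[4096];
    ssize_t n;

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return strstr(target, "(deleted)") != NULL;
}
```

A process that calls make_ghost_shm() and then sleeps should be in the state described above, with a "(deleted)" path behind both the fd and the mapping, and can be used as a dump target.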
Also, I dug into the code:
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d4b224fafc82decdf3210b68ae763a1f345bf3a1/src/perfctr.c#L139
This line looks very suspicious, because on the error path there should
be a "close(shmem_fd)", but instead the code just does
"shmem_fd = 0".
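For illustration only, the error-path pattern I mean looks like the sketch below. This is not the ROCT code; open_and_map is a hypothetical helper. The point is that the descriptor is closed when a later step fails, and that -1 rather than 0 marks "no descriptor", since 0 is itself a valid fd.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not ROCT code): open `path` and map `len` bytes.
 * Returns the open descriptor on success, -1 on failure. */
int open_and_map(const char *path, size_t len, void **out)
{
    int fd;
    void *p;

    assert(path != NULL && out != NULL);
    *out = NULL;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        /* Error path: release the descriptor. Only overwriting the
         * variable (e.g. fd = 0) would leak it. */
        close(fd);
        return -1;
    }

    *out = p;
    return fd;
}
```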
>
> I am not sure how to deal with vmas associated with device files or shared memory mappings so looking forward to your advice and further suggestions to implement something that’s either missing in criu or something else based on my revised understanding of how plugin are supposed to work for device files.
>
>
Do you plan to open-source this plugin, or will it be closed-source?
I think it will be easier at first not to write a plugin, but just to
modify the CRIU code itself as needed and publish the sources. Then we can
talk about possible
changes to the CRIU plugin design and add extra hooks if needed.
>
> Thanks in advance,
>
> Rajneesh
>
Regards, Alex