[CRIU] Looking into checkpoint/restore of ROCm applications

Felix Kuehling felix.kuehling at gmail.com
Thu Jun 18 19:46:06 MSK 2020


On 2020-06-16 at 4:58 a.m., Pavel Emelyanov wrote:
> Hi, Felix
>
> Please, see my comments inline.

Thank you for your reply. I'm responding inline ...


>
>
> Tue, Jun 16, 2020, 4:04 Felix Kuehling <felix.kuehling at gmail.com
> <mailto:felix.kuehling at gmail.com>>:
>
>     Hi all,
>
>     I'm investigating the possibility of making CRIU work with ROCm,
>     the AMD Radeon Open Compute Platform. I need some advice, but I'll
>     give you some background first.
>
>     ROCm uses the /dev/kfd device as well as /dev/dri/renderD*. I'm
>     planning to do most of the state saving using /dev/kfd with a
>     cr_plugin_dump_file callback in a plugin. I've spent some time
>     reading documentation on criu.org <http://criu.org> and also CRIU
>     source code. At this point I believe I have a fairly good
>     understanding of the low level details of saving kernel mode state
>     associated with ROCm processes.
>
>     I have more trouble with restoring the state. The main issue is
>     the way KFD maps system memory for device access using HMM (or
>     get_user_pages and MMU notifiers with DKMS on older kernels). This
>     requires the VMAs to be at the expected virtual addresses before
>     we try to mirror them into the GPU page table.
>
>
> What if the system memory of this area is shared between several
> processes, and mapped in each of them at a different virtual address?
> Presumably the requirement is just to have it mapped at the correct
> virtual address in the calling process?

Yes. The GPUs support per-process virtual address spaces and each
process mirrors its own virtual memory mappings to its own GPU virtual
address space. Whether or not the mappings share the same physical pages
should not matter.


>
>     Resuming execution on the GPU also needs to be delayed until after
>     the GPU memory mappings have been restored.
>
>     At the time of the cr_plugin_restore_file callback, the VMAs are
>     not at the right place in the restored process, so this is too
>     early to restore the GPU memory mappings.
>
> True. At this point some VMAs are in the so-called premapped area and
> some don't exist yet.
>
>     I can send the mappings and their properties to KFD but KFD needs
>     to wait for some later trigger event before it activates the
>     mappings and their MMU notifiers.
>
>     So this is my question: What would be a good trigger event to
>     indicate that VMAs have been moved to their proper location by the
>     restorer parasite code?
>
> We have "restore stages" that are used to synchronize all the
> processes at specific points. The last three refer to points where all
> the memory is at its final virtual address. If you wait for a stage to
> complete and don't start the next one, you can safely do whatever is
> needed while all tasks' memory is mapped at its final virtual
> address.

I see. I'll need some more time to understand all the stages and the
details of the synchronization mechanism. But I've already seen that
there is synchronization in the restorer PIE code. So that's very
encouraging.


>     I have considered two possibilities that will not work. I'm hoping
>     you can give me some better ideas:
>
>       * cr_plugin_fini
>           o Doesn't get called in all the child processes, not sure if
>             there is synchronization with the child processes' restore
>             completion
>
> Yes, it's called in the master process after all stages have
> completed. It's a cleanup hook.
>
>
>
>       * An MMU notifier on the munmap of the restorer parasite blob itself
>           o In cr_plugin_restore_file this address is not known yet
>
> Can the restoring code run in the criu main process instead of one of
> the child ones? If yes, this could make things simpler. You can add
> yet another plugin invocation near apply_memfd_seal; this is the place
> where all child processes are stopped with their VMAs properly
> restored. But again, this place runs in the context of the criu master
> process, not the children.
Yes, I think I can make that work. I'll need to identify the target
process to resume in an ioctl call. To authenticate such operations,
I'd rely on ptrace attach status, so that only a ptrace-attached parent
process is allowed to save or restore the state of a child process. Is
the master process ptrace-attached to the child processes at the
apply_memfd_seal stage?

Am I understanding you correctly that you're suggesting a change to
CRIU itself, adding a new plugin callback to enable my use case?

Thank you,
  Felix


>     I noticed that the child processes are resumed through sigreturn.
>     I'm not familiar with this mechanism. Does this mean there is some
>     signal I may be able to intercept just before execution of the
>     child process resumes?
>
> No, sigreturn is just the mechanics we use to restore task registers :)
>
> -- Pavel
>
>     Thank you in advance for your insights.
>
>     Best regards,
>       Felix
>
>
>     _______________________________________________
>     CRIU mailing list
>     CRIU at openvz.org <mailto:CRIU at openvz.org>
>     https://lists.openvz.org/mailman/listinfo/criu
>