[CRIU] Looking into checkpoint/restore of ROCm applications

Pavel Emelyanov ovzxemul at gmail.com
Thu Jun 18 21:03:51 MSK 2020


On Thu, Jun 18, 2020 at 19:46, Felix Kuehling <felix.kuehling at gmail.com> wrote:

>
> On 2020-06-16 at 4:58 a.m., Pavel Emelyanov wrote:
>
> Hi, Felix
>
> Please, see my comments inline.
>
> Thank you for your reply. I'm responding inline ...
>
>
>
>
> On Tue, Jun 16, 2020 at 4:04, Felix Kuehling <felix.kuehling at gmail.com> wrote:
>
>> Hi all,
>>
>> I'm investigating the possibility of making CRIU work with ROCm, the AMD
>> Radeon Open Compute Platform. I need some advice, but I'll give you some
>> background first.
>>
>> ROCm uses the /dev/kfd device as well as /dev/dri/renderD*. I'm planning
>> to do most of the state saving using /dev/kfd with a cr_plugin_dump_file
>> callback in a plugin. I've spent some time reading documentation on
>> criu.org and also CRIU source code. At this point I believe I have a
>> fairly good understanding of the low level details of saving kernel mode
>> state associated with ROCm processes.
>>
>> I have more trouble with restoring the state. The main issue is the way
>> KFD maps system memory for device access using HMM (or get_user_pages and
>> MMU notifiers with DKMS on older kernels). This requires the VMAs to be at
>> the expected virtual addresses before we try to mirror them into the GPU
>> page table.
>>
>
> What if the system memory of this area is shared between several
> processes, and mapped in each of them at a different virtual address?
> Presumably the requirement is just to have it mapped at the correct
> virtual address in the calling process?
>
> Yes. The GPUs support per-process virtual address spaces and each process
> mirrors its own virtual memory mappings to its own GPU virtual address
> space. Whether or not the mappings share the same physical pages should not
> matter.
>
>
>
> Resuming execution on the GPU also needs to be delayed until after the GPU
>> memory mappings have been restored.
>>
>> At the time of the cr_plugin_restore_file callback, the VMAs are not at
>> the right place in the restored process, so this is too early to restore
>> the GPU memory mappings.
>>
> True. At this point some VMAs are in the so-called premapped area and some
> don't exist yet.
>
>> I can send the mappings and their properties to KFD but KFD needs to wait
>> for some later trigger event before it activates the mappings and their MMU
>> notifiers.
>>
>> So this is my question: What would be a good trigger event to indicate
>> that VMAs have been moved to their proper location by the restorer parasite
>> code?
>>
> We have "restore stages" that are used to synchronize all the processes at
> specific points. The last three occur at places where all the memory is at
> the needed virtual addresses. If you wait for a stage to complete and don't
> start the next one, then you can safely execute whatever is needed with all
> tasks' memory mapped at its final virtual addresses.
>
> I see. I'll need some more time to understand all the stages and the
> details of the synchronization mechanism. But I've already seen that there
> is synchronization in the restorer PIE code. So that's very encouraging.
>
>
> I have considered two possibilities that will not work. I'm hoping you can
>> give me some better ideas:
>>
>>    - cr_plugin_fini
>>       - Doesn't get called in all the child processes, and I'm not sure
>>       whether there is synchronization with the child processes' restore
>>       completion
>>
> Yes, it's called in the master process after all stages have completed.
> It's a cleanup hook.
>
>>
>>    - An MMU notifier on the munmap of the restorer parasite blob itself
>>       - In cr_plugin_restore_file this address is not known yet
>>
> Can the restoring code run in the criu main process instead of one of the
> child ones? If so, this could make things simpler. You can add yet another
> plugin invocation near apply_memfd_seal; this is the place where all child
> processes are stopped with their VMAs properly restored. But again, this
> place runs in the context of the criu master process, not the children.
>
> Yes, I think I can make that work. I'll need to identify the target
> process to resume in an ioctl call. To establish authentication to allow
> such operations, I'd use ptrace attached status, so that only a
> ptrace-attached parent process is allowed to save/restore state of a child
> process. Is the master process ptrace-attached to the child processes at
> the apply_memfd_seal stage?
>

No, at that time no. On restore, ptrace comes into play only near the
very end.

> Am I understanding you correctly that you're suggesting a change to CRIU
> itself, adding a new plugin callback to enable my use case?
>
If it's properly justified and sanely coded, then yes.

-- Pavel

> Thank you,
>   Felix
>
>
>
>>
>> I noticed that the child processes are resumed through sigreturn. I'm not
>> familiar with this mechanism. Does this mean there is some signal I may be
>> able to intercept just before execution of the child process resumes?
>>
> No, sigreturn is just the mechanics we use to restore the tasks' registers :)
>
> -- Pavel
>
>> Thank you in advance for your insights.
>>
>> Best regards,
>>   Felix
>>
>>
>> _______________________________________________
>> CRIU mailing list
>> CRIU at openvz.org
>> https://lists.openvz.org/mailman/listinfo/criu
>>
>