<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:200018075;
        mso-list-type:hybrid;
        mso-list-template-ids:968887450 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1
        {mso-list-id:1789737930;
        mso-list-type:hybrid;
        mso-list-template-ids:-741556290 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<p class="msipheader251902e5" align="Left" style="margin:0"><span style="font-size:10.0pt;font-family:Arial;color:#317100">[AMD Public Use]</span></p>
<br>
<div class="WordSection1">
<p class="MsoNormal">Hi CRIU team,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Further to initial discussion that happened on this (<a href="https://lists.openvz.org/pipermail/criu/2020-June/045030.html">https://lists.openvz.org/pipermail/criu/2020-June/045030.html</a>) thread I would like to hear some advice from
the community on some issues I am facing. <o:p></o:p></p>
<p class="MsoNormal">Here is some description of my scenario:<o:p></o:p></p>
<p class="MsoNormal">For my current simplified use case there is a simple test application that opens up amd
<a href="KFD">KFD</a> driver file descriptors using ROCT library via <a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167</a>. For now I am not creating any user mode queue or allocating any memory on the gpu. Just opening up a device handle and closing
it after some time. While my test app is running, I try to dump it using criu. I have also implemented some skeleton code for a corresponding device file plugin in /criu/test/others/ext-kfd/Kfd_plugin.c (with .so copied to /var/lib/criu) which for now implements
few (dummy for now) callbacks such as cr_plugin_init, cr_plugin_fini, cr_plugin_dump_file, cr_plugin_restore_file. From cr_plugin_dump_file callback, I call into KFD driver using the ptrace attached file descriptor that we obtained in cr_plugin_dump_file
via a newly implemented ioctl which we intend to use for dumping internal gpu device state/mappings/memory etc and pass it on back to the plugin which is then supposed to save/serialize that data in img files.
<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Issues:<o:p></o:p></p>
<p class="MsoNormal">When I try to dump my test app I am running into two issues:<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo2">During task dumping, criu fails with following errors even before calling into the plugin.<o:p></o:p></li><ol style="margin-top:0in" start="1" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">(00.035302) Dumping path for -3 fd via self 12 [/dev/kfd]<o:p></o:p></li></ol>
</ol>
<p class="MsoNormal" style="margin-left:.5in;text-indent:.5in">(00.035188) Error (criu/proc_parse.c:603): Can't handle non-regular mapping on 41272's map 7f3fcd084000<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035325) Error (criu/proc_parse.c:680): Unsupported mapping found 00007f3fcd084000-00007f3fcd085000<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035346) Error (criu/cr-dump.c:1248): Collect mappings (pid: 41272) failed with -1<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<ol style="margin-top:0in" start="2" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">My understanding was that for such device file mappings, we need plugin to handle but even before the plugin is called, we see this fatal error. I tried to skip it but then ran into
some other issues elsewhere. Can you please advise how to handle this case since
<a href="https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603">
https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603</a> doesn’t seem to handle this case well?<o:p></o:p></li></ol>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo2">There is one shared mem object and we see failure for it too.<o:p></o:p></li><ol style="margin-top:0in" start="1" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">(00.035666) Dumping path for -3 fd via self 12 [/dev/shm/hsakmt_shared_mem]<o:p></o:p></li></ol>
</ol>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035706) Only file size could be stored for validation for file /dev/shm/hsakmt_shared_mem<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035838) Dumping path for -3 fd via self 12 [/dev/shm/0zONRb (deleted)]<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035860) Error (criu/files-reg.c:978): Can't create link remap for /dev/shm/0zONRb (deleted). Use link-remap option.<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035878) Error (criu/cr-dump.c:1248): Collect mappings (pid: 35917) failed with -1<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">When I comment out <a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217</a> which creates the shared memory mapping and when I return NULL from
<a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050</a> before actual mmap call, I see cr_plugin_dump_file gets called which also calls into kfd driver via new ioctl and entire dumping process
is finished successfully. <o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I am not sure how to deal with vmas associated with device files or shared memory mappings so looking forward to your advice and further suggestions to implement something that’s either missing in criu or something else based on my revised
understanding of how plugin are supposed to work for device files.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks in advance,<o:p></o:p></p>
<p class="MsoNormal">Rajneesh<o:p></o:p></p>
</div>
</body>
</html>