<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:200018075;
        mso-list-type:hybrid;
        mso-list-template-ids:968887450 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1
        {mso-list-id:1789737930;
        mso-list-type:hybrid;
        mso-list-template-ids:-741556290 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l1:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l1:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Hi CRIU team,<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Following up on the initial discussion in this thread (<a href="https://lists.openvz.org/pipermail/criu/2020-June/045030.html">https://lists.openvz.org/pipermail/criu/2020-June/045030.html</a>), I would like the community&#8217;s advice on some issues I am facing.<o:p></o:p></p>
<p class="MsoNormal">Here is a description of my scenario:<o:p></o:p></p>
<p class="MsoNormal">In my current simplified use case, a simple test application opens the AMD KFD driver file descriptors using the ROCT library via <a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L167</a>. For now I am not creating any user mode queues or allocating any memory on the GPU; the application just opens a device handle and closes
 it after some time. While the test app is running, I try to dump it using criu. I have also implemented skeleton code for a corresponding device file plugin in /criu/test/others/ext-kfd/Kfd_plugin.c (with the .so copied to /var/lib/criu), which for now implements
 a few dummy callbacks: cr_plugin_init, cr_plugin_fini, cr_plugin_dump_file and cr_plugin_restore_file. From the cr_plugin_dump_file callback, I call into the KFD driver through a newly implemented ioctl, using the ptrace-attached file descriptor that we obtain in cr_plugin_dump_file.
 We intend to use this ioctl to dump internal GPU device state, mappings, memory, etc. and pass it back to the plugin, which is then supposed to serialize that data into img files.
<o:p></o:p></p>
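<p class="MsoNormal">For reference, the plugin skeleton described above can be sketched roughly as follows. This is a minimal sketch, not the actual Kfd_plugin.c: the callback signatures are assumed to follow CRIU&#8217;s legacy plugin interface, and KFD_IOC_DUMP_PLACEHOLDER is a hypothetical request code standing in for the new ioctl, whose real definition lives in the driver patch.<o:p></o:p></p>

```c
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL request code: a stand-in for the new KFD dump ioctl. */
#define KFD_IOC_DUMP_PLACEHOLDER 0x4b7fUL

/* Called once when CRIU loads the plugin; nothing to set up in this sketch. */
int cr_plugin_init(void)
{
	return 0;
}

/* Called once when CRIU unloads the plugin. */
void cr_plugin_fini(void)
{
}

/*
 * Called during dump with a descriptor for the external file and an id
 * that identifies it in the image (signature assumed from the legacy API).
 */
int cr_plugin_dump_file(int fd, int id)
{
	if (ioctl(fd, KFD_IOC_DUMP_PLACEHOLDER) < 0) {
		fprintf(stderr, "kfd plugin: dump ioctl failed for id %d (errno %d)\n",
			id, errno);
		return -1;
	}
	/* A real plugin would serialize the returned device state here. */
	return 0;
}

/* Dummy restore callback: restoring is not implemented in this sketch. */
int cr_plugin_restore_file(int id)
{
	(void)id;
	return -1;
}
```

<p class="MsoNormal">With an invalid descriptor the ioctl fails and the dump callback reports an error, which is the behaviour one would expect CRIU to treat as a failed dump of that file.<o:p></o:p></p>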
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Issues:<o:p></o:p></p>
<p class="MsoNormal">When I try to dump my test app I am running into two issues:<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo2">During task dumping, criu fails with the following errors even before calling into the plugin.<o:p></o:p></li><ol style="margin-top:0in" start="1" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">(00.035302) Dumping path for -3 fd via self 12 [/dev/kfd]<o:p></o:p></li></ol>
</ol>
<p class="MsoNormal" style="margin-left:.5in;text-indent:.5in">(00.035188) Error (criu/proc_parse.c:603): Can't handle non-regular mapping on 41272's map 7f3fcd084000<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035325) Error (criu/proc_parse.c:680): Unsupported mapping found 00007f3fcd084000-00007f3fcd085000<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035346) Error (criu/cr-dump.c:1248): Collect mappings (pid: 41272) failed with -1<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<ol style="margin-top:0in" start="2" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">My understanding was that a plugin is needed to handle such device file mappings, but this fatal error occurs even before the plugin is called. I tried skipping the check but then ran into
 other issues elsewhere. Could you please advise how to handle this case, since
<a href="https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603">
https://github.com/checkpoint-restore/criu/blob/1acfb4c609a70cf2cc4d47c70b47cbe99151ebcd/criu/proc_parse.c#L603</a> doesn&#8217;t seem to handle it well?<o:p></o:p></li></ol>
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo2">There is also one shared memory object, and dumping fails for it too.<o:p></o:p></li><ol style="margin-top:0in" start="1" type="a">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level2 lfo2">(00.035666) Dumping path for -3 fd via self 12 [/dev/shm/hsakmt_shared_mem]<o:p></o:p></li></ol>
</ol>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035706) Only file size could be stored for validation for file /dev/shm/hsakmt_shared_mem<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035838) Dumping path for -3 fd via self 12 [/dev/shm/0zONRb (deleted)]<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035860) Error (criu/files-reg.c:978): Can't create link remap for /dev/shm/0zONRb (deleted). Use link-remap option.<o:p></o:p></p>
<p class="MsoListParagraph" style="margin-left:1.0in">(00.035878) Error (criu/cr-dump.c:1248): Collect mappings (pid: 35917) failed with -1<o:p></o:p></p>
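<p class="MsoNormal">The &#8220;(deleted)&#8221; suffix in the log above is what the kernel reports for a mapping whose backing file has been unlinked while the mapping is still alive. A minimal sketch that reproduces such a mapping is shown below. It is not ROCT code: the file lives under /tmp rather than /dev/shm, and the names criu_demo_XXXXXX, has_deleted_mapping and map_then_unlink are made up for illustration.<o:p></o:p></p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Return 1 if /proc/self/maps has a line that names `name` and is marked
 * "(deleted)", 0 if there is no such line, -1 if the file cannot be read.
 */
int has_deleted_mapping(const char *name)
{
	char line[4096];
	int found = 0;
	FILE *maps = fopen("/proc/self/maps", "r");

	if (!maps)
		return -1;
	while (fgets(line, sizeof(line), maps)) {
		if (strstr(line, name) && strstr(line, "(deleted)")) {
			found = 1;
			break;
		}
	}
	fclose(maps);
	return found;
}

/* Map a temporary file, unlink it, then check how the mapping is listed. */
int map_then_unlink(void)
{
	char path[] = "/tmp/criu_demo_XXXXXX"; /* hypothetical file name */
	const char *base = path + 5;           /* skip the "/tmp/" prefix */
	int fd = mkstemp(path);                /* fills in the XXXXXX part */
	void *addr;
	int result;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		unlink(path);
		return -1;
	}
	addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	unlink(path); /* the mapping outlives the directory entry */
	if (addr == MAP_FAILED)
		return -1;
	result = has_deleted_mapping(base);
	munmap(addr, 4096);
	return result;
}
```

<p class="MsoNormal">Dumping a process that holds such a mapping would presumably hit the same files-reg.c path as above, where the error message itself points at the link-remap option.<o:p></o:p></p>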
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">When I comment out <a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/openclose.c#L217</a>, which creates the shared memory mapping, and return NULL from
<a href="https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050">
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/d9efc3d02d0daf101ccf67f78599cd13118001bf/src/fmm.c#L2050</a> before the actual mmap call, cr_plugin_dump_file gets called, it calls into the KFD driver via the new ioctl, and the entire dump
 finishes successfully.<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">I am not sure how to deal with VMAs associated with device files or shared memory mappings, so I am looking forward to your advice on whether I should implement something that is missing in criu, or whether I need to revise my
 understanding of how plugins are supposed to work for device files.<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Thanks in advance,<o:p></o:p></p>
<p class="MsoNormal">Rajneesh<o:p></o:p></p>
</div>
</body>
</html>