[CRIU] [RFC PATCH 20/20] criu/files: RFC Don't cache fd for amdgpu devices

Sat May 1 04:58:45 MSK 2021

From: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>

Restore fails when we checkpoint/restore (CR) multiple independent kfd
processes, because criu caches the ids of device files that share the
same (mnt_id, inode) pair. This might not be the optimal solution, but
each kfd process needs a unique id for a successful restore. Decoded
fdinfo files confirm duplicate ids for amdgpu devices: the cache lookup
during checkpoint prevents the device VMA metadata from being written
to the image files, so on a subsequent restore the desired id is not
found and restore fails, even though the actual image file, e.g.
renderDXXX.img, exists.

This can be tested with any simple kfd application that uses the AMD
ROCt library and creates a VRAM buffer object. CR of each such process
individually works fine, but when a shell script launches more than one
of them and we perform CR on that script, restore fails without this
change.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
---
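Note (illustration only, not part of the patch): the caching behaviour
described in the commit message can be sketched with a toy cache keyed
by (mnt_id, inode). All names below (toy_key, toy_cache_lookup,
toy_id_generate, bypass_cache) are hypothetical and are not CRIU
symbols. The return values mirror the ones visible in the diff: 0 when
a cached id is reused, 1 when a fresh id is handed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the (mnt_id, inode) key; not CRIU's fd_parms. */
struct toy_key {
	int mnt_id;
	unsigned long ino;
};

struct toy_entry {
	struct toy_key key;
	unsigned int id;
};

#define TOY_CACHE_MAX 16

static struct toy_entry toy_cache[TOY_CACHE_MAX];
static size_t toy_cache_len;
static unsigned int toy_next_id = 1;

/* Linear search for an entry with the same (mnt_id, inode) pair. */
static struct toy_entry *toy_cache_lookup(const struct toy_key *k)
{
	for (size_t i = 0; i < toy_cache_len; i++)
		if (toy_cache[i].key.mnt_id == k->mnt_id &&
		    toy_cache[i].key.ino == k->ino)
			return &toy_cache[i];
	return NULL;
}

/*
 * Returns 0 when a cached id was reused and 1 when a fresh id was
 * generated. With bypass_cache set, a cache hit still yields a fresh
 * id, which is the behaviour this RFC proposes for amdgpu devices.
 */
static int toy_id_generate(const struct toy_key *k, bool bypass_cache,
			   unsigned int *id)
{
	struct toy_entry *e;

	assert(k && id);

	e = toy_cache_lookup(k);
	if (e && !bypass_cache) {
		*id = e->id;
		return 0;
	}

	*id = toy_next_id++;
	/* Only a genuine miss populates the cache. */
	if (!e && toy_cache_len < TOY_CACHE_MAX) {
		toy_cache[toy_cache_len].key = *k;
		toy_cache[toy_cache_len].id = *id;
		toy_cache_len++;
	}
	return 1;
}
```

In this sketch, two independent processes that open the same render
node present the same key, so the second call with bypass_cache false
gets the first caller's id. Passing bypass_cache true gives each caller
its own id.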
 criu/file-ids.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/criu/file-ids.c b/criu/file-ids.c
index 006e47d64..865fd29e6 100644
--- a/criu/file-ids.c
+++ b/criu/file-ids.c
@@ -77,11 +77,18 @@ int fd_id_generate_special(struct fd_parms *p, u32 *id)
 {
 	if (p) {
 		struct fd_id *fi;
+		struct stat st_kfd;
 
 		fi = fd_id_cache_lookup(p);
 		if (fi) {
-			*id = fi->id;
-			return 0;
+			if (stat("/dev/kfd", &st_kfd) == -1) {
+				*id = fi->id;
+				return 0;
+			} else {
+				/* Don't cache the id */
+				*id = fd_tree.subid++;
+				return 1;
+			}
 		}
 	}
 
-- 
2.17.1


