[CRIU] cgroups restore issue at AWS

Fri Aug 21 15:39:46 PDT 2015

Hi,

We're seeing some issues doing docker-based restores on AWS machines.  On the first try the restore.log shows the following output:
  (00.000551) cg: Preparing cgroups yard (cgroups restore mode 0x4)
  (00.000607) cg: Opening .criu.cgyard.aRmYI0 as cg yard
  (00.000617) cg:         Making controller dir .criu.cgyard.aRmYI0/cpuset (cpuset)
  (00.000691) cg: Created cgroup dir cpuset/system.slice/docker-f00d0fe34bcc352377f4750f99fc4a649bd14db65fc15639df35043c62f7733a.scope
  (00.000733) Error (cgroup.c:978): cg: Failed closing cpuset/system.slice/docker-f00d0fe34bcc352377f4750f99fc4a649bd14db65fc15639df35043c62f7733a.scope/cpuset.cpus: Invalid argument
  (00.000737) Error (cgroup.c:1083): cg: Restoring special cpuset props failed!

On the 2nd try the restore works because the cgroup directory already exists, so CRIU skips restoring its properties:

  (00.000785) cg: Preparing cgroups yard (cgroups restore mode 0x4)
  (00.000840) cg: Opening .criu.cgyard.BQ2bKQ as cg yard
  (00.000850) cg: 	Making controller dir .criu.cgyard.BQ2bKQ/cpuset (cpuset)
  (00.000877) cg: Determined cgroup dir cpuset/system.slice/docker-404a13eab68e35753ee2c66f636aa727aa2c9a7723671d25cc9ffb0ede574178.scope already exist
  (00.000880) cg: Skip restoring properties on cgroup dir cpuset/system.slice/docker-404a13eab68e35753ee2c66f636aa727aa2c9a7723671d25cc9ffb0ede574178.scope

It appears to be a timing issue on the fclose(f) call in cgroup.c.  I've tried using CG_MODE_SOFT and CG_MODE_FULL and neither has any effect: the 1st attempt fails and the 2nd succeeds.

To work around the issue, we've created a fork with the changes below, and the issue hasn't recurred.  In fact, not a single "Failed to flush..." message has been printed in the logs, so it seems to be a matter of split-second timing: the for loop appears to allow just enough time for the handle to flush.

diff --git a/cgroup.c b/cgroup.c
index a4e0146..9495206 100644
--- a/cgroup.c
+++ b/cgroup.c
@@ -950,6 +950,8 @@ static int restore_cgroup_prop(const CgroupPropEntry * cg_prop_entry_p,
 {
        FILE *f;
        int cg;
+       int flushcounter=0;
+       int maxtries=500;
 
        if (!cg_prop_entry_p->value) {
                pr_err("cg_prop_entry->value was empty when should have had a value");
@@ -974,9 +976,26 @@ static int restore_cgroup_prop(const CgroupPropEntry * cg_prop_entry_p,
                return -1;
        }
 
+       /* The fclose() below was failing intermittently with EINVAL at AWS*/
+       /* So we try fflush() in a loop until it succeeds or we've */
+       /* tried it a bunch. */
+       for (;;) {
+               flushcounter++;
+               if (fflush(f) == 0) {
+                       break;
+               }
+               if (flushcounter > maxtries) {
+                       pr_perror("Max fflush() tries %d exceeded.  Moving along anyway.\n",maxtries);
+                       break;
+               }
+               if (fflush(f) != 0) {
+                 pr_perror("Failed to flush %s [%d/%d]\n", path, flushcounter,maxtries);
+               }
+       }
+
        if (fclose(f) != 0) {
-               pr_perror("Failed closing %s", path);
-               return -1;
+         pr_perror("Failed closing %s\n",path);
+         return -1; 
        }

Can anyone reproduce the issue or offer a suggestion on how we should proceed?

Mark