[CRIU] Maintain long running HPC jobs across sysupgrades using CRIU?

Stefan Kombrink stefan.kombrink at uni-ulm.de
Mon Apr 10 01:50:21 PDT 2017


Hi Pavel,

Thanks for replying :)

This is the log tail:

tail ./restore-2017-04-04T13:22:43+02:00/restore.log
(00.019892) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019907) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019913) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019919) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019925) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019931) cg: rewriting
docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
(00.019940) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.020818) cg: Opening .criu.cgyard.Q1QaGh as cg yard
(00.020947) cg: 	Making controller dir .criu.cgyard.Q1QaGh/net_cls (net_cls)
(00.021063) Error (cgroup.c:1562): cg: 	Can't mount controller dir
.criu.cgyard.Q1QaGh/net_cls: Device or resource busy

I doubt this is Docker-related, since a colleague of mine ran into
similar trouble using CRIU without Docker (sorry, no logs).
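
One thing that may be worth comparing between the two hosts is how the
cgroup v1 controllers are grouped into hierarchies. Mounting a controller
that is already bound to a hierarchy with a different controller set fails
with EBUSY, which would match the error above. This is only a guess; the
log does not confirm it. A minimal Python sketch, assuming the contents of
/proc/cgroups from each host have been saved as text (the function names
are hypothetical helpers, not part of CRIU):

```python
from collections import defaultdict


def controllers_by_hierarchy(proc_cgroups_text):
    """Return the set of controller groups (one frozenset per mounted
    cgroup v1 hierarchy) described by the text of /proc/cgroups."""
    groups = defaultdict(list)
    for line in proc_cgroups_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#subsys_name hierarchy ..." header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name, hierarchy, _num_cgroups, enabled = fields[:4]
        # Hierarchy 0 means the controller is not attached to a v1 hierarchy.
        if enabled != "1" or hierarchy == "0":
            continue
        groups[int(hierarchy)].append(name)
    return {frozenset(names) for names in groups.values()}


def grouping_differences(text_a, text_b):
    """Return (groups only on host A, groups only on host B), each as a
    sorted list of sorted controller-name lists."""
    a = controllers_by_hierarchy(text_a)
    b = controllers_by_hierarchy(text_b)
    only_a = sorted(sorted(g) for g in a - b)
    only_b = sorted(sorted(g) for g in b - a)
    return only_a, only_b
```

If the checkpoint host and the restore host report different groupings for
net_cls (for example, mounted alone on one and together with another
controller on the other), that would be a candidate explanation to look
into further.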

The kernel versions are:

CentOS 7.3
Name        : kernel
Arch        : x86_64
Version     : 3.10.0
Release     : 514.10.2.el7

CentOS 7.2
Name        : kernel
Arch        : x86_64
Version     : 3.10.0
Release     : 327.el7

>> Is updating criu safe?
> 
> Upgrading CRIU is safe in terms of -- newer criu must read older images
> and understand them. But as I said, restoring a container is much more
> than just using criu. Docker may change. Kernel can change in incompatible
> manner too, but that's rare and should be explicitly turned on.
> 
>> Is maintaining downwards compatibility for checkpoint restores a goal/on
>> the roadmap for criu and/or criu integration into docker?
> 
> Downgrading any component is not guaranteed to work in 100% of cases :) We
> sometimes change criu so that older versions stop understanding newer
> images. But not vice versa.

Then our key question is:
Does the CRIU project attempt to make C/R work across system/kernel/criu
updates (e.g. if the updated kernel version changes the way cgroups behave)?
I guess that might be hard to accomplish...

>> Is there, after all, a lesser chance of breakage using docker instead of
>> non-containerized criu apps?
> 
> Having Docker checkpoint/restore broken is more likely than having pure
> criu broken :) For us C/R is the main feature, for Docker C/R is experimental,
> so they don't monitor it well enough (yet).

thanks & greetings

Stefan


-- 
Stefan Kombrink
Universität Ulm
kiz / Abteilung Infrastruktur
+49-731-50-22439

