[CRIU] Maintain long running HPC jobs across sysupgrades using CRIU?

Stefan Kombrink stefan.kombrink at uni-ulm.de
Wed Apr 5 03:24:41 PDT 2017


Hi folks,

 I work in an HPC environment where we want to checkpoint long-running
processes every couple of days.
So far, checkpointing and restoring seem to work fine with the latest
docker-ce and criu 2.3.
But we also occasionally do system updates. While restore worked
after a kernel update, I wasn't able to restore after upgrading from
CentOS 7.2 to CentOS 7.3 (the error said a cgroup mount failed).

So my questions are:
Which upgrades/updates might break restore functionality?
Is updating criu safe?
Is maintaining backward compatibility for checkpoint restores a goal/on
the roadmap for criu and/or the criu integration in docker?

And, after all, is breakage less likely when using docker than with
non-containerized criu apps?

thanks & greets
Stefan

-- 
Stefan Kombrink
Universität Ulm
kiz / Abteilung Infrastruktur
+49-731-50-22439

