[CRIU] Maintain long running HPC jobs across sysupgrades using CRIU?
Pavel Emelyanov
xemul at virtuozzo.com
Fri Apr 7 06:19:39 PDT 2017
On 04/05/2017 01:24 PM, Stefan Kombrink wrote:
> Hi folks,
>
> I work in an HPC environment where we want to checkpoint long running
> processes every couple of days.
> So far checkpointing and restore seem to work fine with the latest
> docker-ce and criu 2.3.
> But we also occasionally do system updates, and while restore was okay
> after a kernel update, I wasn't able to restore after upgrading from
> CentOS7.2 to CentOS7.3 (error about cgroup mount failed)
Hm... May I see the logs, please?
> So the question is:
> Which upgrades/updates might break restore functionality?
Ideally no upgrades should, but sometimes the Docker developers change the
way they configure the environment for containers (e.g. cgroups) and older
images stop working :(
> Is updating criu safe?
Upgrading CRIU is safe in the sense that a newer criu must read older images
and understand them. But as I said, restoring a container involves much more
than just criu. Docker may change. The kernel can change in an incompatible
manner too, but that's rare and such a change should be explicitly turned on.
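
Since several components can change between dump and restore, it may help
to record their versions next to each checkpoint and compare them before a
restore. Below is a minimal sketch in Python, not a CRIU or Docker feature.
The helper names (tool_version, snapshot_versions, changed_components) are
hypothetical, and it assumes only that the criu and docker binaries, when
installed, answer to --version.

```python
import platform
import shutil
import subprocess


def tool_version(cmd):
    """Return the first line printed by `cmd --version`, or None if cmd is not installed."""
    if shutil.which(cmd) is None:
        return None
    out = subprocess.run([cmd, "--version"], capture_output=True, text=True)
    lines = (out.stdout or out.stderr).strip().splitlines()
    return lines[0] if lines else None


def snapshot_versions():
    """Collect the versions of the components involved in a container restore."""
    return {
        "kernel": platform.release(),
        "criu": tool_version("criu"),
        "docker": tool_version("docker"),
    }


def changed_components(at_dump, at_restore):
    """List the components whose recorded version differs between dump and restore."""
    return sorted(k for k in at_dump if at_dump[k] != at_restore.get(k))
```

Saving the result of snapshot_versions() alongside the images at dump time
and passing it to changed_components() together with a fresh snapshot before
restore shows which of kernel, criu or docker was changed in between. It
does not tell whether the restore will work, only where to look first.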
> Is maintaining downwards compatibility for checkpoint restores a goal/on
> the roadmap for criu and/or criu integration into docker?
Downgrading any component is not guaranteed to work in 100% of cases :) We
sometimes change criu so that older versions stop understanding newer
images, but not vice versa.
> Is there, after all, a lesser chance of breakage using docker instead of
> non-containerized criu apps?
Having Docker checkpoint/restore broken is more likely than having pure
criu broken :) For us C/R is the main feature; for Docker C/R is experimental,
so they don't monitor it well enough (yet).
-- Pavel