[CRIU] Maintain long running HPC jobs across sysupgrades using CRIU?

Fri Apr 7 06:19:39 PDT 2017

On 04/05/2017 01:24 PM, Stefan Kombrink wrote:
> Hi folks,
> 
>  I work in a HPC environment where we want to checkpoint long running
> processes every couple of days.
> So far checkpointing and restore seems to work fine with latest
> docker-ce and criu 2.3.
> But we also occasionally do system updates and while restore was okay
> after a kernel update I wasn't able to restore after upgrading from
> CentOS7.2 to CentOS7.3 (error about cgroup mount failed)

Hm... May I see the logs, please?

> So the question is?
> Which upgrades/updates might break restore functionality?

Ideally no upgrade should, but sometimes the Docker guys change the way
they configure the environment for containers (e.g. cgroups) and older
images stop working :(

> Is updating criu safe?

Upgrading CRIU is safe in the sense that a newer criu must be able to read
older images and understand them. But as I said, restoring a container is
much more than just using criu. Docker may change. The kernel can change in
an incompatible manner too, but that's rare and such changes have to be
turned on explicitly.
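
Since a restore depends on the kernel, criu and Docker together, it may help
to record their versions next to each checkpoint and compare them before a
restore. A minimal sketch in Python, not something criu or Docker provide:
the file name environment.json and all function names are assumptions made
up for illustration, and only the standard --version flags of the two tools
are used.

```python
import json
import platform
import subprocess
from pathlib import Path

MANIFEST = "environment.json"  # hypothetical file name, not a criu/Docker convention


def tool_version(cmd):
    """Return the first line a version command prints, or None if it cannot run."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = (out.stdout or out.stderr).strip().splitlines()
    return lines[0] if lines else None


def snapshot_environment():
    """Collect the versions of the components a restore depends on."""
    return {
        "kernel": platform.release(),
        "criu": tool_version(["criu", "--version"]),
        "docker": tool_version(["docker", "--version"]),
    }


def save_manifest(checkpoint_dir, env):
    """Write the recorded versions next to the checkpoint images."""
    path = Path(checkpoint_dir) / MANIFEST
    path.write_text(json.dumps(env, indent=2, sort_keys=True))
    return path


def changed_components(checkpoint_dir, current):
    """Map each component whose version differs to (saved, current)."""
    saved = json.loads((Path(checkpoint_dir) / MANIFEST).read_text())
    return {
        key: (saved.get(key), current.get(key))
        for key in sorted(set(saved) | set(current))
        if saved.get(key) != current.get(key)
    }
```

Calling save_manifest(dir, snapshot_environment()) right after a checkpoint
and changed_components(dir, snapshot_environment()) before a restore shows
which components were upgraded in between. A difference does not mean the
restore will fail; it only narrows down where to look if it does.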

> Is maintaining downwards compatibility for checkpoint restores a goal/on
> the roadmap for criu and/or criu integration into docker?

Downgrading any component is not guaranteed to work in 100% of cases :) We
sometimes change criu so that older versions stop understanding newer
images, but not vice versa.

> Is there, after all, a lesser chance of breakage using docker instead of
> non-containerized criu apps?

Docker checkpoint/restore is more likely to be broken than pure criu :)
For us C/R is the main feature; for Docker C/R is experimental, so they
don't monitor it well enough (yet).

-- Pavel