[CRIU] Maintain long running HPC jobs across sysupgrades using CRIU?

Pavel Emelyanov xemul at virtuozzo.com
Mon Apr 10 06:18:23 PDT 2017


On 04/10/2017 11:50 AM, Stefan Kombrink wrote:
> Hi Pavel,
> 
>  thanks for replying :)
> 
> This is the log tail:
> 
> tail ./restore-2017-04-04T13:22:43+02:00/restore.log
> (00.019892) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019907) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019913) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019919) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019925) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019931) cg: rewriting
> docker/3cd49e65ade7f1221b63e078c104700fd36a165662363c54f23e8b17f0d7fc36
> to /docker/3cdc03490529b739a14096d3c012f325411ef3149f9a72662920fd99ef152305
> (00.019940) cg: Preparing cgroups yard (cgroups restore mode 0x4)
> (00.020818) cg: Opening .criu.cgyard.Q1QaGh as cg yard
> (00.020947) cg: 	Making controller dir .criu.cgyard.Q1QaGh/net_cls (net_cls)
> (00.021063) Error (cgroup.c:1562): cg: 	Can't mount controller dir
> .criu.cgyard.Q1QaGh/net_cls: Device or resource busy

Ah, this can happen when the two hosts have different cgroups merging
policies. May I look at /proc/cgroups on both?
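
One way to compare the two hosts is to group the controllers listed in
/proc/cgroups by hierarchy ID: controllers that share a non-zero ID are
mounted together (merged). A minimal sketch, assuming the usual four-column
layout (subsys_name, hierarchy, num_cgroups, enabled). The function names and
the sample contents are made up for illustration and are not taken from the
hosts in this thread:

```python
from collections import defaultdict


def controllers_by_hierarchy(proc_cgroups_text):
    """Map hierarchy ID -> sorted controller names sharing that hierarchy."""
    groups = defaultdict(list)
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith("#"):
            continue  # skip the header line and anything malformed
        name, hierarchy = fields[0], int(fields[1])
        groups[hierarchy].append(name)
    return {hid: sorted(names) for hid, names in groups.items()}


def merge_policy(proc_cgroups_text):
    """Set of co-mounted controller groups; hierarchy 0 is treated as unmounted."""
    by_id = controllers_by_hierarchy(proc_cgroups_text)
    return {tuple(names) for hid, names in by_id.items() if hid != 0}


# Hypothetical sample contents (assumption, for illustration only).
HOST_A = """#subsys_name hierarchy num_cgroups enabled
cpu 2 10 1
cpuacct 2 10 1
net_cls 3 1 1
net_prio 4 1 1
"""
HOST_B = """#subsys_name hierarchy num_cgroups enabled
cpu 2 10 1
cpuacct 2 10 1
net_cls 3 1 1
net_prio 3 1 1
"""

if __name__ == "__main__":
    # Controller groups present on one host but not on the other.
    print(sorted(merge_policy(HOST_A) ^ merge_policy(HOST_B)))
```

On real hosts the text would come from reading /proc/cgroups on each machine.
Any group that shows up in the difference (for example net_cls mounted alone
on one side and together with net_prio on the other) is a candidate for the
"Device or resource busy" failure above.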

CRIU has several cgroups management policies. You may try the
--manage-cgroups option when using CRIU w/o Docker:
https://criu.org/CGroups#CGroups_restoring_strategy
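
For trying the strategies one by one, the restore command line can be built
like this. It is only a sketch: the mode names are the ones I recall from the
wiki page above and should be verified against `criu restore --help` on the
installed version, and the images directory is a placeholder.

```python
# Mode names as recalled from the CRIU wiki (assumption; verify locally).
KNOWN_MODES = ("none", "props", "soft", "full", "strict")


def restore_command(images_dir, mode="soft"):
    """Build the argv list for a plain (non-Docker) `criu restore`."""
    if mode not in KNOWN_MODES:
        raise ValueError("unknown --manage-cgroups mode: %r" % (mode,))
    return [
        "criu", "restore",
        "--images-dir", images_dir,       # where the checkpoint images live
        "--manage-cgroups=" + mode,       # cgroups restoring strategy
    ]


if __name__ == "__main__":
    # "/path/to/images" is a placeholder, not a path from this thread.
    for mode in KNOWN_MODES:
        print(" ".join(restore_command("/path/to/images", mode)))
```

The sketch only prints the commands. Real restores usually need further
options (and root), depending on what was dumped.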

> I doubt it is docker related in this case as a colleague of mine
> experienced similar troubles using Criu without docker (sorry, no logs).
> 
> The kernel versions are:
> 
> CentOS7.3
> Name        : kernel
> Arch        : x86_64
> Version     : 3.10.0
> Release     : 514.10.2.el7
> 
> CentOS7.2
> Name        : kernel
> Arch        : x86_64
> Version     : 3.10.0
> Release     : 327.el7

Just a note -- 3.10 might be too old. I would run criu check to make sure
the kernel has all the needed stuff back-ported.
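
If the check has to be run on many nodes, it can be wrapped in a small
helper. A sketch, assuming `criu` is in PATH and that a zero exit status
means the check passed. The helper name and the injectable `run` parameter
are made up for illustration:

```python
import subprocess


def kernel_passes_criu_check(run=subprocess.run):
    """Run `criu check` and return (passed, combined output)."""
    result = run(
        ["criu", "check"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # keep warnings and errors together
        universal_newlines=True,    # decode output as text
    )
    return result.returncode == 0, result.stdout
```

Calling kernel_passes_criu_check() typically needs root. The `run` parameter
is there only so the helper can be exercised without criu installed.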

>>> Is updating criu safe?
>>
>> Upgrading CRIU is safe in the sense that newer criu must read older images
>> and understand them. But as I said, restoring a container is much more
>> than just using criu. Docker may change. The kernel can change in an
>> incompatible manner too, but that's rare and should be explicitly turned on.
>>
>>> Is maintaining downwards compatibility for checkpoint restores a goal/on
>>> the roadmap for criu and/or criu integration into docker?
>>
>> Downgrading any component is not guaranteed to work in 100% of cases :) We
>> sometimes change criu so that older versions stop understanding newer
>> images. But not vice versa.
> 
> Then our key question is:
> Does the CRIU project attempt to make C/R work across system/kernel/criu
> updates (i.e. if the updated kernel version changes the way cgroups behave)?

It does.

> I guess that might be hard to accomplish...

And yes :) sometimes it's not trivial to make things "just work".
In particular, migration between different cgroups merge policies (if my guess
above is correct) is of course something to fix.

I've created an issue on github so you could follow it. Or contribute the fix
if you get to this before we do ;) https://github.com/xemul/criu/issues/307

>>> Is there, after all, a lesser chance of breakage using docker instead of
>>> non-containerized criu apps?
>>
>> Having Docker checkpoint/restore broken is more likely than having pure
>> criu broken :) For us C/R is the main feature, for Docker C/R is experimental,
>> so they don't monitor it well enough (yet).
> 
> thanks & greetings
> 
> Stefan
> 
> 


