[CRIU] Checkpointing in parallel?

Pavel Emelyanov xemul at virtuozzo.com
Fri Jun 16 12:49:21 MSK 2017


On 06/16/2017 09:06 AM, Hellmanns, David Immanuel Maria wrote:
>>> On 06/14/2017 10:55 AM, Hellmanns, David Immanuel Maria wrote:
>>> Hi Guys,
>>>
>>>
>>> I am writing my master's thesis about reducing downtime during live migration, so I am also looking at CRIU and container migration.
>>>
>>>
>>> I am using Ubuntu 16.04 as the host system and run an Ubuntu 16.04 container on it. In my setup the checkpoint procedure takes ~2 sec. During
>>> this time all processes are frozen, so the downtime amounts to ~2 sec, too.
> 
>> Most of the time must be in writing the memory contents. You can check the images/stats-dump
>> file for timings.
> 
> I checked the stats-dump file but the amount of time spent on memwrite is very low compared to frozen time.

Frozen time includes memdump_time, that's why ;)

> "freezing_time": 101432, 
> "frozen_time":  2781552, 
> "memdump_time":   15627, 
> "memwrite_time":   6652, 

Indeed, memdump (and write) times are quite small, ~1% of total frozen time. Freezing time
is noticeable, ~4%. The rest is ... hm ... try disabling logging with -v0, it always helps.
And after that -- show the dump.log (with -v4 again, of course).
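
The shares can be recomputed from the four timings quoted above. A minimal
Python sketch, assuming the values are in microseconds (2781552 then matches
the ~2 sec reported earlier); the helper name share_of_frozen is made up for
this illustration:

```python
# Timings copied from the stats-dump excerpt quoted in this thread.
# Assumption: all values are in microseconds.
stats = {
    "freezing_time": 101432,
    "frozen_time": 2781552,
    "memdump_time": 15627,
    "memwrite_time": 6652,
}


def share_of_frozen(stats, key):
    """Return stats[key] as a percentage of frozen_time."""
    return 100.0 * stats[key] / stats["frozen_time"]


for key in ("freezing_time", "memdump_time", "memwrite_time"):
    print(f"{key:>14}: {share_of_frozen(stats, key):5.2f}% of frozen_time")
```

Whatever is left after subtracting these shares is the part of the frozen time
that the stats file does not explain, which is why the dump.log is the next
thing to look at.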

> "pages_scanned": 292148, 
> "pages_skipped_parent": 0, 
> "pages_written": 3219, 
> "irmap_resolve": 0
> 
> The container is idling, so the memory consumption (state size in general) is very low. Therefore, pre-dumps 
> cannot reduce the downtime. For my thesis it is interesting how I can reach the lowest possible frozen time 
> without modifying CRIU. 

You can do several things that will help -- turn off logging, use freeze cgroup
(https://criu.org/CLI/opt/--freeze-cgroup), put image directory on tmpfs. BTW, which
version of criu do you use?
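
Put together, the three suggestions could look roughly like this. A sketch
only: it assembles a criu dump command line in Python without running it. The
pid, the image directory and the freezer cgroup path are hypothetical
placeholders, and the image directory is assumed to already be a tmpfs mount.

```python
import shlex


def build_dump_command(pid, images_dir, freezer_cgroup):
    """Assemble (but do not run) a criu dump command line."""
    return [
        "criu", "dump",
        "-t", str(pid),                     # root of the process tree to dump
        "-D", images_dir,                   # image directory (assumed to be on tmpfs)
        "--freeze-cgroup", freezer_cgroup,  # freeze tasks via the freezer cgroup
        "-v0",                              # turn logging off
    ]


# Hypothetical values, for illustration only.
cmd = build_dump_command(
    pid=1234,
    images_dir="/mnt/criu-images",
    freezer_cgroup="/sys/fs/cgroup/freezer/mycontainer",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running the printed command needs root and a real process tree; if the
container is managed by a higher-level tool, the same options would have to be
passed through that tool instead.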

>>> I evaluated the code and log files and noticed that a container is treated as an ordinary process tree. All processes of this process tree
>>> are checkpointed sequentially. On average the checkpoint time of one process was ~170 ms and I have 14 processes in total. The question now
>>> arises: is it possible to checkpoint independent processes in parallel? And if it is not possible, which dependencies are the reason?
> 
>> It's possible, but for that one would need to add synchronization to the code that dumps shared
>> resources, e.g. open files, shared memory areas, etc. Also, this parallelism is only possible for
>> the dump part, while in CRIU there are also the freezing and collecting parts (though they are very short).
> 
> Ok, I assumed that synchronization is necessary but I could not estimate to what extent.
> Do you think that parallelism can bring a reasonable speed up or is the synchronization overhead too big? 

It will help, yes, locking won't take much time. The problem is that doing parallel checkpoint is
not always acceptable, e.g. on highly loaded systems. But if it is, then parallel dump is the way to
go too.
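
To illustrate the kind of synchronization meant here, a toy model in Python.
This is not CRIU code (CRIU is written in C and dumps tasks sequentially); all
names, the resource ids and the worker count are invented for the sketch. Each
task's private state is handled without locking, and only the "was this shared
resource already dumped?" check is done under a lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedDumper:
    """Makes sure each shared resource (e.g. an open file) is dumped once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dumped = set()
        self.dump_count = {}  # resource id -> number of times actually dumped

    def dump_once(self, res_id):
        # Short critical section: check-and-mark must be atomic.
        with self._lock:
            if res_id in self._dumped:
                return False
            self._dumped.add(res_id)
            self.dump_count[res_id] = self.dump_count.get(res_id, 0) + 1
            return True


def dump_task(pid, resource_ids, shared):
    private = f"private-state-of-{pid}"  # per-task state, no locking needed
    newly_dumped = [r for r in resource_ids if shared.dump_once(r)]
    return pid, private, newly_dumped


# Hypothetical tree: 14 tasks, each with two shared resources and one own file.
tasks = {pid: ["file:stdout", "shmem:1", f"file:{pid}"] for pid in range(1, 15)}
shared = SharedDumper()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda kv: dump_task(kv[0], kv[1], shared), tasks.items()))

print(len(results), "tasks dumped,", len(shared.dump_count), "distinct resources")
```

The lock is held only for the lookup, which is why the locking itself should
be cheap; the cost on a loaded system comes from the extra workers, not from
the lock.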

> Maybe I will take a look at it in my spare time.

Cool :)

>>> Because the main focus of my master's thesis is on a more general approach, I do not have the time to familiarize myself with
>>> the checkpointing code in detail.
> 
>> Then you're obliged to show us your thesis (and any other related publication) once it appears :P
> 
> Sure, I will let you know when I publish my results. However, the topic I am working on is not directly
> related to CRIU or containers. Currently I am looking at optimization on the network layer / infrastructure layer.

-- Pavel


