[CRIU] Questions: CRIU applicability for Fault Tolerance, Simplest setup to demonstrate Container C+R on same HW server

Huve, Frederic frederic.huve at hpe.com
Wed Jan 24 21:34:08 MSK 2018


Pavel,

I really appreciate your response. Is there a specific forum where CRIU users can raise questions like these, which are not directly related to the evolution of the CRIU project?
The next steps are now:
- To run our tests with TCP connections (we need to investigate further the advantages of using the --tcp-established option). Have you heard of any tests with SCTP transport?
- To use incremental dumps. As we are using Docker, I'm not sure how to proceed, since this does not appear to be available from the Docker C/R interface.

Last but not least, on my product side, to move from a prototyping exercise to a production environment, we'll need RHEL and potentially Docker to integrate CRIU as a standard extension. Do you know if this is planned?
The alternative would be to subcontract to companies that can support CRIU on behalf of HPE. Do you know if there are such companies?

Thanks
Fred

-----Original Message-----
From: Pavel Emelyanov [mailto:xemul at virtuozzo.com]
Sent: Thursday, January 18, 2018 5:41 PM
To: Huve, Frederic <frederic.huve at hpe.com>; criu at openvz.org
Cc: Prost, Nicolas <nicolas.prost at hpe.com>
Subject: Re: [CRIU] Questions: CRIU applicability for Fault Tolerance, Simplest setup to demonstrate Container C+R on same HW server

On 01/11/2018 12:57 PM, Huve, Frederic wrote:
> Hello CRIU team,
>
> My name is Frederic Huve, and I work as a solution architect in the HPE Telco vertical. I have been following the CRIU project since the end of 2015, when the "Leveraging Linux Containers to Achieve High Availability for Cloud Services, Li et al., 2015" white paper was published.
>
> Six months ago we started an internal technology project to find out whether this technology can be used for one of our HPE Telco products. The first results are very positive with Docker 17.03.0-ce & CRIU v2.3, and we'd like to move forward with the investigation. To do so, we have some key questions we'd like to ask.
>
> Background
>
> - Basically, this HPE product is a Telecom Application Server based on WildFly, running in a container (e.g. Docker). It executes applications that are connected on one side to the Telco networks via IP protocols using UDP, TCP (SCTP) transport, and on the other side to databases or enterprise back-ends via the same transport protocols.
>
> - We'd like to use the Docker C/R feature based on CRIU (the alternative is to use CRIU directly on the Java virtual machine process) to achieve Fault Tolerance (FT), which is a bit different from the classical live migration use case.
>
> - The objective is to periodically checkpoint the container and, if an error/failure occurs, restore the container on the same or another similar node based on the latest complete checkpoint file.

Wow :)

> Questions
>
> - Do you believe that CRIU is appropriate for achieving such an FT feature?

Sure we do! Though many tricky issues appear. E.g. what to do with TCP connections (most likely drop all), or whether or not to use incremental checkpoints (likely yes).

> - We observed that during the checkpoint operation, the container is "completely frozen" for the whole duration of the operation. Unfortunately, telecom protocols are quite timing-sensitive (e.g. protocol stack timers). It is therefore critical to slice the checkpoint operation (it will take longer overall, but gives the container the chance to handle timers or network events between slices), as with the incremental dump operation, or perhaps other optimizations that are not yet public?

There's such a thing called "pre-dump". It does a checkpoint, but without stopping the processes. Of course, the resulting dump is inconsistent, so after a pre-dump you have to do a regular dump, which picks up the memory changes and dumps the rest. The freeze time would be much smaller in this case.
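
A minimal sketch of that sequence, assuming the `criu pre-dump` / `criu dump` command-line interface with `--tree`, `--images-dir`, `--prev-images-dir` and `--track-mem`. The helper name `iterative_dump_commands`, the PID and the directory layout are hypothetical, and the script only builds the command lines, it does not run CRIU:

```python
from pathlib import Path


def iterative_dump_commands(pid, base_dir, pre_dumps=2):
    """Build criu command lines: `pre_dumps` pre-dumps, then one final dump.

    Each iteration writes into its own numbered images directory
    (hypothetical layout: <base_dir>/0, <base_dir>/1, ...).
    """
    base = Path(base_dir)
    commands = []
    prev = None
    for i in range(pre_dumps + 1):
        final = i == pre_dumps
        images = base / str(i)
        cmd = ["criu", "dump" if final else "pre-dump",
               "--tree", str(pid), "--images-dir", str(images)]
        if prev is not None:
            # Point at the previous iteration, relative to the images dir,
            # so that only memory changed since then is written again.
            cmd += ["--prev-images-dir", f"../{prev.name}"]
        if final:
            # Assumption: memory tracking is requested on the final dump.
            cmd.append("--track-mem")
        commands.append(cmd)
        prev = images
    return commands


if __name__ == "__main__":
    for cmd in iterative_dump_commands(1234, "/tmp/ckpt"):
        print(" ".join(cmd))
```

Only the last command freezes the process tree for the whole write; the pre-dumps are the part that runs while the application keeps working.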

> - If so, for a container with 2G of RAM, what would be your bet on the minimal "frozen time slice" duration?

It greatly depends on the storage speed you use, since most of the time criu will write those 2gigs on it :)
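
As a back-of-the-envelope illustration of that dependency, the write time is roughly the amount of memory to dump divided by the storage throughput. The throughput figures and the 5% dirty fraction below are assumed round numbers for illustration, not measurements of any real setup:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def write_time_seconds(size_bytes, throughput_bytes_per_s):
    """Lower bound on dump time: bytes to write divided by storage throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    container_ram = 2 * GIB
    # Assumed sequential write speeds, for illustration only.
    storages = [("slow disk", 100 * MIB), ("SSD", 500 * MIB)]
    # Assumed share of memory changed since the last pre-dump.
    dirty_fraction = 0.05
    for name, speed in storages:
        full = write_time_seconds(container_ram, speed)
        incremental = write_time_seconds(container_ram * dirty_fraction, speed)
        print(f"{name}: full dump ~{full:.2f}s, after pre-dump ~{incremental:.2f}s")
```

The estimate ignores everything except the memory write, so it should be read as a lower bound on the freeze time.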

> We're currently working on a 2nd version of the CRIU proof of concept, which aims to checkpoint a container, stop it, and restore it on the same HW server running CentOS 7.4 and the latest Docker & CRIU versions (please find the Docker & CRIU version info below). It looks simple, but we're facing issues on the filesystem side (probably because of temporary files being created; we're still investigating). What would be the simplest setup in order to succeed rapidly?

CRIU doesn't work with filesystem contents, yes. It expects the files to be in exactly the same state as they were at dump time. The most common thing to do in this case is to create an FS (or disk) snapshot too, so that at restore time you have exactly the same FS as you had at dump time. But in the FT case this may be rather slow (and you have to use network FS/storage), so other options may apply. In any case, the general rule is to keep the files' structure, size and contents.
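
One possible way to check that rule while investigating is to record a manifest of the container's files at dump time and compare it just before restore. This is only a sketch: `snapshot_manifest` and `changed_files` are hypothetical helpers, not part of CRIU or Docker, and hashing every file may be too slow for large trees:

```python
import hashlib
from pathlib import Path


def snapshot_manifest(root):
    """Map each file's relative path to (size, sha256) for everything under root."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = (path.stat().st_size, digest)
    return manifest


def changed_files(before, after):
    """Return the paths that were added, removed, or modified between manifests."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Taking `snapshot_manifest()` of the directory right after the dump and again before the restore, then passing both to `changed_files()`, lists the temporary files that would need to be preserved or snapshotted.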

> We'd really appreciate your help in guiding us in the right direction!
>
> Thanks,
>
> Fred
>
> ----------------------------------------------------------------------
>
> [root at cgcriu ~]# uname -a
> Linux cgcriu 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> [root at cgcriu ~]# docker info
> Containers: 1
>  Running: 1
>  Paused: 0
>  Stopped: 0
> Images: 7
> Server Version: 17.09.0-ce
> Storage Driver: overlay2
>  Backing Filesystem: xfs
>  Supports d_type: true
>  Native Overlay Diff: true
> Logging Driver: json-file
> Cgroup Driver: cgroupfs
> Plugins:
>  Volume: local
>  Network: bridge host ipvlan macvlan null overlay
>  Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
> Swarm: inactive
> Runtimes: runc
> Default Runtime: runc
> Init Binary: docker-init
> containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
> runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
> init version: 949e6fa
> Security Options:
>  seccomp
>   Profile: default
> Kernel Version: 3.10.0-693.el7.x86_64
> Operating System: CentOS Linux 7 (Core)
> OSType: linux
> Architecture: x86_64
> CPUs: 32
> Total Memory: 110GiB
> Name: cgcriu
> ID: PT2D:X72Q:KFLE:NXTZ:P7BB:FXKV:L4QI:MSA6:KLT5:HRPL:XJ5I:NICA
> Docker Root Dir: /var/lib/docker
> Debug Mode (client): false
> Debug Mode (server): false
> Registry: https://index.docker.io/v1/
> Experimental: true
> Insecure Registries:
>  127.0.0.0/8
> Live Restore Enabled: false
>
> [root at cgcriu ~]# criu -V
> Version: 3.7
> GitID: v3.7
>
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
>