[CRIU] Questions: CRIU applicability for Fault Tolerance, Simplest setup to demonstrate Container C+R on same HW server

Thu Jan 11 12:57:05 MSK 2018

Hello CRIU team,
My name is Frederic Huve, and I work as a solution architect in the HPE Telco vertical. I have been following the CRIU project since the end of 2015, when the white paper "Leveraging Linux Containers to Achieve High Availability for Cloud Services" (Li et al., 2015) was published.
Six months ago we started an internal technology project to determine whether this technology can be used for one of our HPE Telco products. First results are very positive with Docker 17.03.0-ce and CRIU v2.3, and we'd like to move the investigation forward. To do so, we have some key questions we'd like to ask.
Background

- Basically, this HPE product is a telecom application server based on WildFly that runs in a container (Docker). It executes applications that are connected on one side to the telco networks via IP protocols using UDP, TCP (SCTP) transport, and on the other side to databases or enterprise back ends via the same transport protocols.

- We'd like to use the Docker C/R feature based on CRIU (the alternative is to use CRIU directly on the Java virtual machine process) to achieve fault tolerance (FT), which is a bit different from the classical live-migration use case.

- The objective is to checkpoint the container periodically and, if an error or failure occurs, restore the container on the same node or on another similar node from the latest complete checkpoint.
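
To make the periodic scheme above concrete, here is a minimal shell sketch built on Docker's experimental checkpoint commands. It is only an illustration under assumptions: the container name, checkpoint directory, interval, and the RUN dry-run switch are hypothetical placeholders, and the behaviour of --checkpoint-dir on restore may differ between Docker releases.

```shell
#!/bin/sh
# Sketch only: periodic checkpoints of a running container, restore from a named one.
# CONTAINER, CKPT_DIR and INTERVAL are hypothetical placeholders, not values from our setup.
CONTAINER="${CONTAINER:-telco-as}"
CKPT_DIR="${CKPT_DIR:-/var/lib/ckpt}"
INTERVAL="${INTERVAL:-60}"
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

checkpoint_once() {
    # $1 = checkpoint name; --leave-running keeps the container alive after the dump
    $RUN docker checkpoint create --checkpoint-dir "$CKPT_DIR" --leave-running "$CONTAINER" "$1"
}

restore_from() {
    # $1 = checkpoint name; the container has to be stopped before it is restored
    $RUN docker start --checkpoint-dir "$CKPT_DIR" --checkpoint "$1" "$CONTAINER"
}

checkpoint_loop() {
    # Take checkpoints cp1, cp2, ... until one of them fails
    n=0
    while :; do
        n=$((n + 1))
        checkpoint_once "cp$n" || break
        sleep "$INTERVAL"
    done
}
```

After a failure, something like `restore_from cp7` would restart the stopped container from the last checkpoint that completed; deciding which checkpoint is the latest complete one is left out of this sketch.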
Questions

- Do you believe that CRIU is appropriate for achieving such an FT feature?

- We observed that during the checkpoint operation the container is completely frozen for the entire duration of the operation. Unfortunately, telecom protocols are quite timing-sensitive (e.g. protocol stack timers). It is therefore critical for us to slice the checkpoint operation: it would take longer overall, but it would give the container the chance to handle timers or network events between the checkpoint slices. Could this be done with the incremental dump operation, or perhaps with other optimizations that are not yet public?

- If so, for a container with 2 GB of RAM, what would be your estimate of the minimal "frozen time slice" duration?
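
To clarify what we mean by slicing the checkpoint, here is a rough sketch of the incremental approach using CRIU directly: a few memory pre-dumps followed by a final dump that only has to write the pages changed since the last pre-dump. The PID, image root, and RUN dry-run switch are hypothetical placeholders, and whether this shortens the freeze enough for our timers is exactly what we are asking.

```shell
#!/bin/sh
# Sketch only: iterative memory pre-dumps followed by a final dump.
# PID and IMG_ROOT are hypothetical placeholders.
PID="${PID:-12345}"                        # root of the process tree to dump
IMG_ROOT="${IMG_ROOT:-/var/lib/criu-img}"  # parent directory for all image sets
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

predump() {
    # $1 = iteration number (1, 2, ...); later iterations reference the previous one
    dir="$IMG_ROOT/pre$1"
    $RUN mkdir -p "$dir"
    if [ "$1" -gt 1 ]; then
        # --prev-images-dir is given relative to the -D directory
        $RUN criu pre-dump -t "$PID" -D "$dir" --track-mem --prev-images-dir "../pre$(( $1 - 1 ))"
    else
        $RUN criu pre-dump -t "$PID" -D "$dir" --track-mem
    fi
}

final_dump() {
    # $1 = number of the last pre-dump; the process keeps running afterwards
    dir="$IMG_ROOT/final"
    $RUN mkdir -p "$dir"
    $RUN criu dump -t "$PID" -D "$dir" --track-mem --prev-images-dir "../pre$1" \
        --leave-running --tcp-established
}
```

A cycle would then be, for example, `predump 1; predump 2; final_dump 2`. All directories in the chain have to be kept, since the final image set refers to its parents for unchanged pages.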
We're currently working on a second version of the CRIU proof of concept, which aims to checkpoint a container, stop it, and restore it on the same HW server running CentOS 7.4 and the latest Docker and CRIU versions (please find the Docker and CRIU version info below). It looks simple, but we're facing issues on the filesystem side (probably because of temporary files being created; we're still investigating). What would be the simplest setup in order to succeed rapidly?
We would really appreciate your help in guiding us in the right direction!
Thanks,
Fred
________________________________
[root@cgcriu ~]# uname -a
Linux cgcriu 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@cgcriu ~]# docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 7
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 110GiB
Name: cgcriu
ID: PT2D:X72Q:KFLE:NXTZ:P7BB:FXKV:L4QI:MSA6:KLT5:HRPL:XJ5I:NICA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

[root@cgcriu ~]# criu -V
Version: 3.7
GitID: v3.7
