[CRIU] Question: Tightly coupled applications

Sat Jun 23 20:44:18 MSK 2018

On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
> *MPI applications would be to be aware of the communication that is going
> on and try to restore that communication state after the process restore. *
> 
> This is about MPI library https://www.open-mpi.org/
> 1) Running HPC applications, in containers, is gaining significant interest
> due to lightweight virtualisation of containers versus VMs (as far as I know).

Not sure if you are actually asking something here or not. The thing you
need to be concerned about is messages which are still in flight,
especially if the underlying technology is considered reliable and the
application therefore does not expect any message to get lost. This can
(theoretically) lead to a situation like this:

 * Side A sends message M1 to side B
 * Side B is checkpointed and stopped before receiving message M1
 * Side A waits for an answer from side B (message M2)
 * Network discards message M1 as the receiver on side B is gone
 * Side B is restored and keeps waiting for message M1
 * Side A keeps waiting for message M2

And now both sides are waiting forever, and you would need to replay
message M1 to get out of this.
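
To make the sequence above concrete, here is a minimal Python sketch of
it. It is only a toy model: the two queues stand in for the network, and
the function name and the drop_in_flight parameter are made up for this
illustration. It is not how MPI or CRIU work internally.

```python
from collections import deque


def run_exchange(drop_in_flight):
    """Toy model: A sends M1, B answers with M2 once it has seen M1."""
    to_b = deque()  # messages in flight from A to B
    to_a = deque()  # messages in flight from B to A

    # Side A sends M1 and then waits for M2.
    to_b.append("M1")

    # Side B is checkpointed and stopped before it receives M1.
    # With drop_in_flight=True the network discards M1 meanwhile.
    if drop_in_flight:
        to_b.clear()

    # Side B is restored and tries to receive M1.
    b_received = to_b.popleft() if to_b else None
    if b_received == "M1":
        to_a.append("M2")  # B only answers after it has seen M1

    # Side A tries to receive M2.
    a_received = to_a.popleft() if to_a else None

    return {
        "b_waiting_for_m1": b_received is None,
        "a_waiting_for_m2": a_received is None,
    }
```

With run_exchange(True) both flags come back True, which is the
situation where both sides wait forever. With run_exchange(False),
where M1 survives the checkpoint, both flags are False.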

So coordinated checkpointing, where you can make sure that no messages
are in flight at the moment the checkpoint is taken, would make it
easier.
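
A rough sketch of what "coordinated" could mean, again as a toy model in
Python: all sides stop sending first, then every in-flight message is
delivered, and only then is the state saved. The Side class and
coordinated_checkpoint() are hypothetical names for this example. Real
MPI checkpointing protocols are considerably more involved.

```python
from collections import deque


class Side:
    """One communicating process with an inbox of in-flight messages."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()  # sent to us, but not yet received
        self.delivered = []   # received, so part of the process state
        self.paused = False

    def send(self, peer, msg):
        if self.paused:
            raise RuntimeError(f"{self.name} is paused for a checkpoint")
        peer.inbox.append(msg)

    def drain(self):
        # Receive everything that is still in flight towards this side.
        while self.inbox:
            self.delivered.append(self.inbox.popleft())


def coordinated_checkpoint(sides):
    """Quiesce all sides, drain in-flight messages, then snapshot."""
    for side in sides:
        side.paused = True  # step 1: nobody sends anything new
    for side in sides:
        side.drain()        # step 2: nothing is left in flight
    assert all(not side.inbox for side in sides)
    # step 3: the snapshot now contains every message that was ever sent
    snapshot = {side.name: list(side.delivered) for side in sides}
    for side in sides:
        side.paused = False
    return snapshot
```

In this model a message M1 sent from A to B just before the checkpoint
ends up in B's part of the snapshot, so nothing has to be replayed after
a restore.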

> And, what about web applications (web client - Mysql server application in
> a container lxc- Tomcat web server in a container )? There is a
> communication also.
> 2) If i would like to save snapshots using criu of this application,
> therefore i have to restore that communication state after the process
> restore ?

The difference here is that CRIU has support for TCP/UDP communication
and can restore the connection state of the sockets. For TCP the problem
described above does not really exist as the protocol has mechanisms to
re-transmit messages (if the downtime is short and no timeouts have been
triggered). With UDP you could probably end up in a situation like the
one described above, but the application writer should expect that UDP
messages can get lost and handle it appropriately.
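
As an illustration of "handle it appropriately", an application using
UDP could resend its request when no answer arrives in time. This is a
minimal sketch using Python's standard socket module. The function name
request_with_retry() and its default values are assumptions for this
example, and it has nothing to do with CRIU itself.

```python
import socket


def request_with_retry(sock, addr, payload, retries=3, timeout=0.5):
    """Send payload over UDP and resend it if no reply arrives in time.

    Returns (reply, attempt) or raises TimeoutError after the last
    attempt.
    """
    sock.settimeout(timeout)
    for attempt in range(1, retries + 1):
        sock.sendto(payload, addr)
        try:
            reply, _ = sock.recvfrom(4096)
            return reply, attempt
        except socket.timeout:
            # The request or the reply got lost (for example because the
            # peer was being checkpointed), so try again.
            continue
    raise TimeoutError(f"no reply after {retries} attempts")
```

A lost M1 then only costs one timeout instead of blocking both sides
forever. Note that the receiver may see the same request more than once,
so it has to be able to deal with duplicates.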

> 3) I ask also if checkpoint/restore is useful for this kind of application ?

See above. Yes.

Especially in HPC you need to know what you want. Do you want fault
tolerance, i.e. to restore an application from a periodically taken
checkpoint? Or do you want to be able to migrate applications/containers
for other reasons like load balancing? Depending on the use case you
need different strategies for the messages on the wire, and it also
depends on the protocols used.

		Adrian