[CRIU] Question: Tightly coupled applications

Sun Jun 24 19:26:21 MSK 2018

On Sat, Jun 23, 2018 at 07:49:30PM +0100, Thouraya TH wrote:
> 2018-06-23 18:44 GMT+01:00 Adrian Reber <adrian at lisas.de>:
> 
> > On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
> > > *MPI applications would be to be aware of the communication that is going
> > > on and try to restore that communication state after the process restore.*
> > >
> > > This is about MPI library https://www.open-mpi.org/
> > > 1) Running HPC applications in containers is gaining significant interest
> > > due to the lightweight virtualisation of containers versus VMs (as far
> > > as I know).
> >
> > Not sure if you are actually asking something here or not. The thing you
> > need to be concerned about is messages which are still in flight,
> > especially if the application relies on the underlying technology being
> > reliable and not losing any messages. This can lead to a situation like
> > this (theoretically):
> >
> >  * Side A sends message M1 to side B
> >  * Side B is checkpointed and stopped before receiving message M1
> >  * Side A waits for an answer from side B (message M2)
> >  * The network discards message M1 as the receiver on side B is gone
> >  * Side B is restored
> >
> >  * Side B now keeps waiting for message M1
> >  * Side A keeps waiting for message M2
> >
> > And now both sides are waiting forever and you would need to replay
> > message M1.
> >
> > So coordinated checkpointing, where you can make sure that no messages
> > are currently in flight, would make it easier.
> >
> 
> OK, so can we use either coordinated or uncoordinated checkpointing with CRIU?

CRIU does not care. CRIU does not know. In this case it is up to you to
provide whatever you need. If you want coordinated checkpointing you
have to coordinate it.
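
To illustrate the difference, here is a toy Python sketch of the lost-message
scenario quoted above and of a coordinated checkpoint that drains the network
first. This is not CRIU or MPI code: `Side`, `Network` and all function names
are made up for this example, and the "network" is just a list in memory.

```python
from collections import deque


class Side:
    """One communicating process (hypothetical model, not an MPI rank)."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()  # delivered but not yet consumed messages
        self.alive = True


class Network:
    """Toy network: a message is in flight until deliver() is called."""

    def __init__(self):
        self.in_flight = []  # list of (destination, message)
        self.dropped = []    # messages whose receiver was gone

    def send(self, dst, msg):
        self.in_flight.append((dst, msg))

    def deliver(self):
        for dst, msg in self.in_flight:
            if dst.alive:
                dst.inbox.append(msg)
            else:
                self.dropped.append(msg)  # receiver is gone, message is lost
        self.in_flight = []


def uncoordinated_checkpoint(side):
    """Snapshot and stop one side without looking at the network."""
    snapshot = list(side.inbox)
    side.alive = False
    return snapshot


def restore(side, snapshot):
    side.inbox = deque(snapshot)
    side.alive = True


def coordinated_checkpoint(net, sides):
    """Assume all senders are paused, drain the network, then snapshot."""
    net.deliver()
    assert not net.in_flight  # nothing may be in flight at snapshot time
    return {s.name: list(s.inbox) for s in sides}


def lost_message_demo():
    a, b = Side("A"), Side("B")
    net = Network()
    net.send(b, "M1")                   # A sends M1 to B
    snap = uncoordinated_checkpoint(b)  # B is stopped before receiving M1
    net.deliver()                       # network drops M1, B is gone
    restore(b, snap)                    # B comes back without M1
    return list(b.inbox), net.dropped


def coordinated_demo():
    a, b = Side("A"), Side("B")
    net = Network()
    net.send(b, "M1")
    return coordinated_checkpoint(net, [a, b])
```

In `lost_message_demo()` side B is restored with an empty inbox and M1 ends up
in `net.dropped`, which is the deadlock described above. In
`coordinated_demo()` M1 is part of B's snapshot because the network was
drained before the snapshot was taken. How you pause the senders and drain
the real transport is exactly the part CRIU leaves to you.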

> > > And what about web applications (web client, MySQL server application
> > > in an LXC container, Tomcat web server in a container)? There is
> > > communication there as well.
> > > 2) If I would like to save snapshots of this application using CRIU,
> > > do I have to restore that communication state after the process
> > > restore?
> >
> > The difference here is that CRIU has support for TCP/UDP communication
> > and can restore the connection state of the sockets. For TCP the problem
> > described above does not really exist as the protocol has mechanisms to
> > re-transmit messages (if the downtime is short and no timeouts have been
> > triggered). With UDP you could probably end up in a situation like the
> > one described above, but the application writer should expect that UDP
> > messages can get lost and handle it appropriately.
> >
> 
> OK.
> 
> >
> > > 3) I would also like to ask whether checkpoint/restore is useful for
> > > this kind of application.
> >
> > See above. Yes.
> >
> > Especially in HPC you need to know what you want. Do you want fault
> > tolerance, restoring an application from a periodically taken
> > checkpoint? Or do you want to be able to migrate applications/containers
> > for other reasons like load balancing? Depending on the use case you
> > need different strategies for the communication on the wire, and it
> > also depends on the protocols used.
> >
> 
> Yes, I want fault tolerance: restoring an application from a periodically
> taken checkpoint (an application which communicates, e.g. via MPI).

Your first step is to add CRIU support to whichever MPI variant you are
working with. As far as I know no MPI implementation has complete
support for checkpointing via CRIU.

Once that is done you can look at how those MPI variants handle coordinated
checkpointing.
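
For the "periodically taken checkpoint" part on its own, a minimal Python
sketch could look like the following. It only drives `criu dump` for a single
process tree on one node and does nothing to coordinate MPI ranks, so it is
not a solution for an MPI job. The function names, the `ckpt-NNNN` directory
layout and the injectable `run` parameter are assumptions made for this
example.

```python
import subprocess
import time
from pathlib import Path


def criu_dump_cmd(pid, image_dir):
    """Build a `criu dump` command line that leaves the process running."""
    return [
        "criu", "dump",
        "-t", str(pid),        # root of the process tree to dump
        "-D", str(image_dir),  # directory for the image files
        "--leave-running",     # keep the process alive after the dump
        "--tcp-established",   # also dump established TCP connections
    ]


def periodic_checkpoint(pid, base_dir, interval_s, count, run=subprocess.run):
    """Take `count` checkpoints of `pid`, one every `interval_s` seconds.

    `run` is injectable so the loop can be exercised without CRIU installed.
    Returns the list of image directories that were used.
    """
    dirs = []
    for i in range(count):
        image_dir = Path(base_dir) / f"ckpt-{i:04d}"
        image_dir.mkdir(parents=True, exist_ok=True)
        run(criu_dump_cmd(pid, image_dir), check=True)
        dirs.append(image_dir)
        if i + 1 < count:
            time.sleep(interval_s)
    return dirs
```

Restoring from one of these directories would then be something like
`criu restore -D <dir> --tcp-established`, run as root on a system where the
same files and network addresses are available.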

		Adrian