[CRIU] Question: Tightly coupled applications
Thouraya TH
thouraya87 at gmail.com
Sat Jun 23 21:49:30 MSK 2018
2018-06-23 18:44 GMT+01:00 Adrian Reber <adrian at lisas.de>:
> On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
> > *MPI applications would need to be aware of the communication that is going
> > on and try to restore that communication state after the process restore.*
> >
> > This is about MPI library https://www.open-mpi.org/
> > 1) Running HPC applications in containers is gaining significant interest
> > due to the lightweight virtualisation of containers versus VMs (as far as
> > I know).
>
> Not sure if you are actually asking something here or not. The thing you
> need to be concerned about is communication messages that are in flight,
> especially if the underlying technology is expected to be reliable and
> not to lose any messages. This can lead to a situation like this
> (theoretically):
>
> * Side A sends message M1 to side B
> * Side B is checkpointed and stopped before receiving message M1
> * Side A waits for an answer from side B (message M2)
> * Network discards the message M1 as receiver on side B is gone
> * Side B is restored and waits forever for message M1
>
> * Side B now keeps waiting for message M1
> * Side A keeps waiting for message M2
>
> And now both sides are waiting forever and you would need to replay
> message M1.
>
> So coordinated checkpointing, where you can make sure that no messages
> are currently in flight, would make it easier.
>
OK, so can we use either coordinated or uncoordinated checkpointing with CRIU?
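
To make the in-flight message scenario quoted above concrete, here is a minimal Python sketch. It is only a toy model: the names Side, Channel, uncoordinated_checkpoint and coordinated_checkpoint are hypothetical and are not part of CRIU or any MPI library. It assumes that "coordinated" simply means draining the channel before the snapshot is taken.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Side:
    """Hypothetical process state: just the list of messages it has received."""
    name: str
    received: list = field(default_factory=list)


class Channel:
    """Toy network link: messages sit in `in_flight` until delivered."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, msg):
        self.in_flight.append(msg)

    def deliver_all(self, receiver):
        while self.in_flight:
            receiver.received.append(self.in_flight.popleft())

    def drop_all(self):
        # Models the network discarding messages whose receiver is gone.
        self.in_flight.clear()


def uncoordinated_checkpoint(channel, side_b):
    snapshot = copy.deepcopy(side_b)  # B is checkpointed while M1 is in flight
    channel.drop_all()                # receiver is gone, so M1 is discarded
    return snapshot                   # the restored B has never seen M1


def coordinated_checkpoint(channel, side_b):
    channel.deliver_all(side_b)       # drain first: nothing is in flight
    return copy.deepcopy(side_b)


def run(checkpoint):
    channel, side_b = Channel(), Side("B")
    channel.send("M1")                # side A sends M1
    restored_b = checkpoint(channel, side_b)
    channel.deliver_all(restored_b)   # deliver whatever survived the checkpoint
    return restored_b.received
```

Calling run(uncoordinated_checkpoint) leaves the restored side B without M1, which is the point where both sides would wait on each other; run(coordinated_checkpoint) delivers M1 before the snapshot is taken.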
>
> > And what about web applications (web client, MySQL server in an LXC
> > container, Tomcat web server in a container)? There is communication
> > there as well.
> > 2) If I would like to save snapshots of this application using CRIU,
> > do I have to restore that communication state after the process
> > restore?
>
> The difference here is that CRIU has support for TCP/UDP communication
> and can restore the connection state of the sockets. For TCP the problem
> described above does not really exist as the protocol has mechanisms to
> re-transmit messages (if the downtime is short and no timeouts have been
> triggered). With UDP you could probably end up in a situation like the
> one described above, but the application writer should expect that UDP
> messages can get lost and handle it appropriately.
>
OK.
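
As an illustration of the UDP point quoted above (the application has to expect lost messages and handle them), here is a small Python sketch of application-level retransmission. The names send_with_retry and LossyLink are hypothetical, and the link is a toy stand-in for a real socket with a receive timeout, not an actual network API.

```python
def send_with_retry(send, wait_for_reply, max_attempts=5):
    """Resend a request until a reply arrives or the attempts run out.

    `send` and `wait_for_reply` are hypothetical callables supplied by the
    caller; `wait_for_reply` returns the reply, or None on timeout.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        reply = wait_for_reply()
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no reply after {max_attempts} attempts")


class LossyLink:
    """Toy peer that loses the first `drops` requests, then answers with M2."""

    def __init__(self, drops):
        self.drops = drops
        self.pending = None

    def send(self):
        if self.drops > 0:
            self.drops -= 1       # request lost, e.g. the peer was being restored
            self.pending = None
        else:
            self.pending = "M2"   # request arrived, a reply is on its way

    def wait_for_reply(self):
        reply, self.pending = self.pending, None
        return reply
```

With link = LossyLink(2), send_with_retry(link.send, link.wait_for_reply) gets the reply on the third attempt. If the peer stays away for longer than max_attempts allows, the caller gets a TimeoutError and has to decide what to do.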
>
> > 3) I would also like to ask whether checkpoint/restore is useful for
> > this kind of application?
>
> See above. Yes.
>
> Especially in HPC you need to know what you want. Do you want fault
> tolerance, restoring an application from a periodically taken
> checkpoint? Or do you want to be able to migrate applications/containers
> for other reasons, like load balancing? Depending on the use case you
> need different strategies for the communication on the wire, and it
> also depends on the protocols used.
>
Yes, I want fault tolerance: restoring an application from a periodically
taken checkpoint (an application that communicates, such as an MPI
application).
Kind regards.
>
> Adrian
>