[CRIU] Question: Tightly coupled applications

Thouraya TH thouraya87 at gmail.com
Sun Jun 24 19:56:52 MSK 2018


OK.
Thank you so much for all the information.
Best regards.

2018-06-24 17:28 GMT+01:00 Adrian Reber <adrian at lisas.de>:

> On Sun, Jun 24, 2018 at 02:05:46PM +0100, Thouraya TH wrote:
> > 2018-06-23 19:49 GMT+01:00 Thouraya TH <thouraya87 at gmail.com>:
> >
> > > 2018-06-23 18:44 GMT+01:00 Adrian Reber <adrian at lisas.de>:
> > >
> > >> On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
> > >> > *MPI applications would be to be aware of the communication that is
> > >> > going on and try to restore that communication state after the process
> > >> > restore.*
> > >> >
> > >> > This is about the MPI library https://www.open-mpi.org/
> > >> > 1) Running HPC applications in containers is gaining significant
> > >> > interest due to the lightweight virtualisation of containers versus VMs
> > >> > (as far as I know).
> > >>
> > >> Not sure if you are actually asking something here or not. The thing you
> > >> need to be concerned about is communication messages which are in
> > >> flight, especially if the underlying technology is reliable and does not
> > >> lose any messages. This can lead to a situation like this
> > >> (theoretically):
> > >>
> > >>  * Side A sends message M1 to side B
> > >>  * Side B is checkpointed and stopped before receiving message M1
> > >>  * Side A waits for an answer from side B (message M2)
> > >>  * Network discards the message M1 as receiver on side B is gone
> > >>  * Side B is restored and waits forever for message M1
> > >>
> > >>  * Side B now keeps waiting for message M1
> > >>  * Side A keeps waiting for message M2
> > >>
> > >> And now both sides are waiting forever and you would need to replay
> > >> message M1.
> > >>
> > >> So coordinated checkpointing, where you can make sure that no messages
> > >> are currently in flight, would make it easier.
> > >>
> > >
> > > *OK, can we use coordinated or uncoordinated checkpointing with CRIU?*
> > >
> > OpenMPI
> >
> > Status: stalled
> >
> >    - Adrian Reber did <https://lisas.de/~adrian/open-mpi.git/> first
> >    version of patches
> >
> >   I see that this is your version. Is it coordinated or uncoordinated
> > checkpointing? Kind regards.
>
> Actually I do not remember how and what Open MPI does with regard to
> checkpointing. I am guessing it is uncoordinated.
>
> At this point in time, if you want to use CRIU and one of the MPI
> variants, you have to implement it yourself.
>
>                 Adrian
>
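
For reference, the in-flight message scenario quoted above can be sketched
as a small toy simulation. This is only an illustration under stated
assumptions: it uses neither CRIU nor Open MPI, and the names Channel,
run and coordinated are made up for this sketch. The "coordinated" case
simply drains the channel before the checkpoint is taken, which is one
possible way to make sure no message is in flight.

```python
from collections import deque


class Channel:
    """Toy one-way link from A to B; holds messages that are in flight."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, msg):
        self.in_flight.append(msg)

    def deliver_all(self, inbox):
        # Move every in-flight message into the receiver's inbox.
        while self.in_flight:
            inbox.append(self.in_flight.popleft())

    def drop_all(self):
        # Models the network discarding messages because the receiver is gone.
        self.in_flight.clear()


def run(coordinated):
    """Return True if side B holds M1 after being restored."""
    a_to_b = Channel()
    b_inbox = []

    a_to_b.send("M1")              # Side A sends M1 to side B

    if coordinated:
        # Hypothetical coordination step: drain the channel first, so the
        # checkpoint image of B already contains M1.
        a_to_b.deliver_all(b_inbox)

    checkpoint = list(b_inbox)     # Side B is checkpointed and stopped
    a_to_b.drop_all()              # Network discards whatever was in flight

    b_inbox = list(checkpoint)     # Side B is restored from the image
    a_to_b.deliver_all(b_inbox)    # Nothing is left to deliver if M1 was lost

    return "M1" in b_inbox


if __name__ == "__main__":
    print("uncoordinated: B has M1 ->", run(coordinated=False))
    print("coordinated:   B has M1 ->", run(coordinated=True))
```

In the uncoordinated run B never sees M1, so B would wait for M1 while A
waits for M2. In the coordinated run M1 is part of the checkpoint and
nothing has to be replayed.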