[CRIU] Question: Tightly coupled applications

Adrian Reber adrian at lisas.de
Sun Jun 24 19:28:52 MSK 2018


On Sun, Jun 24, 2018 at 02:05:46PM +0100, Thouraya TH wrote:
> 2018-06-23 19:49 GMT+01:00 Thouraya TH <thouraya87 at gmail.com>:
> 
> > 2018-06-23 18:44 GMT+01:00 Adrian Reber <adrian at lisas.de>:
> >
> >> On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
> >> > *MPI applications would have to be aware of the communication that is
> >> > going on and try to restore that communication state after the process
> >> > restore.*
> >> >
> >> > This is about the MPI library https://www.open-mpi.org/
> >> > 1) Running HPC applications in containers is gaining significant
> >> > interest due to the lightweight virtualisation of containers versus
> >> > VMs (as far as I know).
> >>
> >> Not sure if you are actually asking something here or not. The thing you
> >> need to be concerned about is communication messages which are in
> >> flight, especially if the underlying technology is reliable and does not
> >> lose any messages. This can lead to a situation like this
> >> (theoretically):
> >>
> >>  * Side A sends message M1 to side B
> >>  * Side B is checkpointed and stopped before receiving message M1
> >>  * Side A waits for an answer from side B (message M2)
> >>  * Network discards the message M1 as receiver on side B is gone
> >>  * Side B is restored and waits forever for message M1
> >>
> >>  * Side B now keeps waiting for message M1
> >>  * Side A keeps waiting for message M2
> >>
> >> And now both sides are waiting forever and you would need to replay
> >> message M1.
> >>
> >> So coordinated checkpointing, where you can make sure that no messages
> >> are currently in flight, would make it easier.
> >>
> >
> > *OK, can we use coordinated or uncoordinated checkpointing using CRIU?*
> >
> OpenMPI
> 
> Status: stalled
> 
>    - Adrian Reber did <https://lisas.de/~adrian/open-mpi.git/> first
>    version of patches
> 
>   I see that this is your version. Is it coordinated or uncoordinated
> checkpointing? Kind regards.

Actually I do not remember how and what Open MPI does with regard to
checkpointing. I am guessing it is uncoordinated.

At this point in time, if you want to use CRIU and one of the MPI
variants, you have to implement it yourself.
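
To make the lost-message scenario quoted above a bit more concrete, here
is a minimal toy model in Python. It is only a sketch: the names Side,
Network and run() are made up for illustration, nothing in it talks to
CRIU or to an MPI library, and "checkpoint" is just a copy of side B's
inbox. It assumes the network silently drops a message whose receiver is
stopped, as in the list above.

```python
from collections import deque
import copy


class Side:
    """One communicating process (hypothetical stand-in for an MPI rank)."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.alive = True


class Network:
    """Toy network: messages stay in flight until deliver() is called."""

    def __init__(self):
        self.in_flight = []  # (receiver, message) pairs not yet delivered

    def send(self, receiver, message):
        self.in_flight.append((receiver, message))

    def deliver(self):
        # Hand over every in-flight message; drop it if the receiver is gone.
        for receiver, message in self.in_flight:
            if receiver.alive:
                receiver.inbox.append(message)
        self.in_flight.clear()


def run(coordinated):
    """Return True if both sides end up waiting forever."""
    net = Network()
    a, b = Side("A"), Side("B")

    net.send(b, "M1")                  # side A sends M1 to side B

    if coordinated:
        net.deliver()                  # drain the channel before the checkpoint

    snapshot = copy.deepcopy(b.inbox)  # "checkpoint" side B
    b.alive = False                    # side B is stopped

    net.deliver()                      # M1 is discarded if it was still in flight

    b.inbox = copy.deepcopy(snapshot)  # "restore" side B
    b.alive = True

    if "M1" in b.inbox:                # B answers with M2 only after seeing M1
        net.send(a, "M2")
    net.deliver()

    b_blocked = "M1" not in b.inbox    # B still waits for M1
    a_blocked = "M2" not in a.inbox    # A still waits for M2
    return a_blocked and b_blocked


if __name__ == "__main__":
    print("uncoordinated deadlocks:", run(coordinated=False))
    print("coordinated deadlocks:  ", run(coordinated=True))
```

The only difference between the two runs is whether the channel is
drained before side B is checkpointed. That drain step is what a
coordinated checkpoint would have to guarantee in a real setup.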

		Adrian

