[CRIU] Question: Tightly coupled applications

Adrian Reber adrian at lisas.de
Sat Jun 23 13:35:07 MSK 2018


On Sat, Jun 23, 2018 at 10:19:13AM +0100, Thouraya TH wrote:
> I have a question about tightly coupled applications and their
> checkpointing:
> https://dl.acm.org/citation.cfm?id=568525
> 
> As far as I know, for this kind of application I have to record the state
> of the communication channels and the state of each process.
> Following a failure, I have to find a coherent state to restart from
> (with a coordinated or an uncoordinated protocol).
> 
> Is there already a solution you have proposed to achieve that?

No, there is nothing I know of. The whole MPI/HPC part of
checkpoint/restore with CRIU has not seen much development in recent
years.

One way to use CRIU in distributed MPI applications would be to be aware
of the communication that is going on and try to restore that
communication state after the process restore.
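
As a rough illustration of what "being aware of the communication" could
mean, here is a minimal sketch in plain Python. It is not part of CRIU and
not an MPI API: the class LoggedChannel and all of its methods are
hypothetical names made up for this example. The sender keeps every message
that has not been acknowledged yet, saves that log next to the process
checkpoint, and resends it after the restore.

```python
from collections import deque


class LoggedChannel:
    """Hypothetical sender side of a channel that logs unacknowledged messages."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = deque()  # (seq, payload) pairs sent but not yet acknowledged

    def send(self, payload):
        # Tag each message with a sequence number and remember it until acked.
        msg = (self.next_seq, payload)
        self.unacked.append(msg)
        self.next_seq += 1
        return msg

    def ack(self, seq):
        # The receiver confirmed everything up to and including seq.
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()

    def snapshot(self):
        # Channel state to store alongside the process checkpoint.
        return {"next_seq": self.next_seq, "unacked": list(self.unacked)}

    @classmethod
    def restore(cls, state):
        # Rebuild the channel from a snapshot after the process restore.
        channel = cls()
        channel.next_seq = state["next_seq"]
        channel.unacked = deque(state["unacked"])
        return channel

    def replay(self):
        # Messages that may have been lost in flight and must be resent.
        return list(self.unacked)
```

The receiver would have to drop duplicates by sequence number, since a
replayed message may already have arrived before the checkpoint was taken.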

Another way to use CRIU in MPI applications is to make sure that all
communication has been quiesced before the actual checkpoint/restore.
This probably does not work for fault tolerance.
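
The quiesce-first idea can be sketched as a small simulation, again in plain
Python rather than real MPI. Threads stand in for ranks and queues stand in
for channels; run_quiesced_checkpoint and its checkpoint callback are
hypothetical names for this example, and a real setup would call CRIU on
each process at the point where the callback runs.

```python
import queue
import threading


def run_quiesced_checkpoint(num_ranks, messages_per_rank, checkpoint):
    """Simulate ranks that drain all in-flight messages before a checkpoint."""
    inboxes = [queue.Queue() for _ in range(num_ranks)]
    barrier = threading.Barrier(num_ranks)
    received = [0] * num_ranks
    in_flight_at_checkpoint = []

    def rank_main(rank):
        # Phase 1: normal communication, each rank sends to its right neighbour.
        peer = (rank + 1) % num_ranks
        for i in range(messages_per_rank):
            inboxes[peer].put((rank, i))

        # Phase 2: stop sending and wait until every rank has stopped too.
        barrier.wait()

        # Phase 3: drain the local inbox; nobody sends any more, so an
        # empty inbox means this rank's incoming channel is quiet.
        while True:
            try:
                inboxes[rank].get_nowait()
            except queue.Empty:
                break
            received[rank] += 1

        # Phase 4: once all ranks have drained, one of them takes the checkpoint.
        leader = barrier.wait() == 0
        try:
            if leader:
                in_flight_at_checkpoint.append(sum(q.qsize() for q in inboxes))
                checkpoint()
        finally:
            # Hold every rank until the checkpoint has finished.
            barrier.wait()

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received, in_flight_at_checkpoint
```

Because no message is in flight when the callback runs, only the process
state has to be saved and no channel state needs to be recorded.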

		Adrian
