[CRIU] Question: Tightly coupled applications

Thouraya TH thouraya87 at gmail.com
Sun Jun 24 16:05:46 MSK 2018


2018-06-23 19:49 GMT+01:00 Thouraya TH <thouraya87 at gmail.com>:

>
>
> 2018-06-23 18:44 GMT+01:00 Adrian Reber <adrian at lisas.de>:
>
>> On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:
>> > *MPI applications would need to be aware of the communication that is
>> > going on and try to restore that communication state after the process
>> > restore.*
>> >
>> > This is about the MPI library Open MPI (https://www.open-mpi.org/).
>> > 1) Running HPC applications in containers is gaining significant
>> > interest because containers are a more lightweight form of
>> > virtualisation than VMs (as far as I know).
>>
>> Not sure if you are actually asking something here or not. The thing you
>> need to be concerned about is messages that are in flight, especially
>> if the application relies on the underlying transport being reliable
>> and never losing messages. This can lead to a situation like this
>> (theoretically):
>>
>>  * Side A sends message M1 to side B
>>  * Side B is checkpointed and stopped before receiving message M1
>>  * Side A waits for an answer from side B (message M2)
>>  * Network discards message M1 as the receiver on side B is gone
>>  * Side B is restored and waits forever for message M1
>>
>>  * Side B now keeps waiting for message M1
>>  * Side A keeps waiting for message M2
>>
>> And now both sides are waiting forever and you would need to replay
>> message M1.
>>
>> So coordinated checkpointing, where you can make sure that no messages
>> are currently in flight, would make it easier.
>>
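The scenario above can be sketched as a small simulation. This is only an illustrative model, not how CRIU or Open MPI implement checkpointing: `Channel`, `uncoordinated_checkpoint`, `coordinated_checkpoint` and `run` are hypothetical names, and the "network" is just an in-memory queue. It assumes that a checkpoint of side B captures B's received messages but not anything still in flight.

```python
from collections import deque


class Channel:
    """Hypothetical in-memory stand-in for the link from side A to side B."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, msg):
        self.in_flight.append(msg)

    def deliver_all(self, inbox):
        # Hand every in-flight message to the receiver.
        while self.in_flight:
            inbox.append(self.in_flight.popleft())

    def drop_all(self):
        # Models the network discarding messages whose receiver is gone.
        self.in_flight.clear()


def uncoordinated_checkpoint(channel, inbox_b):
    """Snapshot side B right away; in-flight messages are not in the snapshot."""
    snapshot = list(inbox_b)
    channel.drop_all()  # B is stopped, so the network discards M1
    return snapshot


def coordinated_checkpoint(channel, inbox_b):
    """Drain the channel first so nothing is in flight when B is snapshotted."""
    channel.deliver_all(inbox_b)
    return list(inbox_b)


def run(checkpoint):
    """Side A sends M1, side B is checkpointed; report whether B still has M1."""
    channel, inbox_b = Channel(), []
    channel.send("M1")
    restored_inbox = checkpoint(channel, inbox_b)
    return "M1" in restored_inbox


if __name__ == "__main__":
    print("uncoordinated keeps M1:", run(uncoordinated_checkpoint))
    print("coordinated keeps M1:  ", run(coordinated_checkpoint))
```

In the uncoordinated case the restored side B never sees M1 and would wait forever; in the coordinated case the channel is empty at checkpoint time, so nothing can be lost. A real coordinated protocol would also have to stop the senders while the channel is drained, which this sketch leaves out.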
>
> *OK, can we use coordinated or uncoordinated checkpointing with CRIU?*
>
OpenMPI

Status: stalled

   - Adrian Reber did <https://lisas.de/~adrian/open-mpi.git/> first
   version of patches

I see that this is your version. Does it use coordinated or uncoordinated
checkpointing? Kind regards.

>
>
>> > And what about web applications (a web client, a MySQL server in an
>> > LXC container, and a Tomcat web server in a container)? There is
>> > communication there as well.
>> > 2) If I would like to save snapshots of this application using CRIU,
>> > do I have to restore that communication state after the process
>> > restore?
>>
>> The difference here is that CRIU has support for TCP/UDP communication
>> and can restore the connection state of the sockets. For TCP the problem
>> described above does not really exist as the protocol has mechanisms to
>> re-transmit messages (if the downtime is short and no timeouts have been
>> triggered). With UDP you could probably end up in a situation like the
>> one described above, but the application writer should expect that UDP
>> messages can get lost and handle it appropriately.
>>
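As an illustration of "handle it appropriately" for UDP-style messaging, here is a minimal application-level retry sketch. It is an assumption about what such handling could look like, not something CRIU provides: `LossyLink` and `request_with_retry` are hypothetical names, and the link is simulated in memory. With a real socket, `receive` would be a `recvfrom` call with a timeout that returns `None` when the timeout expires.

```python
class LossyLink:
    """Hypothetical link that silently drops the first `drops` datagrams."""

    def __init__(self, drops):
        self.drops = drops
        self.pending = None

    def send(self, payload):
        if self.drops > 0:
            # The datagram is lost, e.g. because the receiver was checkpointed.
            self.drops -= 1
            self.pending = None
        else:
            # The peer received the datagram and answers it.
            self.pending = b"ack:" + payload

    def receive(self):
        # Return the pending reply, or None to model a receive timeout.
        reply, self.pending = self.pending, None
        return reply


def request_with_retry(send, receive, payload, retries=3):
    """Resend `payload` until a reply arrives or `retries` attempts are used."""
    for attempt in range(1, retries + 1):
        send(payload)
        reply = receive()
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no reply after {retries} attempts")


if __name__ == "__main__":
    link = LossyLink(drops=1)
    print(request_with_retry(link.send, link.receive, b"M1"))
```

The sender replays M1 itself instead of waiting forever, which is the behaviour TCP gives you for free through retransmission as long as no timeout has fired.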
>
> OK.
>
>>
>> > 3) I would also like to ask whether checkpoint/restore is useful for
>> > this kind of application?
>>
>> See above. Yes.
>>
>> Especially in HPC you need to know what you want. Do you want fault
>> tolerance, restoring an application from a periodically taken
>> checkpoint? Or do you want to be able to migrate applications/containers
>> for other reasons like load balancing? Depending on the use case you
>> need different strategies for the communication on the wire, and it
>> also depends on the protocols used.
>>
>
> Yes, I want fault tolerance, restoring an application from a
> periodically taken checkpoint (an application that communicates
> using MPI).
>
> Kind regards.
>
>>
>>                 Adrian
>>
>
>