<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2018-06-23 19:49 GMT+01:00 Thouraya TH <span dir="ltr"><<a href="mailto:thouraya87@gmail.com" target="_blank">thouraya87@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">2018-06-23 18:44 GMT+01:00 Adrian Reber <span dir="ltr"><<a href="mailto:adrian@lisas.de" target="_blank">adrian@lisas.de</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:<br>
> *MPI applications would need to be aware of the communication that is going<br>
> on and try to restore that communication state after the process restore. *<br>
<span>> <br>
> This is about MPI library <a href="https://www.open-mpi.org/" rel="noreferrer" target="_blank">https://www.open-mpi.org/</a><br>
> 1) Running HPC applications in containers is gaining significant interest<br>
> due to the lightweight virtualisation of containers versus VMs (as far as I know).<br>
<br>
</span>Not sure if you are actually asking something here or not. The thing you<br>
need to be concerned about is messages that are still in flight,<br>
especially if the underlying technology is expected to be reliable and not to<br>
lose any messages. This can lead to a situation like this<br>
(theoretically):<br>
<br>
* Side A sends message M1 to side B<br>
* Side B is checkpointed and stopped before receiving message M1<br>
* Side A waits for an answer from side B (message M2)<br>
* Network discards the message M1 as receiver on side B is gone<br>
* Side B is restored and waits forever for message M1<br>
<br>
* Side B now keeps waiting for message M1<br>
* Side A keeps waiting for message M2<br>
<br>
And now both sides are waiting forever, and you would need to replay<br>
message M1.<br>
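<br>
The scenario above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the Network class, run_scenario and the replay_after_restore flag are hypothetical names invented for this sketch, and it does not model MPI, CRIU or any real transport.<br>
<pre>
```python
class Network:
    """Toy transport: only attached endpoints have an inbox (hypothetical model)."""

    def __init__(self):
        self.inboxes = {}

    def attach(self, name):
        # A freshly attached (or restored) endpoint starts with an empty inbox.
        self.inboxes[name] = []

    def detach(self, name):
        # The receiver is gone, so anything still queued for it is discarded.
        self.inboxes.pop(name, None)

    def send(self, dst, msg):
        if dst in self.inboxes:
            self.inboxes[dst].append(msg)

    def receive(self, name):
        box = self.inboxes.get(name)
        return box.pop(0) if box else None


def run_scenario(replay_after_restore):
    net = Network()
    net.attach("A")
    net.attach("B")

    net.send("B", "M1")   # side A sends M1; it is now in flight
    net.detach("B")       # side B is checkpointed and stopped; M1 is discarded
    net.attach("B")       # side B is restored and expects M1

    if replay_after_restore:
        net.send("B", "M1")   # assumed sender-side replay of the lost message

    # Side B only answers with M2 once it has seen M1.
    if net.receive("B") == "M1":
        net.send("A", "M2")

    # Side A is waiting for M2.
    return "completed" if net.receive("A") == "M2" else "deadlock"


print(run_scenario(replay_after_restore=False))  # prints "deadlock"
print(run_scenario(replay_after_restore=True))   # prints "completed"
```
</pre>
Without the replay both sides stay blocked; with M1 replayed after the restore the exchange completes.<br>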
<br>
So coordinated checkpointing, where you can make sure that no messages<br>
are currently in flight, would make it easier.<br></blockquote><div><br></div></span><div><b>OK, so can we use either coordinated or uncoordinated checkpointing with CRIU?</b> </div></div></div></div></blockquote><div>
<h3><span class="gmail-mw-headline" id="gmail-OpenMPI">OpenMPI</span></h3><p><span style="padding:5px;font-size:120%;border-left:1em solid brown">Status: stalled</span>
</p>
<ul><li> Adrian Reber <a rel="nofollow" class="external gmail-text" href="https://lisas.de/~adrian/open-mpi.git/">did</a> first version of patches</li></ul>
I see that this is your version. Is it coordinated or uncoordinated checkpointing? Kind regards. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> <br></div><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span><br>
> And what about web applications (web client, MySQL server application in<br>
> an LXC container, Tomcat web server in a container)? There is<br>
> communication there as well.<br>
> 2) If I would like to save snapshots of this application using CRIU,<br>
> do I have to restore that communication state after the process<br>
> restore?<br>
<br>
</span>The difference here is that CRIU has support for TCP/UDP communication<br>
and can restore the connection state of the sockets. For TCP the problem<br>
described above does not really exist as the protocol has mechanisms to<br>
re-transmit messages (if the downtime is short and no timeouts have been<br>
triggered). With UDP you could probably end up in a situation like the one<br>
described above, but the application writer should expect that UDP<br>
messages can get lost and handle it appropriately.<br></blockquote><div><br></div></span><div>OK. <br></div><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span><br>
> 3) Is checkpoint/restore also useful for this kind of application?<br>
<br>
</span>See above. Yes.<br>
<br>
Especially in HPC you need to know what you want. Do you want fault<br>
tolerance, restoring an application from a periodically taken<br>
checkpoint? Or do you want to be able to migrate applications/containers<br>
for other reasons, like load balancing? Depending on the use case you<br>
need different strategies for the communication on the wire, and it<br>
also depends on the protocols used.<br></blockquote><div><br></div></span><div> Yes, I
want fault tolerance, restoring an application from a periodically taken<br>
checkpoint (an application that communicates, for example via MPI). <br></div><div><br></div><div>Kind regards. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class="m_-7776601518110494961HOEnZb"><font color="#888888"><br>
Adrian<br>
</font></span></blockquote></div><br></div></div>
</blockquote></div><br></div></div>