<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2018-06-23 19:49 GMT+01:00 Thouraya TH <span dir="ltr">&lt;<a href="mailto:thouraya87@gmail.com" target="_blank">thouraya87@gmail.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">2018-06-23 18:44 GMT+01:00 Adrian Reber <span dir="ltr">&lt;<a href="mailto:adrian@lisas.de" target="_blank">adrian@lisas.de</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Sat, Jun 23, 2018 at 02:11:10PM +0100, Thouraya TH wrote:<br>

&gt; *MPI applications would need to be aware of the communication that is going<br>

&gt; on and try to restore that communication state after the process restore. *<br>

<span>&gt; <br>

&gt; This is about MPI library <a href="https://www.open-mpi.org/" rel="noreferrer" target="_blank">https://www.open-mpi.org/</a><br>

&gt; 1) Running HPC applications, in containers, is gaining significant interest<br>

&gt; due to the lightweight virtualisation of containers versus VMs (as far as I know).<br>

<br>

</span>Not sure if you are actually asking something here or not. The thing you<br>

need to be concerned about is communication messages which are in<br>

flight, especially if the underlying technology is reliable and does not<br>

lose any messages. This can lead to a situation like this<br>

(theoretically):<br>

<br>

 * Side A sends message M1 to side B<br>

 * Side B is checkpointed and stopped before receiving message M1<br>

 * Side A waits for an answer from side B (message M2)<br>

 * Network discards the message M1 as receiver on side B is gone<br>

 * Side B is restored and waits forever for message M1<br>

<br>

 * Side B now keeps waiting for message M1<br>

 * Side A keeps waiting for message M2<br>

<br>

And now both sides are waiting forever and you would need to replay<br>

message M1.<br>
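The scenario above can be sketched as a small in-memory simulation. This is only an illustration, not how MPI or CRIU behave internally: the `Side` class, the `send` helper and the `replay_lost_messages` switch are all hypothetical names made up for this example.<br>

```python
from collections import deque


class Side:
    """One communication endpoint with an inbox and a running/stopped flag."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.running = True


def send(receiver, message, dropped):
    """Deliver a message, or record it as dropped if the receiver is stopped."""
    if receiver.running:
        receiver.inbox.append(message)
    else:
        dropped.append(message)


def simulate(replay_lost_messages):
    """Return True if both sides end up waiting forever (deadlock)."""
    a, b = Side("A"), Side("B")
    dropped = []

    b.running = False          # B is checkpointed and stopped
    send(b, "M1", dropped)     # A sends M1; the network discards it
    b.running = True           # B is restored from the checkpoint

    if replay_lost_messages:
        # Hypothetical recovery step: re-send everything that was dropped.
        for message in dropped:
            send(b, message, [])

    # B only answers with M2 once it has received M1.
    if "M1" in b.inbox:
        send(a, "M2", [])

    # A is stuck unless M2 arrived; B is stuck unless M1 arrived.
    return "M2" not in a.inbox and "M1" not in b.inbox
```

Calling `simulate(False)` models the lost message and reports a deadlock, while `simulate(True)` replays M1 after the restore so that both sides can make progress.<br>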

<br>

So coordinated checkpointing, where you can make sure that no messages<br>

are currently in flight, would make it easier.<br></blockquote><div><br></div></span><div><b>OK, so can we use either coordinated or uncoordinated checkpointing with CRIU?</b> </div></div></div></div></blockquote><div>
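As a rough illustration of the coordinated idea (make sure nothing is in flight, then checkpoint), here is a minimal in-memory sketch. It assumes the senders have already been paused, and the `Channel` class and `coordinated_checkpoint` function are hypothetical names for this example, not part of CRIU or Open MPI.<br>

```python
from collections import deque


class Channel:
    """In-memory stand-in for a network link holding messages in flight."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, message):
        self.in_flight.append(message)

    def receive(self):
        return self.in_flight.popleft() if self.in_flight else None


def coordinated_checkpoint(channel, received):
    """Drain the channel first, then snapshot the receiver's state."""
    # Phase 1: senders are assumed to be paused, so the number of
    # in-flight messages can only shrink while we deliver them.
    while channel.in_flight:
        received.append(channel.receive())

    # Phase 2: nothing is in flight any more, so the snapshot cannot
    # miss a message that the sender believes was delivered.
    return {"received": list(received), "in_flight": len(channel.in_flight)}
```

For example, after `channel.send("M1")`, calling `coordinated_checkpoint(channel, [])` returns a snapshot that already contains M1 and reports zero messages in flight.<br>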

<h3><span class="gmail-mw-headline" id="gmail-OpenMPI">OpenMPI</span></h3><p><span style="padding:5px;font-size:120%;border-left:1em solid brown">Status: stalled</span>

</p>

<ul><li> Adrian Reber <a rel="nofollow" class="external gmail-text" href="https://lisas.de/~adrian/open-mpi.git/">did</a> first version of patches</li></ul> 


I see that this is your version. Is it coordinated or uncoordinated checkpointing? Kind regards. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> <br></div><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span><br>

&gt; And what about web applications (web client, MySQL server application in<br>

&gt; an LXC container, Tomcat web server in a container)? There is<br>

&gt; communication there as well.<br>

&gt; 2) If I would like to save snapshots of this application using CRIU,<br>

&gt; do I have to restore that communication state after the process<br>

&gt; restore?<br>

<br>

</span>The difference here is that CRIU has support for TCP/UDP communication<br>

and can restore the connection state of the sockets. For TCP the problem<br>

described above does not really exist as the protocol has mechanisms to<br>

re-transmit messages (if the downtime is short and no timeouts have been<br>

triggered). With UDP you could probably end up in a situation like the<br>

one described above, but the application writer should expect that UDP<br>

messages can get lost and handle it appropriately.<br></blockquote><div><br></div></span><div>OK.  <br></div><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
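As a hedged sketch of what this looks like in practice: CRIU's `--tcp-established` option asks it to checkpoint and restore established TCP connections. The helper functions, the PID and the image directory below are hypothetical, and depending on the process more options may be needed.<br>

```python
def criu_dump_command(pid, images_dir):
    """Build a `criu dump` command line that also saves established TCP connections."""
    return ["criu", "dump", "-t", str(pid), "-D", images_dir, "--tcp-established"]


def criu_restore_command(images_dir):
    """Build the matching `criu restore` command line for the same image directory."""
    return ["criu", "restore", "-D", images_dir, "--tcp-established"]
```

The resulting lists can be passed to `subprocess.run(..., check=True)`, typically as root, for example `subprocess.run(criu_dump_command(1234, "/tmp/ckpt"), check=True)` with a made-up PID and path.<br>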

<span><br>

&gt; 3) I would also like to ask whether checkpoint/restore is useful for this kind of application?<br>

<br>

</span>See above. Yes.<br>

<br>

Especially in HPC you need to know what you want. Do you want fault<br>

tolerance, restoring an application from a periodically taken<br>

checkpoint? Or do you want to be able to migrate applications/containers<br>

for other reasons like load balancing? Depending on the use case you<br>

need different strategies for the communication on the wire and it<br>

also depends on the protocols used.<br></blockquote><div><br></div></span><div> Yes, I 

 want fault tolerance and to restore an application from a periodically taken<br> 

checkpoint (for applications that communicate, such as MPI). <br></div><div><br></div><div>Kind regards. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class="m_-7776601518110494961HOEnZb"><font color="#888888"><br>

                Adrian<br>

</font></span></blockquote></div><br></div></div>

</blockquote></div><br></div></div>