<div dir="ltr">Hi Eric,<div><br></div><div>Let's consider a very simple case in the Java world:</div><div><br></div><div>```</div><div>Thread.sleep(1);</div><div>```</div><div><br></div><div>The sleep call may hang if the machine the process is restored on has a smaller uptime than the source machine: in my test with two machines, the call returned only after twenty minutes. Moreover, blocking methods that take a timeout and scheduled thread pools also won't work as expected.</div><div><br></div><div>Regards,</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jun 2, 2018 at 2:46 PM, Andrei Vagin <span dir="ltr"><<a href="mailto:avagin@virtuozzo.com" target="_blank">avagin@virtuozzo.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Fri, Jun 01, 2018 at 10:12:03PM -0500, Eric W. Biederman wrote:<br>
> Andrei Vagin <<a href="mailto:avagin@virtuozzo.com">avagin@virtuozzo.com</a>> writes:<br>
> <br>
> > On Fri, Jun 01, 2018 at 01:20:33PM -0500, Eric W. Biederman wrote:<br>
> >> Adrian Reber <<a href="mailto:adrian@lisas.de">adrian@lisas.de</a>> writes:<br>
> >> <br>
> >> > On Fri, Jun 01, 2018 at 11:04:26AM +0800, yukon wrote:<br>
> >> >> I found that the CRIU community intends to resolve the timer issue[1]. I<br>
> >> >> wonder if there is an issue to<br>
> >> >> track the progress?<br>
> >> ><br>
> >> > I have heard of other people experimenting with it and I also had a few<br>
> >> > patches to try it out. The point where I stopped is when I found out<br>
> >> > that most time calls are actually coming from the VDSO and not from the<br>
> >> > kernel and it is still unclear to me how to handle namespaces and VDSO<br>
> >> > correctly.<br>
> >> ><br>
> >> > I have also talked with Christian (on CC) about it and I also contacted<br>
> >> > Eric at some point (also on CC). Maybe they have more information about<br>
> >> > the current status.<br>
> >> <br>
> >> Adrian. My apologies for not getting back to you earlier (I was<br>
> >> swamped) but that is not a good excuse. I was very impressed by what<br>
> >> you did.<br>
> >> <br>
> >> For me personally I have been looking for a real world case where the<br>
> >> timers matter. Having that would increase the priority of this work<br>
> >> from where I stand.<br>
> >> <br>
> >> To date all I have done is recognize that a time namespace is almost<br>
> >> certainly something that we need, and read the code enough to have a<br>
> >> general sense of how the time infrastructure in the kernel works.<br>
> >> <br>
> >> I think the VDSO has per-CPU, if not per-process, constants, so we should<br>
> >> be able to affect this in a namespace. If the VDSO does not, we<br>
> >> certainly can make that happen.<br>
> >> <br>
> >> I would be very happy to merge a time namespace. I would probably even<br>
> >> start looking at implementation details if I had a compelling test case<br>
> >> in my hand.<br>
> >> <br>
> >> Yukon. I don't have the beginning of this thread. So if you know of a<br>
> >> practical case that does not work because of timers I would love to hear<br>
> >> about it.<br>
> ><br>
> > Hi Eric,<br>
> ><br>
> > We have a practical case. A few CRIU users have reported situations to us where<br>
> > applications stop working after being migrated to another host.<br>
> ><br>
> > Usually this means that they use clock_gettime or timer_settime. The<br>
> > problem here is that we can't adjust clocks on a destination host to<br>
> > their values on a source host. For example, the application uses<br>
> > CLOCK_MONOTONIC to measure time slices, but after migrating to another<br>
> > host, clock_gettime(CLOCK_MONOTONIC) may return a value which is smaller<br>
> > than the one read on the source host. The application doesn't expect<br>
> > such behaviour for CLOCK_MONOTONIC, and it will probably<br>
> > misbehave (hang, crash, etc.).<br>
> ><br>
> > Here is one quote from the CRIU mailing list:<br>
> ><br>
> > Is there a timeline on when the time namespace might be implemented? Or<br>
> > else is there anyone, even outside CRIU, working on it that you guys<br>
> > know of? It seems like this might be one of the last major obstacles<br>
> > keeping migration from being used in production systems, given that not<br>
> > all containers and connections can be migrated as long as a time<br>
> > dependency is capable of messing it up.<br>
> > <a href="https://github.com/checkpoint-restore/criu/issues/451#issuecomment-386073812" rel="noreferrer" target="_blank">https://github.com/checkpoint-<wbr>restore/criu/issues/451#<wbr>issuecomment-386073812</a><br>
> <br>
> <br>
> Is there an open source application that is known to fail that way?<br>
<br>
</div></div>After a quick search of the CRIU GitHub issues, I found two projects:<br>
<br>
RabbitMQ aborted with the "OS monotonic time stepped backwards<br>
Aborted" error.<br>
<br>
<a href="https://github.com/checkpoint-restore/criu/issues/426" rel="noreferrer" target="_blank">https://github.com/checkpoint-<wbr>restore/criu/issues/426</a><br>
<br>
It looks like any program which is written in Erlang has this issue:<br>
<a href="https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_time_sup.c#L299" rel="noreferrer" target="_blank">https://github.com/erlang/otp/<wbr>blob/master/erts/emulator/<wbr>beam/erl_time_sup.c#L299</a><br>
<br>
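To illustrate the failure mode, here is a minimal Java sketch of this kind of guard.<br>
It is not the Erlang code linked above: the MonotonicGuard class and its accept<br>
method are hypothetical names made up for this example. On Linux, System.nanoTime()<br>
is typically backed by CLOCK_MONOTONIC, so a program with a similar check would<br>
presumably trip after being restored on a host with a smaller uptime. The restore is<br>
only simulated here by subtracting five seconds from the first reading.<br>
<br>
<pre>
```java
public class MonotonicGuard {
    private boolean seen = false;
    private long last = 0L;

    /** Returns false when the reading is behind the last accepted one. */
    public boolean accept(long nowNanos) {
        boolean ok = true;
        if (seen) {
            // Compare by difference, as the System.nanoTime() docs recommend.
            ok = nowNanos - last >= 0;
        }
        if (ok) {
            // Only accepted readings move the reference point forward.
            last = nowNanos;
        }
        seen = true;
        return ok;
    }

    public static void main(String[] args) {
        MonotonicGuard guard = new MonotonicGuard();

        // Reading taken before the (simulated) checkpoint.
        long before = System.nanoTime();
        guard.accept(before);

        // Simulated reading after restore on a host whose clock is behind.
        long after = before - 5_000_000_000L;
        if (!guard.accept(after)) {
            // A strict runtime would abort here instead of just logging.
            System.out.println("monotonic time stepped backwards");
        }
    }
}
```
</pre>
An application that aborts at this point, rather than just logging, behaves like the<br>
RabbitMQ report above.<br>
<br>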
OracleDB kills itself after C/R<br>
<br>
"Once Oracle monitor process sees that the process start time changes, it<br>
takes a dagger out and makes seppuku."<br>
<br>
<a href="https://medium.com/@kolyshkin/oracle-in-a-docker-container-checkpoint-restore-debug-fun-dda98b7302ed" rel="noreferrer" target="_blank">https://medium.com/@kolyshkin/<wbr>oracle-in-a-docker-container-<wbr>checkpoint-restore-debug-fun-<wbr>dda98b7302ed</a><br>
<span class=""><br>
<br>
> <br>
> I completely believe the issue is real. But it really helps to have<br>
> motivating applications so that some corner case is not skipped.<br>
> <br>
> I will have to look at TCP timestamps and see how those interact<br>
> with the kernel's timers, to see if that is a time namespace issue.<br>
<br>
</span>Each TCP socket has a timestamp offset, and CRIU sets it on restore.<br>
<br>
commit 93be6ce0e91b6a94783e012b1857a3<wbr>47a5e6e9f2<br>
Author: Andrey Vagin <<a href="mailto:avagin@openvz.org">avagin@openvz.org</a>><br>
Date: Mon Feb 11 05:50:18 2013 +0000<br>
<br>
tcp: set and get per-socket timestamp<br>
<br>
A timestamp can be set, only if a socket is in the repair mode.<br>
<br>
This patch adds a new socket option TCP_TIMESTAMP, which allows to<br>
get and set current tcp times stamp.<br>
<div class="HOEnZb"><div class="h5"><br>
> <br>
> Eric<br>
______________________________<wbr>_________________<br>
CRIU mailing list<br>
<a href="mailto:CRIU@openvz.org">CRIU@openvz.org</a><br>
<a href="https://lists.openvz.org/mailman/listinfo/criu" rel="noreferrer" target="_blank">https://lists.openvz.org/<wbr>mailman/listinfo/criu</a><br>
</div></div></blockquote></div><br></div>