[CRIU] Issues with restoring multiple instances of the same source

Ruslan Kuprieiev kupruser at gmail.com
Thu Sep 3 19:17:48 PDT 2015


By sending SIGCONT, I guess.
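
To make that concrete, here is a minimal sketch in Python (Linux-only, and not CRIU code). It assumes the restored task was left in the stopped state and that you know its PID. The child process below is only a stand-in for a restored task, and the helper names (task_state, wait_for_state, resume, demo) are made up for illustration.

```python
import os
import signal
import subprocess
import sys
import time


def task_state(pid):
    """Return the one-letter state of a task as reported by /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        # The state letter is the first field after the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the task state is one of the letters in `wanted`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = task_state(pid)
        if state in wanted:
            return state
        time.sleep(0.01)
    raise TimeoutError(f"pid {pid} never reached state {wanted!r}")


def resume(pid):
    """Let a stopped task run again."""
    os.kill(pid, signal.SIGCONT)


def demo():
    # Stand-in for a restored task: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        # Emulate "restored but left stopped".
        os.kill(child.pid, signal.SIGSTOP)
        stopped = wait_for_state(child.pid, "T")
        # "Revive" it.
        resume(child.pid)
        running = wait_for_state(child.pid, "RSD")
        return stopped, running
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    print(demo())
```

In a real setup you would skip the SIGSTOP step and only call resume() on the PID of the copy you decide to continue from.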

On Sep 4, 2015 at 5:11 AM, "Francisco Tolmasky" <
tolmasky at gmail.com> wrote:
>
> Interesting, how would I then “revive” it if I add the opts.final_state ==
TASK_STOPPED check to that line? (Will respond about logs in another email,
longer answer).
>
> On Thu, Sep 3, 2015 at 5:48 PM, Ruslan Kuprieiev <kupruser at gmail.com>
wrote:
>>
>> Hi, Francisco,
>>
>> On 04.09.15 02:05, Francisco Tolmasky wrote:
>>>
>>> So I have been tracking a bug in tonic (related to this logging issue,
and general “breaking” of pipes/streams), and I have narrowed part of the
problem to the fact that we restore multiple containers simultaneously from
the same source run. We do this to have them “warm” and ready in case the
user wants to go back to a previous checkpoint. So something along these
lines happens:
>>>
>>> Program is running -> Checkpoint -> immediate restore IN PARALLEL to
original program/restore from previous checkpoint IN PARALLEL as well.
>>>
>>> So, you can end up with up to 3 copies of the same program running.
Eventually that original one will die and we will choose one of the two
“waiting” copies to pick up from.
>>>
>>> So, my first question is whether you would expect things to start
breaking in this scenario. They seem to work a lot of the time, but we see
occasional failures over time in the form of stream breakages, or of the
program just getting “stuck” (I believe it gets stuck waiting on a pipe).
>>>
>>
>> Could you provide some logs, please?
>>
>>> My second question is, if this is in fact not expected to work well,
would it be possible to “restore” a container but not “start” it? That is,
load up the memory and get everything ready, but have it wait for a signal to
actually kick off and get going. That way we can get most of the benefit of
pre-warming these restores without having them all actually running at
once.
>>>
>>
>> That's a great question. I've been thinking about implementing
--leave-stopped for restore, but never actually got around to it. I've tried
just adding || opts.final_state == TASK_STOPPED to
https://github.com/xemul/criu/blob/master/cr-restore.c#L1715 and it seems
to work just fine with a test loop, though I'm not sure that it will always
work in more complicated scenarios.
>>
>> I've also added this task to the TODO list [1].
>>
>> [1] http://criu.org/Todo
>>
>>> Thanks,
>>>
>>> Francisco
>>>
>>> --
>>> Francisco Tolmasky
>>> www.tolmasky.com
>>> tolmasky at gmail.com
>>>
>>>
>>> _______________________________________________
>>> CRIU mailing list
>>> CRIU at openvz.org
>>> https://lists.openvz.org/mailman/listinfo/criu
>>
>>
>
>
>
> --
> Francisco Tolmasky
> www.tolmasky.com
> tolmasky at gmail.com