[CRIU] Issues with restoring multiple instances of the same source

Francisco Tolmasky tolmasky at gmail.com
Thu Sep 3 19:11:18 PDT 2015


Interesting. If I add the final || opts.final_state == TASK_STOPPED
to that line, how would I then “revive” the restored process? (I will
respond about the logs in a separate, longer email.)

On Thu, Sep 3, 2015 at 5:48 PM, Ruslan Kuprieiev <kupruser at gmail.com> wrote:

> Hi, Francisco,
>
> On 04.09.15 02:05, Francisco Tolmasky wrote:
>
> So I have been tracking a bug in tonic (related to this logging issue, and
> general “breaking” of pipes/streams), and I have narrowed part of the
> problem to the fact that we restore multiple containers simultaneously from
> the same source run. We do this to have them “warm” and ready in case the
> user wants to go back to a previous checkpoint. So something along these
> lines happens:
>
> Program is running -> Checkpoint -> immediate restore IN PARALLEL to
> original program/restore from previous checkpoint IN PARALLEL as well.
>
> So, you can end up with up to 3 copies of the same program running.
> Eventually that original one will die and we will choose one of the two
> “waiting” copies to pick up from.
>
> So, my first question is whether you would expect things to start breaking
> in this scenario. They seem to work most of the time, but we see
> occasional failures in the form of what look like stream breakages, or the
> program just getting “stuck” (I believe it gets stuck waiting on a pipe).
>
>
> Could you provide some logs, please?
>
> My second question is, if this is in fact not expected to work well, would
> it be possible to “Restore” a container but not “start” it? That is, load
> up the memory and get everything ready, but have it wait for a signal to
> actually kick off and get going. That way we can get most of the benefit of
> pre-warming these restores without having them all actually running at
> once.
>
>
> That's a great question. I've been thinking about implementing
> --leave-stopped for restore, but never got around to it. I've tried
> just adding || opts.final_state == TASK_STOPPED to
> https://github.com/xemul/criu/blob/master/cr-restore.c#L1715 and it seems
> to work just fine with a test loop, though I'm not sure that it will always
> work in more complicated scenarios.
>
> I also added this task to the TODO list [1].
>
> [1] http://criu.org/Todo
>
> Thanks,
>
> Francisco
>
> --
> Francisco Tolmasky
> www.tolmasky.com
> tolmasky at gmail.com
>
>
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
>
>
>
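
On the "revive" question: a process sitting in the stopped state is normally resumed by sending it SIGCONT. The sketch below is only an illustration under that assumption. It does not touch CRIU at all; a forked child that stops itself stands in for a restored-but-stopped task, and resume_task() and demo_stop_and_resume() are hypothetical helper names, not CRIU functions. Whether a tree restored with the one-line change above resumes cleanly this way has not been verified here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: resume one stopped task by sending it SIGCONT.
 * Returns 0 on success, -1 on error (as kill(2) does). */
int resume_task(pid_t pid)
{
    assert(pid > 0); /* avoid signalling a process group or every process */
    return kill(pid, SIGCONT);
}

/* Stand-in for a "restored but not started" task: the child stops itself,
 * the parent confirms the stop, resumes it, and collects its exit code.
 * Returns the child's exit code, or -1 on any failure. */
int demo_stop_and_resume(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        raise(SIGSTOP); /* child parks itself in the stopped state */
        _exit(42);      /* reached only after SIGCONT arrives */
    }

    /* WUNTRACED makes waitpid() report a stopped child, not only an exit. */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return -1;

    if (resume_task(pid) != 0)
        return -1;

    /* The child now runs to completion. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

For a restored process tree, every task in the tree would presumably need the same signal, so a real caller would loop over the PIDs rather than signal only the root.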


-- 
Francisco Tolmasky
www.tolmasky.com
tolmasky at gmail.com