[CRIU] Issues with restoring multiple instances of the same source

Christopher Covington cov at codeaurora.org
Tue Sep 8 08:24:24 PDT 2015


On 09/03/2015 08:48 PM, Ruslan Kuprieiev wrote:
> Hi, Francisco,
> 
> On 04.09.15 02:05, Francisco Tolmasky wrote:
>> So I have been tracking a bug in tonic (related to this logging issue, and
>> general “breaking” of pipes/streams), and I have narrowed part of the
>> problem to the fact that we restore multiple containers simultaneously from
>> the same source run. We do this to have them “warm” and ready in case the
>> user wants to go back to a previous checkpoint. So something along these
>> lines happens:
>>
>> Program is running -> Checkpoint -> immediate restore IN PARALLEL to
>> original program/restore from previous checkpoint IN PARALLEL as well.
>>
>> So, you can end up with up to 3 copies of the same program running.
>> Eventually that original one will die and we will choose one of the two
>> “waiting” copies to pick up from.
>>
>> So, my first question is whether you would expect things to start breaking
>> in this scenario. They seem to work most of the time, but we see occasional
>> failures over time, possibly in the form of stream breakages, or the program
>> just getting “stuck” (I believe it gets stuck waiting on a pipe).
>>
> 
> Could you provide some logs, please?
> 
>> My second question is: if this is in fact not expected to work well, would
>> it be possible to “restore” a container but not “start” it? That is, load up
>> the memory and get everything ready, but have it wait for a signal to actually
>> kick off and get going. That way we can get most of the benefit of pre-warming
>> these restores without having them all actually running at once.
>>
> 
> That's a great question. I've been thinking about implementing --leave-stopped
> for restore, but never actually got around to it. I've tried just adding ||
> opts.final_state == TASK_STOPPED to
> https://github.com/xemul/criu/blob/master/cr-restore.c#L1715 and it seems to
> work just fine with a test loop, though I'm not sure that it will always work
> in more complicated scenarios.
> 
> I've also added this task to the TODO list [1].
> 
> [1] http://criu.org/Todo

Why not just have the user `kill -SIGSTOP $pid` prior to dumping?
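
A minimal sketch of the stop/continue mechanics behind that suggestion,
assuming a Linux /proc filesystem. The `sleep 60` workload and the helper
names `proc_state` and `wait_for_state` are hypothetical, and no CRIU
command is invoked; whether a task stopped before the dump comes back
stopped after restore should be checked against the CRIU version in use.

```shell
#!/bin/sh
# Hypothetical stand-in for the real workload.
sleep 60 &
pid=$!

# Print the one-letter task state from /proc (T means stopped).
proc_state() {
    sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status"
}

# Poll until the task state matches the shell pattern in $2 (about 2s max).
wait_for_state() {
    i=0
    while [ "$i" -lt 20 ]; do
        case "$(proc_state "$1")" in
            $2) return 0 ;;
        esac
        sleep 0.1
        i=$((i + 1))
    done
    return 1
}

# Freeze the task; this is the point at which it would be dumped.
kill -STOP "$pid"
wait_for_state "$pid" 'T' && stopped_state=$(proc_state "$pid")
echo "after SIGSTOP: $stopped_state"

# Later, let the chosen copy run again.
kill -CONT "$pid"
wait_for_state "$pid" '[!T]' && resumed_state=$(proc_state "$pid")
echo "after SIGCONT: $resumed_state"

# Clean up the stand-in workload.
kill "$pid"
```

The idea is that each "warm" copy would sit in the stopped state until one
of them is picked and sent SIGCONT.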

Christopher Covington

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

