[CRIU] Reg unix stream checkpointing and other issues.

Fri Oct 24 09:14:07 PDT 2014

On 10/24/2014 05:51 PM, Sanidhya Kashyap wrote:
> Hi Pavel,
> 
> On Fri, Sep 12, 2014 at 12:33 PM, Pavel Emelyanov <xemul at parallels.com> wrote:
> 
>>>> We can do incremental restore, but it's quite tricky. The process
>>>> of migration would look like this then.
>>>>
>>>> 1. Get the process tree and their memory
>>>> 2. Go on restore node, fork tasks and put the memory in places
>>>> 3. Go back on source node, get the tree and changed memory
>>>> 4. Go on restore node, fixup the tree by killing died tasks
>>>>    and forking the appeared ones, then update their memory
>>>> 5. Repeat steps 3 and 4 some more times
>>>>
>>>> The trickiest part is step #4. I have no nice algorithm for "fixup the tree"
>>>> step of it. Tuning up changed memory is more or less clear how to do.
>>>>
>>>
>>> So, in order to do this, do we need to get some support from the
>>> kernel or criu will be able to manage it?
>>
>> CRIU can manage it, but the algo would be quite tricky :)
>>
> 
> I have been thinking of working on the incremental restore, and would
> like to contribute patches that I develop.

That's awesome!

> But I have some questions before I decide on the approach, in
> particular about the one that you have mentioned (above). I wanted
> to discuss it in detail, as I have already browsed the code.
> 
> - About step 2, you have mentioned forking tasks and dumping memory
> in places. Should all the processes be forked, or only a subset of
> them?

It depends on the algorithm we develop. Maybe it would be enough to
pre-fork only the tasks that hold most of the memory. Nonetheless,
the implementation should work on any tree -- partial or full.

> - Suppose that I fork a subset of processes and they try to access
> memory that is shared with some other task that has not been forked
> yet. What will happen in that case?

Tasks in the pre-restored tree shouldn't access any memory: they are
frozen and controlled by CRIU, waiting for the final restore to happen.

Probably you're talking about migrating the tree not as a whole, but
task by task. That is a different problem from pre-restore.

> - Is there a possibility of having a memory that is not present for
> the task? If yes, then how will that be handled?

Right now no, but there's work being done by Andrea Arcangeli on the
userfaultfd and memcopy system calls. I plan to write him an e-mail
about extending this API to fit our needs.

> - IMO, the tree fixup can be done by maintaining the whole process
> structure and looking at the difference between the old tree and
> the current one. Btw, how can a task die if it has not started yet?

I don't understand the issue. If a task is present in the pre-restored
tree but has died on the source node, we should just kill it on the
destination.
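
A rough sketch of the tree diff, in Python rather than CRIU's C, only to
illustrate the idea. It assumes each tree is a plain "pid -> parent pid"
map; fixup_plan and depth are made-up names, not anything from the CRIU
code. It ignores re-parenting and pid reuse, which is where the real
algorithm gets tricky.

```python
def depth(pid, tree):
    """Number of ancestors of pid in a pid -> parent-pid map."""
    d = 0
    while tree[pid] is not None:
        pid = tree[pid]
        d += 1
    return d


def fixup_plan(pre_restored, source):
    """Compare the pre-restored tree with the current source tree.

    Both arguments map pid -> parent pid (None for the root).
    Returns (to_kill, to_fork): tasks that died on the source and
    tasks that appeared there since the previous iteration.
    """
    to_kill = sorted(pid for pid in pre_restored if pid not in source)
    appeared = [pid for pid in source if pid not in pre_restored]
    # Fork parents before their children, so order by depth in the
    # source tree.
    to_fork = sorted(appeared, key=lambda pid: (depth(pid, source), pid))
    return to_kill, to_fork


old = {1: None, 2: 1, 3: 1, 4: 3}   # tree on the destination
new = {1: None, 2: 1, 9: 2, 5: 9}   # tree seen on the source now
print(fixup_plan(old, new))         # ([3, 4], [9, 5])
```

Task 9 is forked before task 5 because 5 is its child, whatever the
pid values are.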

> - There is a possibility that a forked task might call a task that
> has not started yet. What will happen in this case?

The pre-restored tree is not running; it's frozen.

> Besides this, I was thinking of another approach using userfaultfd:
> fork all the tasks, but don't dump the memory, and start the
> processes. Later, when a page is accessed, the page-fault handler is
> invoked, and criu handles the fault by supplying that page. What do
> you think of this approach?

This is what we call "lazy migration" and yes, this is in our plans
too :) But the existing userfaultfd + memcopy API is not enough. The
latter syscall should operate on an arbitrary task's VM, not only on
the current one.
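
As a toy model of the lazy scheme (plain Python, no real userfaultfd
calls; LazyMemory and fetch_page are hypothetical names used only for
illustration), a page is pulled from the source on first access and
served locally after that:

```python
class LazyMemory:
    """Toy model: pages are fetched from the source on first access."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: page number -> bytes
        self.pages = {}               # pages already present locally
        self.faults = 0               # how many times we had to fetch

    def read(self, page_no):
        if page_no not in self.pages:
            # Stand-in for the page fault: ask the source for the page.
            self.faults += 1
            self.pages[page_no] = self.fetch_page(page_no)
        return self.pages[page_no]


source_pages = {0: b"code", 1: b"heap"}   # memory left on the source
mem = LazyMemory(source_pages.__getitem__)
mem.read(1)
mem.read(1)
print(mem.faults)   # 1
```

In the real thing the "fetch" has to fill a page in another task's
address space, which is exactly what the current API cannot do.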

Thanks,
Pavel