[CRIU] Lazy-restore design discussion - round 2

Mon Apr 18 08:51:02 PDT 2016

On 04/18/2016 06:28 PM, Adrian Reber wrote:
> On Mon, Apr 18, 2016 at 02:34:52PM +0300, Pavel Emelyanov wrote:
>> On 04/18/2016 01:47 PM, Adrian Reber wrote:
>>> On Mon, Apr 18, 2016 at 01:10:05PM +0300, Mike Rapoport wrote:
>>>> On Mon, Apr 18, 2016 at 12:31:14PM +0300, Pavel Emelyanov wrote:
>>>>> On 04/18/2016 10:46 AM, Adrian Reber wrote:
>>>>>> It seems we have reached some kind of agreement and therefore
>>>>>> I am trying to summarize, from my point of view, our current discussion
>>>>>> results.
>>>>>
>>>>> Thanks for keeping track of this :)
>>>>
>>>> Adrian, you've beaten me to this :)
>>>>  
>>>>>>  * The UFFD daemon does not need a checkpoint directory to run; all
>>>>>>    required information (e.g. PID and pages) will be transferred over
>>>>>>    the network.
>>>>>
>>>>> I would still read process tree from images dir.
>>>>
>>>> +1
>>>
>>> Hmm, then I don't get how the patches "lazy-pages: handle multiple
>>> processes" fit into it.
>>
>> It doesn't do it yet, indeed. Instead it gets PIDs from the criu restore
>> (but still -- not from the dump side).
>>
>>> How can the uffd daemon handle multiple restore
>>> requests when it needs to know where the checkpoint directory is? 
>>
>> Ah I see. The "handle multiple processes" in the subject means -- handle
>> the restore of the whole tree, not handle multiple restore-s.
>>
>> IOW -- we call criu restore and it goes ahead and forks a tree of tasks
>> each of which goes to uffd for memory. But when we call criu restore
>> a 2nd time, we need another uffd instance.
> 
> So this is all related to Mike's re-sent kernel patches. Are the patches
> merged yet?

Nope :( We will try to ping Andrea again soon.

>>> As I
>>> understand it I can either start it without '-D' and it gets the
>>> information about the pagemap-s from somewhere else (the network) or I
>>> can start the uffd daemon with '-D', but why should it then handle
>>> multiple requests as all the required information is in a directory
>>> specified on the command-line, which can change for every restored
>>> process. From my point of view this seems contradictory.
>>
>> Well, with the above explanation of where "multiple" plays, I think this
>> should look like -- one criu restore works with a single instance of uffd
>> that, in turn, can serve the needs of the whole restored tree. And the
>> uffd can be started by criu restore automatically OR can be started manually.
>> In either case it will know the images dir.
>>
>> BTW, I'm starting to doubt the value of starting uffd via command line :\
>> It looks like spawning the guy from criu restore would work in any case --
>> for live migration and for local restore with delayed memory restore.
> 
> Yes, makes sense. All necessary information can be passed to the UFFD
> daemon which is automatically started by 'criu restore'.
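
A minimal sketch of how the restore side could hand that information to
the daemon over a unix domain socket. This is an illustration only, not
CRIU code: the JSON message layout, the function names and the example
path are all hypothetical, a pipe stands in for the userfaultfd
descriptor, and socket.send_fds()/socket.recv_fds() need Python 3.9 or
newer on a Unix system.

```python
import json
import os
import socket


def send_restore_info(sock, pid, images_dir, uffd):
    """Restore side: send PID and images dir, with the fd attached (SCM_RIGHTS)."""
    # Hypothetical message layout, chosen only for this sketch.
    payload = json.dumps({"pid": pid, "images_dir": images_dir}).encode()
    socket.send_fds(sock, [payload], [uffd])


def recv_restore_info(sock):
    """Daemon side: receive the payload and the passed descriptor."""
    # A small message is assumed to arrive in a single read.
    data, fds, _flags, _addr = socket.recv_fds(sock, 4096, 1)
    info = json.loads(data.decode())
    return info["pid"], info["images_dir"], fds[0]


if __name__ == "__main__":
    restore_end, daemon_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for a real userfaultfd
    send_restore_info(restore_end, os.getpid(), "/tmp/images", write_end)
    pid, images_dir, fd = recv_restore_info(daemon_end)
    os.write(fd, b"ok")  # the received fd refers to the same pipe
    print(pid, images_dir, os.read(read_end, 2))
```

With this shape the daemon needs no '-D' on its command line, because the
images directory arrives together with the PID and the descriptor.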
> 
>>>>>>  * The page-server protocol needs to be extended to transfer the
>>>>>>    lazy-restore pages list from the source system to the UFFD daemon.
>>>>>
>>>>> You mean the pagemap-s? But we have the images dir on the destination node,
>>>>> uffd can read this information from there as well.
>>>
>>> See above, this is still unclear to me. The uffd daemon can handle
>>> multiple requests, but then it cannot read information from an images
>>> directory. We could also transfer the information about the images
>>> directory from the restore process in the same way as the PID and UFFD
>>> (unix domain socket). Then the uffd daemon would know where the
>>> directory is but does not need it on the command-line. It would still
>>> require access to the local file system, which could be avoided by
>>> transferring the pagemap-s from somewhere else.
>>>
>>>> This means that the dump side should be taught to split pagemap creation
>>>> and the actual page dump, right?
>>>
>>> This is part of the tooling work I mentioned below. We probably need
>>> tools to split a checkpoint into a lazy-part and a part which needs to
>>> be transferred and the tools also should be able to combine a previously
>>> lazy-restore checkpoint back to a 'normal' checkpoint directory.
>>
>> Ah. You're talking about the dump itself and the daemon that would
>> read lazy pages from tasks and send them over the network?
>>
>>>>>>  * The UFFD daemon is the instance which decides which pages are pushed
>>>>>>    when via UFFD into the restored process.
>>>>>
>>>>> Not sure I understand this correctly. The UFFD daemon gets #PF-s from tasks
>>>>> and sends the requests to the source node. The source node sends pages to
>>>>> the destination side (some go in out-of-order mode when a #PF request from
>>>>> UFFD is received). The UFFD daemon injects pages into processes right upon
>>>>> receiving them.
>>>
>>> This then also needs to be decided: which part is responsible for the
>>> pages transferred, meaning which part knows which pages have been
>>> requested, which pages have been transferred and which pages are
>>> missing. Improvements such as sending adjacent pages together also need
>>> to be handled somewhere. This can either be the uffd daemon on the
>>> destination system or the page-server on the source system. To me it
>>> feels more correct for this to be done by the uffd daemon and not by the
>>> source system.
>>> That is also the reason that it either needs to read the image directory
>>> or get a list of pages in the pagemap suitable for lazy restore. It
>>> would enable us to leave out any UFFD logic on the source node as it
>>> only needs to be a page-server. If we implement it on the source node
>>> parts of the page-server code need to be more UFFD aware (sending
>>> adjacent pages, sending remaining pages) but on the other hand the uffd
>>> daemon could be reduced to a simple lazy-pages forwarder.
>>
>> I see. Well, my current plan is to make uffd be simple and stupid -- it JUST
>> gets #PFs from uffd-s, sends them to the dump side and injects whatever page
>> arrives from it back into the process' address spaces. In contrast, the
>> dump side should host a daemon that is "smart" i.e. -- it scans the address
>> space of the dumped processes and decides which pages to send first and which
>> next, handling the out-of-order requests from the restore-side daemon.
> 
> Also makes sense.
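
The "smart" dump-side ordering described above could be sketched roughly
as follows. This is only an illustration under stated assumptions, not an
existing CRIU interface: the class name LazyPageSender, its methods and
the fixed 4096-byte page size are all made up for the example. Pages are
sent in a background order, and a #PF request forwarded by the
restore-side daemon moves its page to the front.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size for this sketch


class LazyPageSender:
    """Hypothetical dump-side scheduler: background order plus urgent #PF requests."""

    def __init__(self, lazy_pages):
        self.pending = deque(lazy_pages)  # background send order
        self.unsent = set(lazy_pages)     # pages not yet handed out
        self.urgent = deque()             # out-of-order requests

    def request(self, addr):
        """Record a page-fault request for the page containing addr."""
        page = addr - addr % PAGE_SIZE
        if page in self.unsent and page not in self.urgent:
            self.urgent.append(page)

    def _pop_unsent(self, queue):
        # Skip entries that were already sent through the other queue.
        while queue:
            page = queue.popleft()
            if page in self.unsent:
                self.unsent.discard(page)
                return page
        return None

    def next_page(self):
        """Return the next page address to send, or None when all are sent."""
        page = self._pop_unsent(self.urgent)
        if page is None:
            page = self._pop_unsent(self.pending)
        return page


if __name__ == "__main__":
    sender = LazyPageSender([0, 4096, 8192, 12288])
    print(sender.next_page())   # first page in background order
    sender.request(8200)        # fault inside the page at 8192
    print(sender.next_page())   # the faulted page jumps the queue
    print(sender.next_page())   # background order resumes
```

Keeping this bookkeeping on the dump side is what lets the restore-side
daemon stay a plain forwarder: it only relays the faulting address and
injects whatever page comes back.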
> 
>>>>>> Do we agree on these points? If yes, I would like to start to implement
>>>>>> it that way. If we get to the point where this works it still requires
>>>>>> a lot of work on the tooling. For example how to split out the lazy-pages
>>>>>> from an existing dump, so that only the non-lazy-pages are actually
>>>>>> transferred to the destination system.
> 
> With the new results from today, I will send out a new summary tomorrow.

Cool :) Thanks!

-- Pavel