[CRIU] Lazy-restore design discussion - round 2

Pavel Emelyanov xemul at virtuozzo.com
Mon Apr 18 04:34:52 PDT 2016


On 04/18/2016 01:47 PM, Adrian Reber wrote:
> On Mon, Apr 18, 2016 at 01:10:05PM +0300, Mike Rapoport wrote:
>> On Mon, Apr 18, 2016 at 12:31:14PM +0300, Pavel Emelyanov wrote:
>>> On 04/18/2016 10:46 AM, Adrian Reber wrote:
>>>> It seems we have reached some kind of agreement and therefore
>>>> I am trying to summarize, from my point of view, our current discussion
>>>> results.
>>>
>>> Thanks for keeping track of this :)
>>
>> Adrian, you've beaten me to this :)
>>  
>>>>  * The UFFD daemon does not need a checkpoint directory to run, all
>>>>    required information will be transferred over the network.
>>>>    e.g. PID and pages
>>>
>>> I would still read process tree from images dir.
>>
>> +1
> 
> Hmm, then I don't get how the patches "lazy-pages: handle multiple
> processes" fit into this.

Indeed, it doesn't do that yet. Instead, the daemon gets the PIDs from
criu restore (but still -- not from the dump side).

> How can the uffd daemon handle multiple restore
> requests when it needs to know where the checkpoint directory is? 

Ah, I see. "Handle multiple processes" in the subject means handling the
restore of the whole tree, not handling multiple restores.

IOW -- we call criu restore, and it goes ahead and forks a tree of tasks,
each of which goes to uffd for memory. But when we call criu restore for
the 2nd time, we need another uffd instance.
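
For illustration, this is roughly how each restored task's PID and
userfaultfd could be handed to that uffd instance over a unix socket with
SCM_RIGHTS -- the message layout and helper name below are my assumptions
for the sketch, not the actual protocol in the patches:

/* Illustrative only: how the lazy-pages daemon could receive one task's
 * PID (as payload) and its userfaultfd (as SCM_RIGHTS ancillary data)
 * from the restorer over a unix socket. Struct and names are made up. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct uffd_hello {
	pid_t pid;			/* PID of the restored task */
};

static int recv_task_uffd(int sk, pid_t *pid)
{
	struct uffd_hello hello;
	struct iovec iov = { .iov_base = &hello, .iov_len = sizeof(hello) };
	char control[CMSG_SPACE(sizeof(int))];
	struct msghdr msg = {
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
		.msg_control	= control,
		.msg_controllen	= sizeof(control),
	};
	struct cmsghdr *cmsg;
	int uffd = -1;

	if (recvmsg(sk, &msg, 0) <= 0)
		return -1;

	/* The userfaultfd arrives as SCM_RIGHTS ancillary data */
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_RIGHTS)
			memcpy(&uffd, CMSG_DATA(cmsg), sizeof(int));

	*pid = hello.pid;
	return uffd;		/* one uffd per task in the restored tree */
}

The daemon would then poll all the uffd-s it received this way, one per
task in the tree, within the single restore it serves.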

> As I understand it, I can either start it without '-D', in which case
> it gets the information about the pagemap-s from somewhere else (the
> network), or I can start the uffd daemon with '-D'. But why should it
> then handle multiple requests, when all the required information is in
> a directory specified on the command line, which can change for every
> restored process? From my point of view this seems contradictory.

Well, with the above explanation of where "multiple" plays, I think this
should look like -- one criu restore works with a single instance of uffd
that, in turn, can serve the needs of the whole restored tree. And the
uffd can be started by criu restore automatically OR can be started manually.
In either case it will know the images dir.

BTW, I'm starting to doubt I see the value in starting uffd via the
command line :\ It looks like spawning it from criu restore would work
in any case -- for live migration and for local restore with delayed
memory restore.

>>>>  * The page-server protocol needs to be extended to transfer the
>>>>    lazy-restore pages list from the source system to the UFFD daemon.
>>>
>>> You mean the pagemap-s? But we have the images dir on the destination node;
>>> uffd can read this information from there as well.
> 
> See above, this is still unclear to me. If the uffd daemon can handle
> multiple requests, then it cannot read information from an images
> directory. We could also transfer the information about the images
> directory from the restore process the same way as the PID and the UFFD
> (over the unix domain socket). Then the uffd daemon would know where
> the directory is without needing it on the command line. It would still
> require access to the local file system, which could be avoided by
> transferring the pagemap-s from somewhere else.
> 
>> That means the dump side should be taught to split pagemap creation and
>> the actual page dump, right?
> 
> This is part of the tooling work I mentioned below. We probably need
> tools to split a checkpoint into a lazy part and a part which needs to
> be transferred, and the tools should also be able to combine a previously
> lazily-restored checkpoint back into a 'normal' checkpoint directory.

Ah. You're talking about the dump itself and the daemon that would
read lazy pages from tasks and send them over the network?
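
If that is the direction, such a splitting tool could in principle just
partition the pagemap by a per-region flag. The entry layout and the flag
in the sketch below are invented for illustration -- the real pagemap
images are protobuf-encoded and the criterion may well be different:

/* Illustrative only: partition pagemap entries into an "eager" list that
 * is transferred up front and a "lazy" list left behind for on-demand
 * restore. The entry layout and flag are made up for this example. */
#include <stddef.h>
#include <stdint.h>

struct pm_entry {
	uint64_t	vaddr;		/* start of the region */
	uint32_t	nr_pages;	/* region length in pages */
	uint32_t	flags;		/* hypothetical flags word */
};

#define PME_LAZY	0x1		/* region may be faulted in on demand */

static void split_pagemap(const struct pm_entry *in, size_t n,
			  struct pm_entry *eager, size_t *n_eager,
			  struct pm_entry *lazy, size_t *n_lazy)
{
	size_t i;

	*n_eager = *n_lazy = 0;
	for (i = 0; i < n; i++) {
		if (in[i].flags & PME_LAZY)
			lazy[(*n_lazy)++] = in[i];	/* stays on the source */
		else
			eager[(*n_eager)++] = in[i];	/* sent with the dump */
	}
}

Combining a lazily-restored checkpoint back into a 'normal' one would be
the reverse merge of the two lists plus the fetched pages.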

>>>>  * The UFFD daemon is the instance which decides which pages are pushed
>>>>    when via UFFD into the restored process.
>>>
>>> Not sure I understand this correctly. The UFFD daemon gets #PF-s from tasks
>>> and sends the requests to the source node. The source node sends pages to
>>> the destination side (some go in out-of-order mode when a #PF request from
>>> UFFD is received). The UFFD daemon injects pages into processes right upon
>>> receiving them.
> 
> This then also needs to be decided: which part is responsible for the
> pages transferred, meaning which part knows which pages have been
> requested, which pages have been transferred, and which pages are still
> missing. Improvements such as sending adjacent pages together also need
> to be handled somewhere. This can either be the uffd daemon on the
> destination system or the page-server on the source system. To me it
> feels more correct for this to be done by the uffd daemon and not by the
> source system. That is also the reason it either needs to read the image
> directory or get a list of pages in the pagemap suitable for lazy
> restore. It would allow us to leave out any UFFD logic on the source
> node, as it only needs to be a page-server. If we implement it on the
> source node, parts of the page-server code need to be more UFFD-aware
> (sending adjacent pages, sending remaining pages), but on the other hand
> the uffd daemon could be reduced to a simple lazy-pages forwarder.

I see. Well, my current plan is to make uffd simple and stupid -- it JUST
gets #PFs from the uffd-s, sends them to the dump side, and injects whatever
page arrives from it back into the processes' address spaces. Opposite to it,
the dump side should host a daemon that is "smart", i.e. it scans the address
space of the dumped processes and decides which pages to send first and which
next, handling the out-of-order requests from the restore-side daemon.
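
To make that concrete, here is a minimal sketch of the loop such a
"simple and stupid" restore-side daemon could run: read a fault from the
userfaultfd, ask the dump side for that page, and inject whatever comes
back with UFFDIO_COPY. The request_remote_page() transport helper and the
single-uffd assumption are mine for illustration, not what the patches do,
and error handling is trimmed:

/* Sketch of the dumb restore-side fault loop: forward each #PF to the
 * dump side and inject the returned page with UFFDIO_COPY.
 * request_remote_page() is a hypothetical transport helper. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Ask the dump-side daemon for the page at 'addr', copy it into 'buf' */
extern int request_remote_page(uint64_t addr, void *buf);

static int handle_faults(int uffd, void *page_buf)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct uffd_msg msg;
	struct uffdio_copy copy;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;		/* nothing else expected here */

		uint64_t addr = msg.arg.pagefault.address &
				~((uint64_t)page_size - 1);

		/* Send the fault address to the dump side, wait for the page */
		if (request_remote_page(addr, page_buf) < 0)
			return -1;

		/* Inject the received page into the faulting address space */
		memset(&copy, 0, sizeof(copy));
		copy.dst = addr;
		copy.src = (uint64_t)(uintptr_t)page_buf;
		copy.len = page_size;
		if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
			return -1;
	}

	return 0;
}

With this split, everything ordering-related (adjacent pages, background
pushing of the remaining pages) lives entirely on the dump side.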

>>>> Do we agree on these points? If yes, I would like to start implementing
>>>> it that way. If we get to the point where this works, it will still
>>>> require a lot of work on the tooling. For example, how to split out the
>>>> lazy pages from an existing dump, so that only the non-lazy pages are
>>>> actually transferred to the destination system.

-- Pavel


