[CRIU] Lazy-restore design discussion - round 3

Thu Apr 21 05:22:06 PDT 2016

On 04/21/2016 01:42 PM, Mike Rapoport wrote:
> On Tue, Apr 19, 2016 at 03:41:30PM +0300, Pavel Emelyanov wrote:
>> On 04/19/2016 02:45 PM, Adrian Reber wrote:
>>> On Tue, Apr 19, 2016 at 02:21:02PM +0300, Mike Rapoport wrote:
>>>> On Tue, Apr 19, 2016 at 01:24:09PM +0300, Pavel Emelyanov wrote:
>>>>> On 04/19/2016 12:39 PM, Adrian Reber wrote:
>>>>>> The new summary:
>>>>>>
>>>>>>  * On the source system there will be a process listening on a network
>>>>>>    socket. In the first implementation it will use a checkpoint
>>>>>>    directory as the basis for the UFFD pages and in a later version
>>>>>>    we will add the possibility to transfer the pages directly from the
>>>>>>    checkpointed process.
>>>>>
>>>>> Yes, and in the latter case the daemon will be started automatically by
>>>>> criu dump.
>>>>  
>>>> Why is an additional process needed on the dump side? Why can't criu dump
>>>> itself go into "daemon mode" after collecting the pagemaps and inserting the
>>>> memory pages into the page-pipe?
>>>
>>> I had the same question. But whether it fork()s or uses some other mechanism
>>> to go into daemon mode sounds like an implementation detail...
>>
>> Agreed.
>>
>>>>>>  * The UFFD daemon is the instance which decides which pages are pushed
>>>>>>    when via UFFD into the restored process.
>>>>>
>>>>> No, from my perspective the uffd daemon (restore side) should be passive and
>>>>> only forward PF-s to the dump side and inject into tasks' address spaces
>>>>> whatever pages arrive from the dump side.
>>>>
>>>> This one is tough :)
>>>
>>> Yes, it is. This seems to be the main point of discussion.
>>>
>>>> I'm more biased towards making the receive side the smart one and the dump
>>>> side the dumb one.
>>>
>>> It seems I am again biased towards the other direction ;-)
>>>
>>>> I'd suggest that we start with teaching uffd to get pages over the network
>>>> instead of from the checkpoint directory on the destination, and after that
>>>> works we'll see which side should be the smart one.
>>>
>>> The current implementation I have (on top of Mike's page-server
>>> extension patch) does exactly that. But if we want the uffd daemon
>>> (restore side) to be passive then there is no need to open the
>>> checkpoint directory.
>>>
>>> Maybe we really should implement it like Mike said. First try to get the
>>> patches that currently exist locally on my and on Mike's systems into shape and
>>> then we can decide if we want to move the page handling logic to the
>>> dump side on the destination system.
>>
>> OK, let's see how it goes.
>>
>> But I have one concern about having brains on restore side. Look, the uffd can request
>> two kinds (or types) of pages -- those that tasks are blocked on in #PF (i.e. --
>> explicit uffd requests) and those that tasks haven't yet touched (i.e. -- request them
>> in advance). With the former pages the situation is clear, it's uffd who knows what
>> these pages are. It can even know something about the latter pages, e.g. with #PF-ed
>> pages request adjacent pages as Adrian proposed. That's clear. But what to do
>> with other "in advance" pages? It seems that it's better to request those pages in
>> LRU manner, i.e. -- request recently used pages before those that were used long ago. But
>> the problem I see is that this LRU information can only be obtained from the dump
>> side -- all these LRU statistics sit _there_. And what would be the way to share
>> this knowledge with the restore side (as we plan to make it "smart" or "active")?
>>
>> Had we the "brain" (or "active part") on dump side we could just scan this info and
>> make a decision. But what to do when we have the "brain" on restore side and all the LRU
>> info on the dump side?
> 
> Well, how about something like this:
> - the restore side requests a certain number of pages either because of #PF
>   or because it's ready to receive "in advance" pages.
> - the dump side decides what pages need to be sent "in advance" within the
>   amount requested by the receive side.
> For example, restore side got a #PF at address X and it is ready to receive
> 3 pages. The dump side can send pages at (X-2, X-1, X), or at (X-1, X, X+1)
> or at (X, X+1, X+2). The decision about which set to choose may be driven by
> LRU considerations and/or some other heuristics.
> As for the "in advance" pages, the restore side just says "I can get 20
> pages now" and the dump side decides what 20 pages are going to be
> transferred.

This splits the "brains" between dump and restore side. Also -- how does
restore side decide how many pages it's ready to receive?
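
To make the quoted scheme concrete: the restore side sends a fault address
and a page budget, and the dump side picks which run of adjacent pages to
ship. Below is a minimal sketch of that choice, not CRIU code. The function
name choose_window and the per-page hotness map are hypothetical stand-ins
for whatever LRU data the dump side can actually collect.

```python
def choose_window(fault_page, count, hotness):
    """Pick `count` consecutive page numbers that include `fault_page`.

    `hotness` maps page number -> score (higher means used more recently).
    It is a hypothetical stand-in for dump-side LRU statistics.
    Among all windows that contain the faulting page, the one with the
    highest total score wins; ties go to the lowest start page.
    """
    if count < 1:
        raise ValueError("count must be at least 1")

    best_start, best_score = None, None
    first = max(0, fault_page - count + 1)  # do not go below page 0
    for start in range(first, fault_page + 1):
        score = sum(hotness.get(p, 0) for p in range(start, start + count))
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return list(range(best_start, best_start + count))


# Fault at page 10, restore side can take 3 pages.
pages = choose_window(10, 3, {9: 1, 11: 5, 12: 3})
```

With the scores above the dump side would send (X, X+1, X+2), because pages
11 and 12 are the hottest neighbours. The restore side still controls the
budget; only the selection moves to the dump side.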

> I think in this way we will be able to take LRU into account and, at the
> same time, we'll have better control of network bandwidth consumption at
> the restore side.

Maybe we'll just equip the pagemap.img with additional data -- the relative
"hotness" of the respective pagemaps and restore side would first request
pages from the hottest ones?
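
A rough sketch of how such a hotness hint might drive the "in advance"
requests, assuming each pagemap entry carried one extra number. The
hotness field, the PagemapEntry class and prefetch_order are illustrative
only; nothing like this exists in pagemap.img today, and a 4096-byte page
size is assumed.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the sketch


@dataclass
class PagemapEntry:
    vaddr: int      # start address of the region
    nr_pages: int   # number of pages in the region
    hotness: int    # hypothetical: relative hotness written at dump time


def prefetch_order(entries, already_loaded=frozenset()):
    """Yield page addresses to request, hottest region first.

    Pages listed in `already_loaded` (e.g. ones already served because
    of a #PF) are skipped. Regions with equal hotness keep file order.
    """
    for entry in sorted(entries, key=lambda e: e.hotness, reverse=True):
        for i in range(entry.nr_pages):
            addr = entry.vaddr + i * PAGE_SIZE
            if addr not in already_loaded:
                yield addr


entries = [PagemapEntry(0x1000, 2, hotness=1),
           PagemapEntry(0x10000, 1, hotness=9)]
order = list(prefetch_order(entries, already_loaded={0x2000}))
```

Here the single hot page at 0x10000 is requested before the colder region,
and 0x2000 is skipped because it has already arrived.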

-- Pavel