[CRIU] Lazy-restore design discussion - round 3

Thu Apr 21 05:48:45 PDT 2016

On 04/21/2016 03:45 PM, Mike Rapoport wrote:
> On Thu, Apr 21, 2016 at 03:22:06PM +0300, Pavel Emelyanov wrote:
>> On 04/21/2016 01:42 PM, Mike Rapoport wrote:
>>> On Tue, Apr 19, 2016 at 03:41:30PM +0300, Pavel Emelyanov wrote:
>>>> On 04/19/2016 02:45 PM, Adrian Reber wrote:
>>>>> On Tue, Apr 19, 2016 at 02:21:02PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Apr 19, 2016 at 01:24:09PM +0300, Pavel Emelyanov wrote:
>>>>>>> On 04/19/2016 12:39 PM, Adrian Reber wrote:
>>>>>>>> The new summary:
>>>>>>>>
>>>>>>>>  * On the source system there will be a process listening on a network
>>>>>>>>    socket. In the first implementation it will use a checkpoint
>>>>>>>>    directory as the basis for the UFFD pages and in a later version
>>>>>>>>    we will add the possibility to transfer the pages directly from the
>>>>>>>>    checkpointed process.
>>>>>>>
>>>>>>> Yes, and in the latter case the daemon will be started automatically by
>>>>>>> criu dump.
>>>>>>  
>>>>>> Why is an additional process needed on the dump side? Why can't criu dump
>>>>>> itself go into "daemon mode" after collecting the pagemaps and inserting
>>>>>> the memory pages into the page-pipe?
>>>>>
>>>>> I had the same question. But whether it fork()s or uses some other
>>>>> mechanism to go into daemon mode sounds like an implementation detail...
>>>>
>>>> Agreed.
>>>>
>>>>>>>>  * The UFFD daemon is the instance which decides which pages are pushed
>>>>>>>>    when via UFFD into the restored process.
>>>>>>>
>>>>>>> No, from my perspective the uffd daemon (restore side) should be passive and
>>>>>>> only forward PF-s to the dump side and inject into tasks' address spaces
>>>>>>> whatever pages arrive from the dump side.
>>>>>>
>>>>>> This one is tough :)
>>>>>
>>>>> Yes, it is. This seems to be the main point of discussion.
>>>>>
>>>>>> I'm more biased towards making the receive side the smart one and the dump
>>>>>> side the dumb one.
>>>>>
>>>>> It seems I am again biased towards the other direction ;-)
>>>>>
>>>>>> I'd suggest that we start with teaching uffd to get pages over the network
>>>>>> instead of from the checkpoint directory on the destination, and once that
>>>>>> works we'll see which side should be the smart one.
>>>>>
>>>>> The current implementation I have (on top of Mike's page-server
>>>>> extension patch) does exactly that. But if we want the uffd daemon
>>>>> (restore side) to be passive then there is no need to open the
>>>>> checkpoint directory.
>>>>>
>>>>> Maybe we really should implement it like Mike said. First try to get the
>>>>> patches that currently exist only locally on my and on Mike's systems into
>>>>> shape, and then we can decide if we want to move the page handling logic
>>>>> from the destination system to the dump side.
>>>>
>>>> OK, let's see how it goes.
>>>>
>>>> But I have one concern about having brains on the restore side. Look, uffd can
>>>> request two kinds (or types) of pages -- those that tasks are blocked on in #PF
>>>> (i.e. -- explicit uffd requests) and those that tasks haven't touched yet (i.e. --
>>>> pages requested in advance). With the former the situation is clear, it's uffd who
>>>> knows what these pages are. It can even know something about the latter, e.g.
>>>> for #PF-ed pages it can request the adjacent pages, as Adrian proposed. That's
>>>> clear. But what to do with the other "in advance" pages? It seems that it's better
>>>> to request those pages in LRU manner, i.e. -- request recently used pages before
>>>> those that were used long ago. But the problem I see is that this LRU information
>>>> can only be obtained from the dump side -- all these LRU statistics sit _there_.
>>>> And what would be the way to share this knowledge with the restore side (as we
>>>> plan to make it "smart" or "active")?
>>>>
>>>> Had we the "brain" (or "active part") on the dump side, we could just scan this info
>>>> and make the decision. But what to do when we have the "brain" on the restore side
>>>> and all the LRU info on the dump side?
>>>
>>> Well, how about something like this:
>>> - the restore side requests a certain number of pages, either because of a #PF
>>>   or because it's ready to receive "in advance" pages.
>>> - the dump side decides which pages need to be sent "in advance" within the
>>>   amount requested by the restore side.
>>> For example, the restore side got a #PF at address X and it is ready to receive
>>> 3 pages. The dump side can send pages at (X-2, X-1, X), or at (X-1, X, X+1)
>>> or at (X, X+1, X+2). The decision which set to choose may be driven by LRU
>>> considerations and/or some other heuristics.
>>> As for the "in advance" pages, the restore side just says "I can get 20
>>> pages now" and the dump side decides which 20 pages are going to be
>>> transferred.
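
The window choice described above can be sketched roughly as follows. This is
only an illustration, not CRIU code: choose_window(), choose_in_advance() and
the hotness() callback are hypothetical names, and "hotness" stands in for
whatever LRU-like statistic the dump side can actually collect.

```python
from typing import Callable, Iterable, List


def choose_window(fault_page: int, budget: int,
                  hotness: Callable[[int], float]) -> List[int]:
    """Pick `budget` consecutive pages that include fault_page.

    Among all windows containing the faulting page, return the one whose
    pages have the highest total hotness (hypothetical dump-side metric).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    best: List[int] = []
    best_score = None
    # Slide the window so the faulting page takes each position in turn:
    # (X-2, X-1, X), (X-1, X, X+1), (X, X+1, X+2) for budget == 3.
    for start in range(fault_page - budget + 1, fault_page + 1):
        if start < 0:
            continue  # window would begin before page 0
        window = list(range(start, start + budget))
        score = sum(hotness(p) for p in window)
        if best_score is None or score > best_score:
            best, best_score = window, score
    return best


def choose_in_advance(pending: Iterable[int], budget: int,
                      hotness: Callable[[int], float]) -> List[int]:
    """Pick up to `budget` not-yet-sent pages, hottest first."""
    return sorted(pending, key=hotness, reverse=True)[:budget]
```

In both helpers the restore side only supplies the budget, while the decision
about which pages to send stays with the side that owns the statistics.
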
>>
>> This splits the "brains" between the dump and restore sides.
> 
> Well, it rather creates two of them :)
> 
>> Also -- how does the dump side decide how many pages it's ready to receive?
> 
> You mean restore, right?

Ah, yes :)

> To begin with I'd go with 1 page at #PF time and "all the rest" when the
> restore side gets into handle_remaining_pages.
> Then, we can experiment with adding several pages to the #PF case, starting to
> fetch "all the rest" in the background and what not...
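
A minimal model of that starting policy, assuming pages are identified by
plain page numbers. DumpSideStub and lazy_restore() are made-up names for
illustration, and install() only stands in for the step where a real uffd
daemon would copy the page into the task's address space.

```python
from typing import Dict, Iterable, List


class DumpSideStub:
    """Hypothetical stand-in for the dump side: serves pages by number."""

    def __init__(self, pages: Dict[int, bytes]) -> None:
        self.pages = pages

    def fetch(self, page_nr: int) -> bytes:
        return self.pages[page_nr]


def lazy_restore(faults: Iterable[int], source: DumpSideStub) -> List[int]:
    """Return the order in which pages end up installed in the task."""
    installed: Dict[int, bytes] = {}
    order: List[int] = []

    def install(page_nr: int) -> None:
        # Placeholder for copying the page into the restored task.
        installed[page_nr] = source.fetch(page_nr)
        order.append(page_nr)

    # Phase 1: exactly one page per #PF, nothing speculative.
    for page_nr in faults:
        if page_nr not in installed:
            install(page_nr)

    # Phase 2: the "handle_remaining_pages" step, fetch whatever is missing.
    for page_nr in sorted(source.pages):
        if page_nr not in installed:
            install(page_nr)

    return order
```

Adding several pages per #PF or fetching the rest in the background would
change only the two loops, so this shape leaves room for those experiments.
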

Hm... OK.

>>> I think this way we will be able to take LRU into account and, at the
>>> same time, we'll have better control of network bandwidth consumption at
>>> the restore side.
>>
>> Maybe we'll just equip the pagemap.img with additional data -- the relative
>> "hotness" of the respective pagemaps -- and the restore side would first request
>> pages from the hottest ones?
> 
> Maybe, but I'm not sure that "hotness" of the pagemap gives sufficient
> granularity...
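
To make the granularity question concrete, here is a small sketch of
per-pagemap hotness. The PagemapEntry class below is only loosely modelled on
a pagemap record (start address plus page count); the hotness field and both
helper functions are assumptions made for illustration, not an existing image
format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PagemapEntry:
    vaddr: int      # start address of the region
    nr_pages: int   # number of pages in the region
    hotness: int    # hypothetical: higher means more recently used


def request_order(entries: List[PagemapEntry]) -> List[PagemapEntry]:
    """Order in which regions would be requested, hottest first."""
    # sorted() is stable, so equally hot regions keep their image order.
    return sorted(entries, key=lambda e: e.hotness, reverse=True)


def pages_before(entries: List[PagemapEntry], vaddr: int) -> int:
    """Count pages transferred before the region starting at vaddr."""
    total = 0
    for entry in request_order(entries):
        if entry.vaddr == vaddr:
            return total
        total += entry.nr_pages
    raise KeyError(vaddr)
```

With one hotness value per region, a single large hot region pushes every
other region behind all of its pages, including the cold ones, which is the
granularity limit mentioned above.
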

True. OK, let's try to go with brains on the restore side.

-- Pavel