[CRIU] Lazy-restore design discussion - round 3

Thu Apr 21 05:45:37 PDT 2016

On Thu, Apr 21, 2016 at 03:22:06PM +0300, Pavel Emelyanov wrote:
> On 04/21/2016 01:42 PM, Mike Rapoport wrote:
> > On Tue, Apr 19, 2016 at 03:41:30PM +0300, Pavel Emelyanov wrote:
> >> On 04/19/2016 02:45 PM, Adrian Reber wrote:
> >>> On Tue, Apr 19, 2016 at 02:21:02PM +0300, Mike Rapoport wrote:
> >>>> On Tue, Apr 19, 2016 at 01:24:09PM +0300, Pavel Emelyanov wrote:
> >>>>> On 04/19/2016 12:39 PM, Adrian Reber wrote:
> >>>>>> The new summary:
> >>>>>>
> >>>>>>  * On the source system there will be a process listening on a network
> >>>>>>    socket. In the first implementation it will use a checkpoint
> >>>>>>    directory as the basis for the UFFD pages and in a later version
> >>>>>>    we will add the possibility to transfer the pages directly from the
> >>>>>>    checkpointed process.
> >>>>>
> >>>>> Yes, and in the latter case the daemon will be started automatically by
> >>>>> criu dump.
> >>>>
> >>>> Why is an additional process needed on the dump side? Why can't criu dump
> >>>> itself go into "daemon mode" after collecting the pagemaps and inserting
> >>>> the memory pages into the page-pipe?
> >>>
> >>> I had the same question. But whether it fork()s or uses some other
> >>> mechanism to go into daemon mode sounds like an implementation detail...
> >>
> >> Agreed.
> >>
> >>>>>>  * The UFFD daemon is the instance which decides which pages are pushed
> >>>>>>    when via UFFD into the restored process.
> >>>>>
> >>>>> No, from my perspective the uffd daemon (restore side) should be passive
> >>>>> and only forward PF-s to the dump side and inject into tasks' address
> >>>>> spaces whatever pages arrive from the dump side.
> >>>>
> >>>> This one is tough :)
> >>>
> >>> Yes, it is. This seems to be the main point of discussion.
> >>>
> >>>> I'm more biased towards making the receive side the smart one and the dump
> >>>> side the dumb one.
> >>>
> >>> It seems I am again biased towards the other direction ;-)
> >>>
> >>>> I'd suggest that we start with teaching uffd to get pages over the network
> >>>> instead of from the checkpoint directory on the destination, and after
> >>>> that works we'll see which side should be the smart one.
> >>>
> >>> The current implementation I have (on top of Mike's page-server
> >>> extension patch) does exactly that. But if we want the uffd daemon
> >>> (restore side) to be passive then there is no need to open the
> >>> checkpoint directory.
> >>>
> >>> Maybe we really should implement it like Mike said. First try to get the
> >>> patches that currently exist only locally on my and on Mike's systems into
> >>> shape and then we can decide if we want to move the page handling logic to
> >>> the dump side on the destination system.
> >>
> >> OK, let's see how it goes.
> >>
> >> But I have one concern about having brains on the restore side. Look, the uffd
> >> can request two kinds (or types) of pages -- those that tasks are blocked on in
> >> #PF (i.e. -- explicit uffd requests) and those that tasks haven't yet touched
> >> (i.e. -- request them in advance). With the former pages the situation is clear,
> >> it's uffd who knows what these pages are. It can even know something about the
> >> latter pages, e.g. with #PF-ed pages request adjacent pages as Adrian proposed.
> >> That's clear. But what to do with the other "in advance" pages? It seems that
> >> it's better to request those pages in an LRU manner, i.e. -- request recently
> >> used pages before those that were used long ago. But the problem I see is that
> >> this LRU information can only be obtained from the dump side -- all these LRU
> >> statistics sit _there_. And what would be the way to share this knowledge with
> >> the restore side (as we plan to make it "smart" or "active")?
> >>
> >> Had we the "brain" (or "active part") on the dump side we could just scan this
> >> info and make the decision. But what to do when we have the "brain" on the
> >> restore side and all the LRU info on the dump side?
> > 
> > Well, how about something like this:
> > - the restore side requests a certain number of pages, either because of a
> >   #PF or because it's ready to receive "in advance" pages.
> > - the dump side decides which pages need to be sent "in advance" within the
> >   amount requested by the receive side.
> > For example, the restore side gets a #PF at address X and is ready to receive
> > 3 pages. The dump side can send pages at (X-2, X-1, X), or at (X-1, X, X+1)
> > or at (X, X+1, X+2). The decision which set to choose may be driven by LRU
> > considerations and/or some other heuristics.
> > As for the "in advance" pages, the restore side just says "I can get 20
> > pages now" and the dump side decides which 20 pages are going to be
> > transferred.
> 
> This splits the "brains" between dump and restore side.

Well, it rather creates two of them :)

> Also -- how does dump side decide how many pages it's ready to receive?

You mean restore, right?
To begin with I'd go with 1 page at #PF time and "all the rest" when the
restore side gets into handle_remaining_pages.
Then we can experiment with adding several pages to the #PF case, starting
to fetch "all the rest" in the background and what not...
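
A minimal sketch of the exchange proposed in the quoted text above (Python
for brevity, not CRIU code): the restore side names a faulting page and how
many pages it can take, and the dump side picks the window. The names
choose_window and choose_in_advance are hypothetical, and the hotness map is
an assumed stand-in for whatever LRU data the dump side really has.

```python
def choose_window(fault_pfn, nr_pages, hotness, present):
    """Dump side: pick up to nr_pages consecutive pages that include fault_pfn.

    hotness: dict pfn -> score, higher means used more recently (assumption).
    present: set of pfns the dump side still holds.
    """
    best, best_score = None, None
    # Slide the window from (X-n+1 .. X) up to (X .. X+n-1).
    for start in range(fault_pfn - nr_pages + 1, fault_pfn + 1):
        window = [p for p in range(start, start + nr_pages) if p in present]
        if fault_pfn not in window:
            continue
        score = sum(hotness.get(p, 0) for p in window)
        if best_score is None or score > best_score:
            best, best_score = window, score
    return best or []


def choose_in_advance(nr_pages, hotness, present):
    """Dump side: restore side said "I can get nr_pages now"; send the hottest."""
    return sorted(present, key=lambda p: (-hotness.get(p, 0), p))[:nr_pages]
```

With nr_pages = 1 the first function degenerates to "send exactly the
faulting page", which matches the starting point suggested above.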

> > I think in this way we will be able to take LRU into account and, at the
> > same time, we'll have better control of network bandwidth consumption at
> > the restore side.
> 
> Maybe we'll just equip the pagemap.img with additional data -- the relative
> "hotness" of the respective pagemaps and dump side would first request
> pages from the hottest ones?

Maybe, but I'm not sure that "hotness" of the pagemap gives sufficient
granularity...
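
To illustrate the granularity concern, here is a rough sketch (Python, not
CRIU code). It assumes a pagemap entry describes a contiguous range by start
address and page count, and adds a hypothetical per-entry hotness field;
PagemapEntry and request_order are made-up names.

```python
from dataclasses import dataclass


@dataclass
class PagemapEntry:
    vaddr: int      # start address of the contiguous range
    nr_pages: int   # number of pages in the range
    hotness: int    # hypothetical: one score for the whole range


def request_order(entries, page_size=4096):
    """Return page addresses in request order, hottest entry first."""
    ordered = sorted(entries, key=lambda e: -e.hotness)
    return [e.vaddr + i * page_size for e in ordered for i in range(e.nr_pages)]
```

Since the score is attached to the whole entry, every page of a large range
is ordered the same way, so a single hot page inside an otherwise cold range
cannot be singled out.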

> -- Pavel

--
Sincerely yours,
Mike.