[CRIU] [PATCH v3 2/2] lazy-pages: add support to combine pre-copy and post-copy

Thu Sep 22 01:36:36 PDT 2016

On Thu, Sep 22, 2016 at 09:31:41AM +0200, Adrian Reber wrote:
> On Wed, Sep 21, 2016 at 09:44:49AM +0300, Mike Rapoport wrote:
> > Hi Adrian,
> > 
> > On Tue, Sep 20, 2016 at 06:54:11PM +0200, Adrian Reber wrote:
> > > From: Adrian Reber <areber at redhat.com>
> >  
> > [snip]
> >  
> > > v2:
> > >  - changed parent detection to use pagemap_in_parent()
> > > 
> > > v3:
> > >  - unfortunately this reverts
> > >    c11cf95afbe023a2816a3afaecb65cc4fee670d7
> > >    "criu: mem: skip lazy pages during restore based on pagemap info"
> > >    To be able to split the VMA-s in the right chunks for the restorer
> > >    it is necessary to make the decision lazy or not on the VmaEntry
> > >    level.
> > 
> > I've thought a little bit more about it and I'm not sure it is necessary to
> > split VMAs at all. The restorer can register the entire VMA with
> > userfaultfd, even if the VMA contains pages that are already restored. We
> > just need to make sure uffd.c can properly handle the -EEXIST case, and it seems
> > we are good.
> > 
> > Consider the following scenario:
> > There is a VMA that spans from 0x10000 to 0x20000 (16 pages). Let's say
> > that the range from 0x10000 to 0x1a000 is dumped during pre-dump and there
> > were no changes in that memory, so during dump the range 0x10000 - 0x1a000
> > will be marked with PE_PARENT, and the range 0x1a000 - 0x20000 will be
> > marked PE_LAZY.
> > During restore, the range marked as PE_PARENT will be filled with the
> > content from the disk image and the range marked PE_LAZY will remain
> > unpopulated.
> > The restorer will register the entire VMA (0x10000 - 0x20000) with userfaultfd
> > and the lazy-pages daemon will consider the entire range as lazy.
> > However, the pages at 0x10000 - 0x1a000 are already present, therefore
> > access to these pages won't cause a page fault.
> 
> Sorry for all the emails. Why will accessing pages in the range 0x10000
> - 0x1a000 not cause a page fault? I see that it works, but I am not sure
> why it does not cause a page fault any more. Is it because we copied
> data to the address before we remap it? I guess I forgot how userfaultfd
> works. We prepare the pages for userfaultfd, then we remap the pages to
> the final destination. But we have never written data to those pages,
> before or after the remapping. Therefore a page fault occurs. If it
> contains data from a parent checkpoint this means that we have copied
> data to the memory range before remapping it and no page fault occurs.
> If we wanted userfaultfd to work on pages with previously copied data we
> would need to run madvise() on those pages. Ah, I guess I understand it
> again.

When we register a memory range with userfaultfd we use
UFFDIO_REGISTER_MODE_MISSING, which means that only page faults caused by
access to a non-present page are delivered to userspace.
But since we already copied the data to some of the pages, they now have
the PRESENT bit set in the page table, and access to those pages won't
generate a page fault.
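
To make the scenario above concrete, here is a small userspace model of it.
It is only a sketch: it does not call userfaultfd at all, and the names
(struct vma_model, vma_populate(), vma_lazy_copy() and so on) are made up
for illustration, they are not CRIU or kernel functions. The 4K page size
is also an assumption. The model tracks one "present" flag per page, treats
an access as faulting only when the page is missing, and returns -EEXIST
when the daemon side tries to fill a page that is already there.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 0x1000UL	/* assumed 4K pages */
#define MODEL_MAX_PAGES 64

/* Hypothetical model of one VMA; present[i] mirrors the PRESENT bit. */
struct vma_model {
	unsigned long start;
	unsigned long end;
	bool present[MODEL_MAX_PAGES];
};

size_t vma_nr_pages(const struct vma_model *v)
{
	return (v->end - v->start) / MODEL_PAGE_SIZE;
}

/* A freshly registered VMA has no pages populated. */
void vma_init(struct vma_model *v, unsigned long start, unsigned long end)
{
	assert(start < end);
	assert((end - start) / MODEL_PAGE_SIZE <= MODEL_MAX_PAGES);
	v->start = start;
	v->end = end;
	for (size_t i = 0; i < MODEL_MAX_PAGES; i++)
		v->present[i] = false;
}

/* Restore step: fill [from, to) from the image (the PE_PARENT range). */
void vma_populate(struct vma_model *v, unsigned long from, unsigned long to)
{
	assert(from >= v->start && to <= v->end);
	for (unsigned long addr = from; addr < to; addr += MODEL_PAGE_SIZE)
		v->present[(addr - v->start) / MODEL_PAGE_SIZE] = true;
}

/* In MISSING mode only an access to a non-present page raises a fault. */
bool vma_access_faults(const struct vma_model *v, unsigned long addr)
{
	assert(addr >= v->start && addr < v->end);
	return !v->present[(addr - v->start) / MODEL_PAGE_SIZE];
}

/* Daemon side: fill one page, or report -EEXIST if it is already there. */
int vma_lazy_copy(struct vma_model *v, unsigned long addr)
{
	size_t idx;

	assert(addr >= v->start && addr < v->end);
	idx = (addr - v->start) / MODEL_PAGE_SIZE;
	if (v->present[idx])
		return -EEXIST;
	v->present[idx] = true;
	return 0;
}

/* Number of pages that would still fault on first access. */
size_t vma_count_faulting(const struct vma_model *v)
{
	size_t n = 0;

	for (size_t i = 0; i < vma_nr_pages(v); i++)
		if (!v->present[i])
			n++;
	return n;
}
```

With vma_init(&v, 0x10000, 0x20000) followed by
vma_populate(&v, 0x10000, 0x1a000), the model has 16 pages of which 6
(0x1a000 - 0x20000) would still fault, and vma_lazy_copy() on any address
in the first 10 pages returns -EEXIST. That is the case uffd.c has to
tolerate if the whole VMA is registered.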

> 		Adrian
>