[CRIU] Combining pre-copy and post-copy

Thu May 18 05:01:20 PDT 2017

On Thu, May 18, 2017 at 12:01:25PM +0300, Pavel Emelyanov wrote:
> On 05/18/2017 08:33 AM, Mike Rapoport wrote:
> > Hi all,
> > 
> > On Mon, Feb 13, 2017 at 11:02:59AM +0100, Adrian Reber wrote:
> >> Hello Mike,
> >>
> >> I have to come back to an old topic from September:
> >>
> >> https://lists.openvz.org/pipermail/criu/2016-September/031672.html
> >>
> >> I am currently trying to restore a process with pre-copy and post-copy
> >> and it fails.
> > 
> > It's been a while since the topic was brought up :)
> > Anyway, after some off-list exchanges and attempts to debug the issue with
> > combining pre-copy and post-copy, I think I have a theory that explains
> > what went wrong.
> > 
> > When we do lazy restore after a round of pre-dump, the memory restore can
> > be roughly outlined as:
> > 
> > criu restore:
> > * Map VMAs at arbitrary address
> > * Fill in the pages that reside in the pre-dump
> > * Remap VMAs to their original address
> > * Register VMAs with uffd
> > 
> > criu-lazy-pages:
> > * Populate pages on demand
> > * Populate remaining pages
> > 
> > Note that when we do lazy restore *without* pre-dump, the memory of
> > uffd-monitored VMAs is not populated.
> >  
> > Now, to the problem itself. When we fill the memory contents from the
> > pre-dump, mappings are populated with pages which become subject to
> > khugepaged collapses. During a khugepaged collapse, a new huge page is
> > allocated, the contents of the original pages are copied there, and the new
> > page is mapped into the process address space instead of the small pages
> > that were there originally. This effectively kills the non-present gaps
> > that were in the mapping between the pages with content:
> > 
> > address  | small pages       | huge page
> > ---------+-------------------+-----------------
> > 0x1000   | page with data    | page with data
> > 0x2000   | pages not present |
> > ...      |                   |
> > 0x1b000  | pages with data   |
> > 0x1f000  | pages not present |
> > 0xff000  |                   |
> > 0x100000 | end of 2M region  | end of the page
> > 
> > For the pure lazy restore case, the mappings are empty until they are
> > registered with uffd and khugepaged does not attempt to collapse the pages
> > in uffd-enabled VMAs.
> > 
> > I could think of three possible ways to resolve this issue:
> > 
> > * Disable THP before memory restore and re-enable it once all the VMAs are
> > registered with uffd. The drawback is that it would cause unnecessary huge
> > page splits and collapses and mess up the TLB.
> > * Use madvise(MADV_NOHUGEPAGE) before memory restore and
> > madvise(MADV_HUGEPAGE) after VMAs are registered with uffd. It'll work most
> > of the time, but for applications that use these madvise() settings
> > themselves we may get it wrong in the end.
> > * Try to add madvise(MADV_CLR_NOHUGEPAGE) to the kernel. Then we can use
> > madvise(MADV_NOHUGEPAGE) right after mmap() and reset that flag after the
> > VMA is registered with uffd.
> > 
> > Suggestions?
> 
> I'd make khugepaged ignore VMAs that are under UFFD. Or, at least, prevent
> it from merging present pages with non-present (holes) for such VMAs.

khugepaged does not merge holes for VMAs with uffd. The problem is that
there is enough time between restore_priv_vma_content() and enable_uffd()
for khugepaged to scan and merge some pages.
After writing this, I realized that we can call enable_uffd() after mmap()
and before restore_priv_vma_content(). Since we can already deal with
mremap() in lazy-pages, it might work...
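
For comparison, here is a rough sketch of the madvise() option from the
list quoted above, just to illustrate the ordering of the calls. The helper
names (thp_hint(), map_and_prefill(), count_present(), reenable_thp()) are
made up for this example and are not CRIU functions, and the uffd
registration step is left out entirely. It assumes Linux and a page-aligned
anonymous mapping; treating EINVAL from madvise() as "THP not built in" is
also an assumption of the sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply a THP hint to a mapping we created ourselves.
 * EINVAL is tolerated on the assumption that it means the kernel was built
 * without THP support (the address and length here are always page-aligned).
 */
static int thp_hint(void *addr, size_t len, int advice)
{
	if (madvise(addr, len, advice) == 0)
		return 0;
	return errno == EINVAL ? 0 : -1;
}

/*
 * Hypothetical stand-in for "mmap + fill pages from pre-dump": map npages
 * of anonymous memory, mark it MADV_NOHUGEPAGE right away, then touch only
 * every stride-th page so that holes remain between the populated pages.
 */
static void *map_and_prefill(size_t npages, size_t stride)
{
	size_t psize = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * psize;
	char *p;

	assert(stride > 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Keep huge pages away from this VMA while it is being populated. */
	if (thp_hint(p, len, MADV_NOHUGEPAGE)) {
		munmap(p, len);
		return NULL;
	}

	for (size_t i = 0; i < npages; i += stride)
		memset(p + i * psize, 0xab, psize);

	return p;
}

/* Count the pages mincore() reports as resident, or -1 on error. */
static long count_present(void *addr, size_t npages)
{
	size_t psize = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *vec = malloc(npages);
	long present = 0;

	if (!vec)
		return -1;
	if (mincore(addr, npages * psize, vec)) {
		free(vec);
		return -1;
	}
	for (size_t i = 0; i < npages; i++)
		present += vec[i] & 1;
	free(vec);
	return present;
}

/* To be called only after the VMA has been registered with uffd. */
static int reenable_thp(void *addr, size_t npages)
{
	size_t psize = (size_t)sysconf(_SC_PAGESIZE);

	return thp_hint(addr, npages * psize, MADV_HUGEPAGE);
}
```

The intended order would be map_and_prefill(), then the uffd registration
(not shown), then reenable_thp(). count_present() is only there to check
that the holes between the pre-filled pages are still not present before
the hint is flipped back. As noted in the quoted list, flipping the hint
back unconditionally is wrong for applications that set MADV_NOHUGEPAGE
themselves.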

> -- Pavel
> 
> > --
> > Sincerely yours,
> > Mike.
> > 
> > .
> > 
>