[CRIU] combining pre-copy and post-copy doesn't work anymore

Adrian Reber adrian at lisas.de
Tue Jan 17 06:20:00 PST 2017


On Tue, Jan 17, 2017 at 11:50:37AM +0200, Mike Rapoport wrote:
> > I just tried lazy migration in combination with pre-dumping. In my tests
> > it fails and I wanted to ask if you tested it recently?
> 
> Not really. I've done most of the recent testing with
> zdtm.py --remote-lazy-pages
>  
> > (46.566404) polling timeout
> > (46.566415) lazy-pages: Start handling remaining pages
> > (46.566417) pr2 Read b0e000 1 pages
> > (46.580431) lazy-pages: 1119: uffd_copy: 0xb0e000/4096
> > (46.580447) Error (criu/uffd.c:529): lazy-pages: 1119: UFFDIO_COPY failed: rc:-1 copy:-17
> > (46.580452) polling timeout
> > (46.580454) pr2 Read dba000 1 pages
> > 
> > Or does this work as designed, since the page is already in the parent
> > image? The error message just sounds a bit scary.
> >
> > The restore does not work, so something seems wrong. The lazy-pages
> > daemon ends with 0 and:
> 
> If you are trying to migrate multithreaded application, you probably hit
> the race between completion of page faults for different threads at the
> same address.
> If you see "no iovs found, zero pages" in the dump log, then something went
> wrong.

The application is not multithreaded as far as I can tell.

-17 is -EEXIST, which is the only error after which the uffd code keeps
running. It is not clear to me why the kernel returns EEXIST here.
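
As a quick sanity check on the errno value, here is a minimal Python
sketch that decodes the "copy:-17" field from the log line above. It
assumes a Linux host, where errno 17 is EEXIST. The helper name
decode_uffd_copy is made up for illustration and is not part of CRIU.

```python
import errno
import os


def decode_uffd_copy(copy_field):
    """Map a negative uffdio_copy.copy value to (errno name, message).

    Hypothetical helper for illustration only; not part of CRIU.
    """
    code = -copy_field
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Value taken from the "UFFDIO_COPY failed: rc:-1 copy:-17" log line.
name, message = decode_uffd_copy(-17)
print(name, "-", message)
```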

Taking the failing page 0xb0e000/4096 from the error above:

In the pagemaps of the parent (pre-dump) and of the actual dump,
everything looks correct. Pages are marked as PARENT and as LAZY, as
they should be. In the initial dump (parent) the page is covered by

        {
            "vaddr": "0xa00000", 
            "nr_pages": 510, 
            "flags": "PE_LAZY | PE_PRESENT"
        }, 

and in the second dump (using --prev-images-dir and --track-mem) I see:

        {
            "vaddr": "0xb0e000", 
            "nr_pages": 1, 
            "flags": "PE_LAZY"
        }, 
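
To double-check how the two entries relate, the following sketch
verifies that the single page at 0xb0e000 lies inside the 510-page
region recorded in the parent image. It assumes 4096-byte pages, as
suggested by the "/4096" in the uffd_copy log line. The helper names
entry_range and covers are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: matches the "/4096" in the uffd_copy log line


def entry_range(entry):
    """Return the [start, end) address range of a pagemap entry."""
    start = int(entry["vaddr"], 16)
    return start, start + entry["nr_pages"] * PAGE_SIZE


def covers(outer, inner):
    """True if the range of `inner` lies completely inside `outer`."""
    o_start, o_end = entry_range(outer)
    i_start, i_end = entry_range(inner)
    return o_start <= i_start and i_end <= o_end


# Entries copied from the two pagemap excerpts above.
parent = {"vaddr": "0xa00000", "nr_pages": 510, "flags": "PE_LAZY | PE_PRESENT"}
child = {"vaddr": "0xb0e000", "nr_pages": 1, "flags": "PE_LAZY"}

start, end = entry_range(parent)
print(hex(start), hex(end))   # boundaries of the parent region
print(covers(parent, child))  # is the lazy page inside the parent region?
```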

Both pagemap entries look correct. During restore the page is also
correctly marked as being handled by uffd:

pie: 24687: Remap 0x7f67a8ba3000->0x8f7000 len 0x1be8000
pie: 24687: lazy-pages: uffdio_register.range.start 0x0x8f7000
pie: 24687: lazy-pages: uffdio_register.len 0x0x1be8000
pie: 24687: lazy-pages: ioctl UFFDIO_REGISTER rc 0
pie: 24687: lazy-pages: uffdio_register.range.start 0x0x8f7000
pie: 24687: lazy-pages: uffdio_register.len 0x0x1be8000

So the page is correctly marked as being handled by userfaultfd.
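
The same check can be done against the registered range. The sketch
below parses the two uffdio_register log lines and tests whether
0xb0e000 falls inside the range. It assumes that the doubled "0x0x"
prefix is only a formatting quirk of the log and that the values are
0x8f7000 and 0x1be8000. The helper parse_hex_field is hypothetical.

```python
import re

# Log lines copied from the restore output above.
LOG = """\
pie: 24687: lazy-pages: uffdio_register.range.start 0x0x8f7000
pie: 24687: lazy-pages: uffdio_register.len 0x0x1be8000
"""


def parse_hex_field(log, field):
    """Extract the hex value logged after `field`, tolerating a doubled "0x"."""
    match = re.search(re.escape(field) + r"\s+(?:0x)+([0-9a-fA-F]+)", log)
    if match is None:
        raise ValueError(f"{field} not found in log")
    return int(match.group(1), 16)


start = parse_hex_field(LOG, "uffdio_register.range.start")
length = parse_hex_field(LOG, "uffdio_register.len")
faulting = 0xb0e000  # page from the failing UFFDIO_COPY

print(hex(start), hex(start + length))      # registered [start, end)
print(start <= faulting < start + length)   # is the page inside the range?
```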

The lazy restore works if pre-dumping is not involved. But it sounds
like you are not testing the pre-copy and post-copy combination, so
something might be broken. I need to look closer into this.

Let me know if you have any ideas what might be wrong.

		Adrian

