[CRIU] [PATCH] Try to include userfaultfd with criu

Dr. David Alan Gilbert dgilbert at redhat.com
Thu Nov 26 06:54:00 PST 2015


* Pavel Emelyanov (xemul at parallels.com) wrote:
> >>> TODO:
> >>>   * What happens with pages which have not been requested via uffd
> >>>     during a certain timeframe. How can pages be forced into the
> >>>     restored process?
> >>
> >> AFAIU you can forcibly call uffd COPY ioctl to push the page into the
> >> address space.
> > 
> > David Alan Gilbert (in CC), who is implementing the QEMU/KVM side of post-copy
> > migration, told (wrote) me how they are doing it:
> > 
> >  What we do is that we keep sending unrequested pages all the time, but
> >  when we get a request for a page from UFF then we send that page immediately.
> >  That way very few of the pages actually end up getting a fault, because
> >  they're already transferred.  We also tend to only switch into the mode
> >  using the UFF after we've sent one full copy of the RAM across; that way
> >  it's only pages that are changing that are likely to need to get
> >  faulted.
> 
> Interesting. But in that case the pages that get faulted are the
> frequently accessed ones, while the pages that are transferred outside
> uffd are likely the least accessed (or almost never accessed) pages.
> 
> Wouldn't it be better to transfer the most frequently accessed pages in
> advance, then migrate, then pull the least accessed pages via uffd?

Of course it's all heuristics; if you have some insight into which pages are
likely to be accessed on a hot path then that's a good idea.
Similarly, if you know pages change frequently then there's no point sending
them before you switch into postcopy mode: you'd likely have to discard the
sent copy because it will have already changed.

> >  We also play a trick that when we get a fault and send the requested
> >  page, we start sending the pages directly after the requested page, on the
> >  assumption that the code that's running might want other pages near it.
> 
> Nice trick indeed :)

It was easy :-) I've seen people try more complex things (like expanding bubbles
on either side of the request) to try to make better predictions, but it's
hard to know what will help and it's a lot more complex.

Now since you're running userspace code you might have some more clues:
things like libc data structures, pages with locks in them, etc.  There's
potential to do loads of fun things - e.g. if you knew the user's data types
you could follow pointers and predict which pages would be needed next.

Dave

> >>> +				pr_debug("Lazy restore skips %lx\n", vma->e->start);
> >>> +				pr.skip_pages(&pr, PAGE_SIZE);
> >>> +				continue;
> >>> +			}
> >>> +
> >>>  			set_bit(off, vma->page_bitmap);
> >>>  			if (vma->ppage_bitmap) { /* inherited vma */
> >>>  				clear_bit(off, vma->ppage_bitmap);
> >>
> >>> @@ -2980,6 +3112,22 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
> >>>  
> >>>  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
> >>>  
> >>> +	if (!opts.lazy_pages)
> >>> +		task_args->uffd = -1;
> >>> +
> >>> +#ifdef CONFIG_UFFD
> >>> +	/*
> >>> +	 * Open userfaultfd FD which is passed to the restorer blob and
> >>> +	 * to a second process handling the userfaultfd page faults.
> >>> +	 */
> >>> +	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> >>> +	pr_info("uffd %d\n", task_args->uffd);
> >>> +
> >>> +	if (send_uffd(task_args->uffd) < 0) {
> >>
> >> Inside this call you create socket, make it listen(), then accept(). Why?
> >> Shouldn't we instead do it opposite -- connect to the lazy-pages server
> >> and send it uffd?
> > 
> > Yes, that would be more logical. Will also change this. Or/And the
> > lazy-pages server could do the complete uffd setup and the restore
> > process will only retrieve the already completely set up uffd. That way
> > most of the uffd set up logic could stay in one single place. Does that
> > make sense?
> 
> But uffd server doesn't know how many uffd-s will be required. Also
> it will have to set up the map between uffd-s and pid-s. Both these
> make me think that tasks setting up uffd-s, connecting to uffd server
> and sending it the fds is the way to go.
> 
> -- Pavel
--
Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK
