[CRIU] [PATCH] Try to include userfaultfd with criu

Pavel Emelyanov xemul at parallels.com
Thu Nov 26 05:41:32 PST 2015


>>> TODO:
>>>   * What happens with pages which have not been requested via uffd
>>>     during a certain timeframe? How can pages be forced into the
>>>     restored process?
>>
>> AFAIU you can forcibly call the uffd COPY ioctl to push the page into
>> the address space.
> 
> David Alan Gilbert (in CC), who is implementing the QEMU/KVM side of post-copy
> migration, told (wrote) me how they are doing it:
> 
>  What we do is that we keep sending unrequested pages all the time, but
>  when we get a request for a page from UFF then we send that page immediately.
>  That way very few of the pages actually end up getting a fault, because
>  they're already transferred.  We also tend to only switch into the mode
>  using the UFF after we've sent one full copy of the RAM across; that way
>  it's only pages that are changing that are likely to need to get
>  faulted.

Interesting. But in that case the pages that get faulted are the
frequently accessed ones, while the pages that are not transferred via
uffd are likely the least accessed (or never accessed) pages.

Wouldn't it be better to transfer the most frequently accessed pages in
advance, then migrate, then pull the least accessed pages via uffd?
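
FWIW, pushing a page into the restored task without waiting for a fault
is just one UFFDIO_COPY ioctl. A minimal sketch (push_page is a made-up
name for the example; it assumes the uffd was created in the target task
and the range is already registered with UFFDIO_REGISTER_MODE_MISSING):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <errno.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE	4096	/* assumption for the sketch */
#endif

/* Force one page into the restored address space. */
static int push_page(int uffd, unsigned long dst, void *src)
{
	struct uffdio_copy uc = {
		.dst	= dst,
		.src	= (unsigned long)src,
		.len	= PAGE_SIZE,
		.mode	= 0,	/* copy and wake anyone waiting on the range */
	};

	if (ioctl(uffd, UFFDIO_COPY, &uc) < 0) {
		/* the page may already be there, e.g. due to a racing copy */
		if (errno == EEXIST)
			return 0;
		return -1;
	}

	return 0;
}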

>  We also play a trick that when we get a fault and send the requested
>  page, we start sending the pages directly after the requested page, on the
>  assumption that the code that's running might want other pages near it.

Nice trick indeed :)
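
On the serving side that could look roughly like the fragment below:
read the fault event, fill the requested page first, then its
neighbours. handle_fault(), get_page_data() and NR_READAHEAD are
made-up names for the sketch (push_page() is the helper from the
previous snippet), not anything we have in the tree:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define NR_READAHEAD	16	/* arbitrary read-ahead window */

/* hypothetical: fetch page contents from the images / page server */
extern void *get_page_data(unsigned long addr);

static int handle_fault(int uffd)
{
	struct uffd_msg msg;
	unsigned long addr;
	int i;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		return -1;

	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;

	addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);

	/* the requested page goes out immediately... */
	if (push_page(uffd, addr, get_page_data(addr)))
		return -1;

	/* ...followed by the pages right after it */
	for (i = 1; i < NR_READAHEAD; i++) {
		unsigned long next = addr + i * PAGE_SIZE;

		if (push_page(uffd, next, get_page_data(next)))
			break;
	}

	return 0;
}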

>>> +				pr_debug("Lazy restore skips %lx\n", vma->e->start);
>>> +				pr.skip_pages(&pr, PAGE_SIZE);
>>> +				continue;
>>> +			}
>>> +
>>>  			set_bit(off, vma->page_bitmap);
>>>  			if (vma->ppage_bitmap) { /* inherited vma */
>>>  				clear_bit(off, vma->ppage_bitmap);
>>
>>> @@ -2980,6 +3112,22 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>>>  
>>>  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>>>  
>>> +	if (!opts.lazy_pages)
>>> +		task_args->uffd = -1;
>>> +
>>> +#ifdef CONFIG_UFFD
>>> +	/*
>>> +	 * Open userfaultfd FD which is passed to the restorer blob and
>>> +	 * to a second process handling the userfaultfd page faults.
>>> +	 */
>>> +	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
>>> +	pr_info("uffd %d\n", task_args->uffd);
>>> +
>>> +	if (send_uffd(task_args->uffd) < 0) {
>>
>> Inside this call you create a socket, make it listen(), then accept(). Why?
>> Shouldn't we instead do the opposite -- connect to the lazy-pages server
>> and send it the uffd?
> 
> Yes, that would be more logical. Will also change this. Or/and the
> lazy-pages server could do the complete uffd setup and the restore
> process would only retrieve the already fully set up uffd. That way
> most of the uffd setup logic could stay in a single place. Does that
> make sense?

But the uffd server doesn't know how many uffd-s will be required. It
would also have to set up the mapping between uffd-s and pids. Both of
these make me think that having the tasks set up their uffd-s, connect
to the uffd server and send it the fds is the way to go.
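
I.e., schematically, each task would do something like the below (just
a sketch -- the socket name and the helper name are made up, and it is
not what the current send_uffd() does):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>
#include <string.h>

/*
 * Connect to the lazy-pages server and hand over our uffd together
 * with our pid, so that the server can build the uffd <-> pid map.
 */
static int send_uffd_to_server(int uffd, pid_t pid)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	struct iovec iov = { .iov_base = &pid, .iov_len = sizeof(pid) };
	char cbuf[CMSG_SPACE(sizeof(int))] = { };
	struct msghdr msg = { };
	struct cmsghdr *cmsg;
	int sk, ret = -1;

	strncpy(addr.sun_path, "lazy-pages.socket",
		sizeof(addr.sun_path) - 1);

	sk = socket(AF_UNIX, SOCK_SEQPACKET, 0);
	if (sk < 0)
		return -1;

	if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));

	if (sendmsg(sk, &msg, 0) < 0)
		goto out;

	ret = 0;
out:
	close(sk);
	return ret;
}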

-- Pavel

