[CRIU] [PATCH] Try to include userfaultfd with criu

Wed Nov 25 07:26:59 PST 2015

On Thu, Nov 19, 2015 at 04:00:35PM +0300, Pavel Emelyanov wrote:
> On 11/16/2015 05:31 PM, Adrian Reber wrote:
> > From: Adrian Reber <areber at redhat.com>
> > 
> > This is a first try to include userfaultfd with criu. Right now it
> > still requires a "normal" checkpoint. After checkpointing the application
> > it can be restored with the help of userfaultfd.
> > 
> > All restored pages with MAP_ANONYMOUS set are marked as being handled by
> > userfaultfd and also madvise()'d as MADV_DONTNEED (still need to
> > understand why MADV_DONTNEED is necessary).
> > 
> > As soon as the process is restored it blocks on the first memory access
> > and waits for pages being transferred by userfaultfd.
> > 
> > To handle the required pages a new criu command has been added. The restore
> > works now like this:
> > 
> >   criu restore -D /tmp/3 -j -v4 --lazy-pages
> > 
> > This hangs after the restored process is running and needs:
> > 
> >   criu uffd -v4 -D /tmp/3/
> 
> How about naming this action "lazy-pages"?

No problem.

> > This waits on a UFFD FD (/tmp/userfault.socket) which has been passed by
> > the 'criu restore' process over unix domain sockets for UFFD requests.
> > For my current test program the following pages are transmitted over UFFD:
> > 
> >  uffdio_copy.dst 0x7ffdeaff9000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7fb845e88000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7ffdeafa6000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7fb845e95000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7fb845e92000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7fb845c70000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x7fb845c6d000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> >  uffdio_copy.dst 0x1790000
> >  ioctl UFFDIO_COPY rc 0x0
> >  uffdio_copy.copy 0x1000
> > 
> > The use case of using userfaultfd with a checkpointed process on a remote
> > machine will probably benefit from the current work related to
> > image-cache and image-proxy.
> > 
> > For the final implementation it would be nice to have a restore running
> > in uffd mode on one system which requests the memory pages over the
> > network from another system which is running 'criu checkpoint' also in
> > uffd mode. This way the pages need to be copied only 'once' from the
> > checkpoint process to the uffd restore process.
> 
> Yup, we will also need something like the --lazy-pages option for dump that
> would avoid dumping memory, but will instead set up the sender to feed
> pages to the restore side.

Yes, the whole lazy-pages setup needs lots of infrastructure to make it
really useful.

> > TODO:
> >   * What happens with pages which have not been requested via uffd
> >     during a certain timeframe? How can pages be forced into the
> >     restored process?
> 
> AFAIU you can forcibly call uffd COPY ioctl to push the page into the
> address space.

David Alan Gilbert (in CC), who is implementing the QEMU/KVM side of post-copy
migration, wrote to me about how they are doing it:

 What we do is that we keep sending unrequested pages all the time, but
 when we get a request for a page from UFF then we send that page immediately.
 That way very few of the pages actually end up getting a fault, because
 they're already transferred.  We also tend to only switch into the mode
 using the UFF after we've sent one full copy of the RAM across; that way
 it's only pages that are changing that are likely to need to get
 faulted.

 We also play a trick that when we get a fault and send the requested
 page, we start sending the pages directly after the requested page, on the
 assumption that the code that's running might want other pages near it.
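
To make sure I understand that ordering, here is a rough sketch of it in C.
This is not QEMU's code and not part of the patch; the struct, the function
names, the fixed NPAGES and the wrap-around at the end of the region are all
assumptions made for illustration. A background sweep sends every page once,
and a fault moves the sweep to the faulted page so its neighbours follow it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 8 /* hypothetical size of the region, in pages */

struct sender {
	bool sent[NPAGES]; /* pages already transferred */
	size_t cursor;     /* where the background sweep continues */
};

static void sender_init(struct sender *s)
{
	memset(s, 0, sizeof(*s));
}

/* A fault arrived: continue the sweep from the faulted page. */
static void sender_fault(struct sender *s, size_t page)
{
	assert(page < NPAGES);
	s->cursor = page;
}

/*
 * Pick the next page to send: the first unsent page at or after the
 * cursor, wrapping around. Returns -1 once every page has been sent.
 */
static long sender_next(struct sender *s)
{
	for (size_t i = 0; i < NPAGES; i++) {
		size_t p = (s->cursor + i) % NPAGES;

		if (!s->sent[p]) {
			s->sent[p] = true;
			s->cursor = (p + 1) % NPAGES;
			return (long)p;
		}
	}
	return -1;
}
```

With this, a fault on page 5 after pages 0 and 1 went out gives the order
5, 6, 7 and then back to 2, 3, 4.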

> >   * Contains still many debug outputs which need to be cleaned up.
> 
> :) Yes, logging in CRIU is quite messy.
> 
> > v2:
> >     * provide option '--lazy-pages' to enable uffd style restore
> >     * use send_fd()/recv_fd() provided by criu (instead of own
> >       implementation)
> >     * do not install the uffd as service_fd
> >     * use named constants for MAP_ANONYMOUS
> >     * do not restore memory pages and then later mark them as uffd
> >       handled
> >     * remove function find_pages() to search in pages-<id>.img;
> >       now using criu functions to find the necessary pages;
> >       for each new page search the pages-<id>.img file is opened
> >     * only check the UFFDIO_API once
> >     * trying to protect uffd code by CONFIG_UFFD;
> >       use make UFFD=1 to compile criu with this patch
> > 
> > Signed-off-by: Adrian Reber <areber at redhat.com>
> > ---
> >  Makefile             |   4 +
> >  Makefile.config      |   3 +
> >  Makefile.crtools     |   3 +
> >  cr-restore.c         | 150 ++++++++++++++++++++++++++++++++++-
> >  crtools.c            |  20 +++++
> >  include/cr_options.h |   1 +
> >  include/crtools.h    |   2 +
> >  include/page-read.h  |   2 +
> >  include/restorer.h   |   2 +
> >  include/uffd.h       |  18 +++++
> >  include/util-pie.h   |   2 +-
> >  page-read.c          |  13 ++++
> >  pie/restorer.c       |  77 +++++++++++++++++-
> >  uffd.c               | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  14 files changed, 508 insertions(+), 5 deletions(-)
> >  create mode 100644 include/uffd.h
> >  create mode 100644 uffd.c
> > 
> 
> > diff --git a/cr-restore.c b/cr-restore.c
> > index c132588..43d4ce0 100644
> > --- a/cr-restore.c
> > +++ b/cr-restore.c
> > @@ -19,6 +19,7 @@
> >  #include <sys/shm.h>
> >  #include <sys/mount.h>
> >  #include <sys/prctl.h>
> > +#include <sys/syscall.h>
> >  
> >  #include <sched.h>
> >  
> > @@ -78,6 +79,8 @@
> >  #include "seccomp.h"
> >  #include "bitmap.h"
> >  #include "fault-injection.h"
> > +#include "uffd.h"
> > +
> >  #include "parasite-syscall.h"
> >  
> >  #include "protobuf.h"
> > @@ -463,6 +466,16 @@ static int restore_priv_vma_content(void)
> >  			p = decode_pointer((off) * PAGE_SIZE +
> >  					vma->premmaped_addr);
> >  
> > +			/*
> > +			 * This means that userfaultfd is used to load the pages
> > +			 * on demand.
> > +			 */
> > +			if (opts.lazy_pages && (vma->e->flags & MAP_ANONYMOUS)) {
> 
> Should the check for the region being non-shared also be here? AFAIR uffd still
> doesn't work with tmpfs (i.e. -- anon shared) regions.

Actually, I don't know. But I will add it for the next version of the
patch.
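
A rough sketch of what the extended check might look like, assuming the
flags field carries the usual mmap() flags. The helper name is made up for
illustration and is not in the patch:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: only anonymous mappings that are not shared are
 * candidates for lazy restore via userfaultfd.
 */
static bool vma_wants_lazy_restore(unsigned int flags)
{
	return (flags & MAP_ANONYMOUS) && !(flags & MAP_SHARED);
}
```

The condition in restore_priv_vma_content() would then become something
like "opts.lazy_pages && vma_wants_lazy_restore(vma->e->flags)".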

> > +				pr_debug("Lazy restore skips %lx\n", vma->e->start);
> > +				pr.skip_pages(&pr, PAGE_SIZE);
> > +				continue;
> > +			}
> > +
> >  			set_bit(off, vma->page_bitmap);
> >  			if (vma->ppage_bitmap) { /* inherited vma */
> >  				clear_bit(off, vma->ppage_bitmap);
> 
> > @@ -2980,6 +3112,22 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
> >  
> >  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
> >  
> > +	if (!opts.lazy_pages)
> > +		task_args->uffd = -1;
> > +
> > +#ifdef CONFIG_UFFD
> > +	/*
> > +	 * Open userfaultfd FD which is passed to the restorer blob and
> > +	 * to a second process handling the userfaultfd page faults.
> > +	 */
> > +	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > +	pr_info("uffd %d\n", task_args->uffd);
> > +
> > +	if (send_uffd(task_args->uffd) < 0) {
> 
> Inside this call you create socket, make it listen(), then accept(). Why?
> Shouldn't we instead do it opposite -- connect to the lazy-pages server
> and send it uffd?

Yes, that would be more logical. Will also change this. And/or the
lazy-pages server could do the complete uffd setup and the restore
process will only retrieve the already completely set up uffd. That way
most of the uffd set up logic could stay in one single place. Does that
make sense?
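
Whichever side ends up creating the uffd, the hand-off itself is the usual
SCM_RIGHTS exchange over a connected UNIX socket. A minimal sketch of that
exchange follows; the sketch_* helpers are hypothetical stand-ins written
for this mail, not criu's send_fd()/recv_fd():

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: pass one descriptor over a connected UNIX socket. */
static int sketch_send_fd(int sock, int fd)
{
	char byte = 0; /* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor, or return -1 on error. */
static int sketch_recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The descriptor number on the receiving side will usually differ from the
sender's, but both refer to the same open file description.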

> > +		close(task_args->uffd);
> > +		goto err;
> > +	}
> > +#endif
> >  
> >  	/*
> >  	 * Fill up per-thread data.
> 
> > diff --git a/pie/restorer.c b/pie/restorer.c
> > index 26494f9..29b14df 100644
> > --- a/pie/restorer.c
> > +++ b/pie/restorer.c
> > @@ -867,6 +922,22 @@ long __export_restore_task(struct task_restore_args *args)
> >  
> >  	pr_info("Switched to the restorer %d\n", my_pid);
> >  
> > +	if (args->uffd > -1) {
> > +		pr_info("logfd %d\n", args->logfd);
> > +		pr_info("uffd %d\n", args->uffd);
> > +
> > +		uffd_flags = sys_fcntl(args->uffd, F_GETFD, 0);
> > +		pr_info("uffd_flags %d\n", uffd_flags);
> > +		pr_info("UFFD_API 0x%llx\n", UFFD_API);
> > +		uffdio_api.api = UFFD_API;
> > +		uffdio_api.features = 0;
> > +		rc = sys_ioctl(args->uffd, UFFDIO_API, &uffdio_api);
> > +		pr_info("ioctl UFFDIO_API rc %d\n", rc);
> 
> I guess we need to call UFFDIO_API before sending the uffd descriptor to the server,
> so that it can start to do whatever it requires with it.

If we get a completely set up uffd from the lazy-pages server this could
be removed from here.
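
A sketch of what "completely set up" could mean on the creating side, using
plain libc calls rather than the restorer's sys_*() wrappers. The helper name
is hypothetical, and this assumes kernel headers that provide
<linux/userfaultfd.h> and __NR_userfaultfd:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: create a userfaultfd and finish the UFFDIO_API
 * handshake, so that the descriptor is ready for use when it is handed
 * over. Returns the descriptor, or -1 with errno set.
 */
static int uffd_open_and_handshake(void)
{
	struct uffdio_api api;
	int fd, saved;

	fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC);
	if (fd < 0)
		return -1;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = 0; /* request no optional features */
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}

	return fd;
}
```

On kernels without userfaultfd, or where unprivileged use is restricted, the
helper just returns -1 and the caller would have to fall back to a normal
restore.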

> > +		pr_info("uffdio_api.api 0x%llx\n", uffdio_api.api);
> > +		pr_info("uffdio_api.features 0x%llx\n", uffdio_api.features);
> > +	}
> > +
> > +
> >  	if (vdso_do_park(&args->vdso_sym_rt, args->vdso_rt_parked_at, vdso_rt_size))
> >  		goto core_restore_end;
> >  
> > @@ -888,7 +959,7 @@ long __export_restore_task(struct task_restore_args *args)
> >  			break;
> >  
> >  		if (vma_remap(vma_premmaped_start(vma_entry),
> > -				vma_entry->start, vma_entry_len(vma_entry)))
> > +				vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
> >  			goto core_restore_end;
> >  	}
> >  
> > @@ -906,7 +977,7 @@ long __export_restore_task(struct task_restore_args *args)
> >  			break;
> >  
> >  		if (vma_remap(vma_premmaped_start(vma_entry),
> > -				vma_entry->start, vma_entry_len(vma_entry)))
> > +				vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
> >  			goto core_restore_end;
> >  	}
> >  
> 
> -- Pavel

		Adrian