[CRIU] [PATCH] Try to include userfaultfd with criu

Pavel Emelyanov xemul at parallels.com
Thu Nov 19 05:00:35 PST 2015


On 11/16/2015 05:31 PM, Adrian Reber wrote:
> From: Adrian Reber <areber at redhat.com>
> 
> This is a first try to include userfaultfd with criu. Right now it
> still requires a "normal" checkpoint. After checkpointing the application
> it can be restored with the help of userfaultfd.
> 
> All restored pages with MAP_ANONYMOUS set are marked as being handled by
> userfaultfd and also madvise()'d with MADV_DONTNEED (I still need to
> understand why MADV_DONTNEED is necessary).
> 
> As soon as the process is restored, it blocks on the first memory access
> and waits for the pages to be transferred via userfaultfd.
> 
> To handle the required pages a new criu command has been added. The restore
> now works like this:
> 
>   criu restore -D /tmp/3 -j -v4 --lazy-pages
> 
> This blocks once the restored process is running and additionally needs:
> 
>   criu uffd -v4 -D /tmp/3/

How about naming this action "lazy-pages"?
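
Btw, whatever it's called, the serving side is basically the standard
userfaultfd event loop: read fault messages from the uffd and answer them.
Roughly (a sketch only; handle_faults() is an illustration, not the patch's
code):

#include <linux/userfaultfd.h>
#include <unistd.h>

/* Very rough shape of the lazy-pages daemon once it holds the uffd. */
static int handle_faults(int uffd)
{
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;
		/*
		 * msg.arg.pagefault.address is the faulting address; look
		 * the page up in pages-<id>.img and push it in with
		 * UFFDIO_COPY (see the COPY sketch further down).
		 */
	}
	return 0;
}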

> This waits for UFFD requests on the userfault FD, which has been passed by
> the 'criu restore' process over a unix domain socket (/tmp/userfault.socket).
> For my current test program the following pages are transmitted over UFFD:
> 
>  uffdio_copy.dst 0x7ffdeaff9000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7fb845e88000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7ffdeafa6000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7fb845e95000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7fb845e92000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7fb845c70000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x7fb845c6d000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
>  uffdio_copy.dst 0x1790000
>  ioctl UFFDIO_COPY rc 0x0
>  uffdio_copy.copy 0x1000
> 
> The use case of using userfaultfd with a checkpointed process on a remote
> machine will probably benefit from the current work related to
> image-cache and image-proxy.
> 
> For the final implementation it would be nice to have a restore running
> in uffd mode on one system, requesting the memory pages over the
> network from another system which is running 'criu checkpoint', also in
> uffd mode. This way the pages need to be copied only 'once', from the
> checkpoint process to the uffd restore process.

Yup, we will also need something like a --lazy-pages option for dump that
would avoid dumping memory, but would instead set up the sender to feed
pages to the restore side.

> TODO:
>   * What happens with pages which have not been requested via uffd
>     within a certain timeframe? How can pages be forced into the
>     restored process?

AFAIU you can forcibly call the uffd COPY ioctl to push the page into the
address space.
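
Something like this (a minimal sketch of the COPY call itself;
uffd_copy_page() is only illustrative):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int uffd_copy_page(int uffd, unsigned long dst,
			  void *page, unsigned long page_size)
{
	struct uffdio_copy cpy = {
		.dst	= dst,			/* page-aligned target address */
		.src	= (unsigned long)page,	/* page contents in our memory */
		.len	= page_size,
		.mode	= 0,
	};

	if (ioctl(uffd, UFFDIO_COPY, &cpy))
		return -1;

	/* cpy.copy reports how many bytes actually got copied */
	return 0;
}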

>   * Still contains many debug outputs which need to be cleaned up.

:) Yes, logging in CRIU is quite messy.

> v2:
>     * provide option '--lazy-pages' to enable uffd style restore
>     * use send_fd()/recv_fd() provided by criu (instead of own
>       implementation)
>     * do not install the uffd as service_fd
>     * use named constants for MAP_ANONYMOUS
>     * do not restore memory pages and then later mark them as uffd
>       handled
>     * remove function find_pages() to search in pages-<id>.img;
>       now using criu functions to find the necessary pages;
>       for each new page search, the pages-<id>.img file is opened
>     * only check the UFFDIO_API once
>     * trying to protect uffd code by CONFIG_UFFD;
>       use make UFFD=1 to compile criu with this patch
> 
> Signed-off-by: Adrian Reber <areber at redhat.com>
> ---
>  Makefile             |   4 +
>  Makefile.config      |   3 +
>  Makefile.crtools     |   3 +
>  cr-restore.c         | 150 ++++++++++++++++++++++++++++++++++-
>  crtools.c            |  20 +++++
>  include/cr_options.h |   1 +
>  include/crtools.h    |   2 +
>  include/page-read.h  |   2 +
>  include/restorer.h   |   2 +
>  include/uffd.h       |  18 +++++
>  include/util-pie.h   |   2 +-
>  page-read.c          |  13 ++++
>  pie/restorer.c       |  77 +++++++++++++++++-
>  uffd.c               | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  14 files changed, 508 insertions(+), 5 deletions(-)
>  create mode 100644 include/uffd.h
>  create mode 100644 uffd.c
> 

> diff --git a/cr-restore.c b/cr-restore.c
> index c132588..43d4ce0 100644
> --- a/cr-restore.c
> +++ b/cr-restore.c
> @@ -19,6 +19,7 @@
>  #include <sys/shm.h>
>  #include <sys/mount.h>
>  #include <sys/prctl.h>
> +#include <sys/syscall.h>
>  
>  #include <sched.h>
>  
> @@ -78,6 +79,8 @@
>  #include "seccomp.h"
>  #include "bitmap.h"
>  #include "fault-injection.h"
> +#include "uffd.h"
> +
>  #include "parasite-syscall.h"
>  
>  #include "protobuf.h"
> @@ -463,6 +466,16 @@ static int restore_priv_vma_content(void)
>  			p = decode_pointer((off) * PAGE_SIZE +
>  					vma->premmaped_addr);
>  
> +			/*
> +			 * This means that userfaultfd is used to load the pages
> +			 * on demand.
> +			 */
> +			if (opts.lazy_pages && (vma->e->flags & MAP_ANONYMOUS)) {

Should there also be a check here for the region being non-shared? AFAIR uffd
still doesn't work with tmpfs (i.e. anon shared) regions.
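
I.e. something along these lines, assuming MAP_SHARED lands in vma->e->flags
the same way MAP_ANONYMOUS does (only sketching the extra condition):

	if (opts.lazy_pages && (vma->e->flags & MAP_ANONYMOUS) &&
	    !(vma->e->flags & MAP_SHARED)) {
		pr_debug("Lazy restore skips %lx\n", vma->e->start);
		pr.skip_pages(&pr, PAGE_SIZE);
		continue;
	}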

> +				pr_debug("Lazy restore skips %lx\n", vma->e->start);
> +				pr.skip_pages(&pr, PAGE_SIZE);
> +				continue;
> +			}
> +
>  			set_bit(off, vma->page_bitmap);
>  			if (vma->ppage_bitmap) { /* inherited vma */
>  				clear_bit(off, vma->ppage_bitmap);

> @@ -2980,6 +3112,22 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>  
>  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>  
> +	if (!opts.lazy_pages)
> +		task_args->uffd = -1;
> +
> +#ifdef CONFIG_UFFD
> +	/*
> +	 * Open userfaultfd FD which is passed to the restorer blob and
> +	 * to a second process handling the userfaultfd page faults.
> +	 */
> +	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> +	pr_info("uffd %d\n", task_args->uffd);
> +
> +	if (send_uffd(task_args->uffd) < 0) {

Inside this call you create a socket, make it listen(), then accept(). Why?
Shouldn't we instead do it the other way around -- connect to the lazy-pages
server and send it the uffd?
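
I.e. roughly this on the restore side (a sketch with raw sendmsg()/SCM_RIGHTS
just to show the direction; criu's send_fd() already wraps the ancillary-data
part, and send_uffd_to_server() plus the socket path are illustrative only):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the already-listening lazy-pages server and hand it the uffd. */
static int send_uffd_to_server(int uffd, const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	char dummy = 0, cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr mh = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *c;
	int sk;

	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

	sk = socket(AF_UNIX, SOCK_STREAM, 0);
	if (sk < 0)
		return -1;
	if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto err;

	/* the uffd travels as SCM_RIGHTS ancillary data */
	c = CMSG_FIRSTHDR(&mh);
	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &uffd, sizeof(int));

	if (sendmsg(sk, &mh, 0) < 0)
		goto err;

	close(sk);
	return 0;
err:
	close(sk);
	return -1;
}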

> +		close(task_args->uffd);
> +		goto err;
> +	}
> +#endif
>  
>  	/*
>  	 * Fill up per-thread data.

> diff --git a/pie/restorer.c b/pie/restorer.c
> index 26494f9..29b14df 100644
> --- a/pie/restorer.c
> +++ b/pie/restorer.c
> @@ -867,6 +922,22 @@ long __export_restore_task(struct task_restore_args *args)
>  
>  	pr_info("Switched to the restorer %d\n", my_pid);
>  
> +	if (args->uffd > -1) {
> +		pr_info("logfd %d\n", args->logfd);
> +		pr_info("uffd %d\n", args->uffd);
> +
> +		uffd_flags = sys_fcntl(args->uffd, F_GETFD, 0);
> +		pr_info("uffd_flags %d\n", uffd_flags);
> +		pr_info("UFFD_API 0x%llx\n", UFFD_API);
> +		uffdio_api.api = UFFD_API;
> +		uffdio_api.features = 0;
> +		rc = sys_ioctl(args->uffd, UFFDIO_API, &uffdio_api);
> +		pr_info("ioctl UFFDIO_API rc %d\n", rc);

I guess we need to call UFFDIO_API before sending the uffd descriptor to the
server, so that the server can start doing whatever it requires with it.
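
I.e. do the handshake in sigreturn_restore(), right after the userfaultfd()
syscall and before the fd is handed away -- roughly (only a sketch of the
ordering, reusing the names from your hunk):

	struct uffdio_api api = { .api = UFFD_API, .features = 0 };

	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	if (task_args->uffd < 0)
		goto err;

	/* finish the API handshake before anybody else sees the fd */
	if (ioctl(task_args->uffd, UFFDIO_API, &api)) {
		close(task_args->uffd);
		goto err;
	}

	if (send_uffd(task_args->uffd) < 0) {
		close(task_args->uffd);
		goto err;
	}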

> +		pr_info("uffdio_api.api 0x%llx\n", uffdio_api.api);
> +		pr_info("uffdio_api.features 0x%llx\n", uffdio_api.features);
> +	}
> +
> +
>  	if (vdso_do_park(&args->vdso_sym_rt, args->vdso_rt_parked_at, vdso_rt_size))
>  		goto core_restore_end;
>  
> @@ -888,7 +959,7 @@ long __export_restore_task(struct task_restore_args *args)
>  			break;
>  
>  		if (vma_remap(vma_premmaped_start(vma_entry),
> -				vma_entry->start, vma_entry_len(vma_entry)))
> +				vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
>  			goto core_restore_end;
>  	}
>  
> @@ -906,7 +977,7 @@ long __export_restore_task(struct task_restore_args *args)
>  			break;
>  
>  		if (vma_remap(vma_premmaped_start(vma_entry),
> -				vma_entry->start, vma_entry_len(vma_entry)))
> +				vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
>  			goto core_restore_end;
>  	}
>  
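
Also, for the record, I read the vma_remap() change as doing, for anonymous
regions, the register-plus-madvise dance your description mentions -- roughly
like this (a sketch only, reusing the names from the call site and assuming a
sys_madvise() wrapper is available to the PIE code):

	if (uffd > -1 && (flags & MAP_ANONYMOUS)) {
		struct uffdio_register reg = {
			.range = {
				.start	= vma_entry->start,
				.len	= vma_entry_len(vma_entry),
			},
			.mode	= UFFDIO_REGISTER_MODE_MISSING,
		};

		if (sys_ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;
		/* drop any contents so the first touch faults into uffd */
		sys_madvise(vma_entry->start, vma_entry_len(vma_entry),
			    MADV_DONTNEED);
	}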

-- Pavel


