[CRIU] [PATCH] Try to include userfaultfd with criu
Pavel Emelyanov
xemul at parallels.com
Thu Nov 19 05:00:35 PST 2015
On 11/16/2015 05:31 PM, Adrian Reber wrote:
> From: Adrian Reber <areber at redhat.com>
>
> This is a first try to include userfaultfd with criu. Right now it
> still requires a "normal" checkpoint. After checkpointing the application
> it can be restored with the help of userfaultfd.
>
> All restored pages with MAP_ANONYMOUS set are marked as being handled by
> userfaultfd and also madvise()'d as MADV_DONTNEED (still need to
> understand why MADV_DONTNEED is necessary).
>
> As soon as the process is restored it blocks on the first memory access
> and waits for pages being transferred by userfaultfd.
>
> To handle the required pages a new criu command has been added. The restore
> works now like this:
>
> criu restore -D /tmp/3 -j -v4 --lazy-pages
>
> This hangs after the restored process is running and needs:
>
> criu uffd -v4 -D /tmp/3/
How about naming this action "lazy-pages"?
> This waits on a UFFD FD (/tmp/userfault.socket) which has been passed by
> the 'criu restore' process over unix domain sockets for UFFD requests.
> For my current test program following pages are transmitted over UFFD:
>
> uffdio_copy.dst 0x7ffdeaff9000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7fb845e88000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7ffdeafa6000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7fb845e95000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7fb845e92000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7fb845c70000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x7fb845c6d000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> uffdio_copy.dst 0x1790000
> ioctl UFFDIO_COPY rc 0x0
> uffdio_copy.copy 0x1000
>
> The use case to use userfaultfd with a checkpointed process on a remote
> machine will probably benefit from the current work related to
> image-cache and image-proxy.
>
> For the final implementation it would be nice to have a restore running
> in uffd mode on one system which requests the memory pages over the
> network from another system which is running 'criu checkpoint' also in
> uffd mode. This way the pages need to be copied only 'once' from the
> checkpoint process to the uffd restore process.
Yup, we will also need something like the --lazy-pages option for dump that
would avoid dumping memory and would instead set up the sender to feed
pages to the restore side.
> TODO:
> * What happens with pages which have not been requested via uffd
> during a certain timeframe. How can pages be forced into the
> restored process?
AFAIU you can forcibly call the UFFDIO_COPY ioctl to push the page into the
address space.
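
To illustrate the forced push, here is a minimal sketch, assuming the
lazy-pages side already holds the uffd and a buffer with the page contents.
The helper names fill_copy_request() and push_page() are hypothetical and not
part of the patch; only struct uffdio_copy and UFFDIO_COPY come from
<linux/userfaultfd.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: describe one page to copy into the restored task. */
static void fill_copy_request(struct uffdio_copy *req, uint64_t dst,
			      const void *src, uint64_t page_size)
{
	/* page_size must be a power of two so the mask below is valid. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	memset(req, 0, sizeof(*req));
	req->dst = dst & ~(page_size - 1);	/* dst must be page aligned */
	req->src = (uint64_t)(uintptr_t)src;	/* buffer in the caller's memory */
	req->len = page_size;
	req->mode = 0;				/* wake a faulting thread, if any */
}

/* Hypothetical helper: push a page without waiting for a fault on it. */
static int push_page(int uffd, uint64_t dst, const void *src,
		     uint64_t page_size)
{
	struct uffdio_copy req;

	fill_copy_request(&req, dst, src, page_size);
	if (ioctl(uffd, UFFDIO_COPY, &req) < 0)
		return -1;	/* e.g. EEXIST if the page is already populated */
	return 0;
}
```

A background loop in the lazy-pages process could walk the not yet requested
pages and call push_page() for each of them after some timeout.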
> * Contains still many debug outputs which need to be cleaned up.
:) Yes, logging in CRIU is quite messy.
> v2:
> * provide option '--lazy-pages' to enable uffd style restore
> * use send_fd()/recv_fd() provided by criu (instead of own
> implementation)
> * do not install the uffd as service_fd
> * use named constants for MAP_ANONYMOUS
> * do not restore memory pages and then later mark them as uffd
> handled
> * remove function find_pages() to search in pages-<id>.img;
> now using criu functions to find the necessary pages;
> for each new page search the pages-<id>.img file is opened
> * only check the UFFDIO_API once
> * trying to protect uffd code by CONFIG_UFFD;
> use make UFFD=1 to compile criu with this patch
>
> Signed-off-by: Adrian Reber <areber at redhat.com>
> ---
> Makefile | 4 +
> Makefile.config | 3 +
> Makefile.crtools | 3 +
> cr-restore.c | 150 ++++++++++++++++++++++++++++++++++-
> crtools.c | 20 +++++
> include/cr_options.h | 1 +
> include/crtools.h | 2 +
> include/page-read.h | 2 +
> include/restorer.h | 2 +
> include/uffd.h | 18 +++++
> include/util-pie.h | 2 +-
> page-read.c | 13 ++++
> pie/restorer.c | 77 +++++++++++++++++-
> uffd.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 14 files changed, 508 insertions(+), 5 deletions(-)
> create mode 100644 include/uffd.h
> create mode 100644 uffd.c
>
> diff --git a/cr-restore.c b/cr-restore.c
> index c132588..43d4ce0 100644
> --- a/cr-restore.c
> +++ b/cr-restore.c
> @@ -19,6 +19,7 @@
> #include <sys/shm.h>
> #include <sys/mount.h>
> #include <sys/prctl.h>
> +#include <sys/syscall.h>
>
> #include <sched.h>
>
> @@ -78,6 +79,8 @@
> #include "seccomp.h"
> #include "bitmap.h"
> #include "fault-injection.h"
> +#include "uffd.h"
> +
> #include "parasite-syscall.h"
>
> #include "protobuf.h"
> @@ -463,6 +466,16 @@ static int restore_priv_vma_content(void)
> p = decode_pointer((off) * PAGE_SIZE +
> vma->premmaped_addr);
>
> + /*
> + * This means that userfaultfd is used to load the pages
> + * on demand.
> + */
> + if (opts.lazy_pages && (vma->e->flags & MAP_ANONYMOUS)) {
Should this also check that the region is non-shared? AFAIR uffd still
doesn't work with tmpfs (i.e. anon shared) regions.
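
Something along these lines is what I mean. This is only a sketch: the helper
name vma_is_lazy_candidate() is hypothetical, and it assumes the flags field
holds plain mmap() MAP_* bits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: leave a region to userfaultfd only if it is
 * anonymous and not shared. Anonymous shared memory is skipped here.
 */
static bool vma_is_lazy_candidate(unsigned int flags)
{
	return (flags & MAP_ANONYMOUS) && !(flags & MAP_SHARED);
}
```

The condition in the hunk would then read
opts.lazy_pages && vma_is_lazy_candidate(vma->e->flags).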
> + pr_debug("Lazy restore skips %lx\n", vma->e->start);
> + pr.skip_pages(&pr, PAGE_SIZE);
> + continue;
> + }
> +
> set_bit(off, vma->page_bitmap);
> if (vma->ppage_bitmap) { /* inherited vma */
> clear_bit(off, vma->ppage_bitmap);
> @@ -2980,6 +3112,22 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>
> strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>
> + if (!opts.lazy_pages)
> + task_args->uffd = -1;
> +
> +#ifdef CONFIG_UFFD
> + /*
> * Open userfaultfd FD which is passed to the restorer blob and
> + * to a second process handling the userfaultfd page faults.
> + */
> + task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> + pr_info("uffd %d\n", task_args->uffd);
> +
> + if (send_uffd(task_args->uffd) < 0) {
Inside this call you create a socket, make it listen(), then accept(). Why?
Shouldn't we do the opposite instead: connect to the lazy-pages server
and send it the uffd?
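
For reference, the descriptor passing itself is the usual SCM_RIGHTS exchange.
Below is a minimal sketch with raw sendmsg()/recvmsg(); the helper names
send_one_fd() and recv_one_fd() are hypothetical, and in CRIU the existing
send_fd()/recv_fd() helpers would be used instead. On the restore side, sock
would come from connect() to the lazy-pages server's unix socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one descriptor over a connected unix socket. */
static int send_one_fd(int sock, int fd)
{
	char data = 'F';	/* at least one byte of payload is required */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		struct cmsghdr hdr;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns it or -1. */
static int recv_one_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With this direction the lazy-pages server is the one that does listen() and
accept(), and every restored task simply connects and hands over its uffd.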
> + close(task_args->uffd);
> + goto err;
> + }
> +#endif
>
> /*
> * Fill up per-thread data.
> diff --git a/pie/restorer.c b/pie/restorer.c
> index 26494f9..29b14df 100644
> --- a/pie/restorer.c
> +++ b/pie/restorer.c
> @@ -867,6 +922,22 @@ long __export_restore_task(struct task_restore_args *args)
>
> pr_info("Switched to the restorer %d\n", my_pid);
>
> + if (args->uffd > -1) {
> + pr_info("logfd %d\n", args->logfd);
> + pr_info("uffd %d\n", args->uffd);
> +
> + uffd_flags = sys_fcntl(args->uffd, F_GETFD, 0);
> + pr_info("uffd_flags %d\n", uffd_flags);
> + pr_info("UFFD_API 0x%llx\n", UFFD_API);
> + uffdio_api.api = UFFD_API;
> + uffdio_api.features = 0;
> + rc = sys_ioctl(args->uffd, UFFDIO_API, &uffdio_api);
> + pr_info("ioctl UFFDIO_API rc %d\n", rc);
I guess we need to call UFFDIO_API before sending the uffd descriptor to the server,
so that it can start to do whatever it requires with it.
> + pr_info("uffdio_api.api 0x%llx\n", uffdio_api.api);
> + pr_info("uffdio_api.features 0x%llx\n", uffdio_api.features);
> + }
> +
> +
> if (vdso_do_park(&args->vdso_sym_rt, args->vdso_rt_parked_at, vdso_rt_size))
> goto core_restore_end;
>
> @@ -888,7 +959,7 @@ long __export_restore_task(struct task_restore_args *args)
> break;
>
> if (vma_remap(vma_premmaped_start(vma_entry),
> - vma_entry->start, vma_entry_len(vma_entry)))
> + vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
> goto core_restore_end;
> }
>
> @@ -906,7 +977,7 @@ long __export_restore_task(struct task_restore_args *args)
> break;
>
> if (vma_remap(vma_premmaped_start(vma_entry),
> - vma_entry->start, vma_entry_len(vma_entry)))
> + vma_entry->start, vma_entry_len(vma_entry), args->uffd, vma_entry->flags))
> goto core_restore_end;
> }
>
-- Pavel