[CRIU] [PATCH v6 3/3] Try to include userfaultfd with criu (part 2)
Andrew Vagin
avagin at virtuozzo.com
Mon Apr 11 22:33:10 PDT 2016
Hi Adrian,
It can't be compiled on ppc64le. Could you take a look?
https://ci.openvz.org/job/CRIU/job/CRIU-ppc64le/branch/criu-dev/5/console
On Tue, Mar 15, 2016 at 01:21:13PM +0000, Adrian Reber wrote:
> From: Adrian Reber <areber at redhat.com>
>
> This is a first try to include userfaultfd with criu. Right now it
> still requires a "normal" checkpoint. After checkpointing the
> application it can be restored with the help of userfaultfd.
>
> All restored pages with MAP_ANONYMOUS and MAP_PRIVATE set are marked as
> being handled by userfaultfd.
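
A minimal sketch of that selection rule, for readers following along. The helper name flags_can_be_lazy() is hypothetical; it takes raw mmap() flags rather than a VmaEntry and leaves out the vsyscall exclusion that the patch's vma_entry_can_be_lazy() adds.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical stand-in for vma_entry_can_be_lazy(): a mapping is a
 * candidate for lazy restore only if it is both anonymous and private.
 * The real helper additionally rejects VMA_AREA_VSYSCALL areas.
 */
static bool flags_can_be_lazy(int flags)
{
	return (flags & MAP_ANONYMOUS) && (flags & MAP_PRIVATE);
}
```

Shared or file-backed mappings fail the check and would be restored the normal way.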
>
> As soon as the process is restored it blocks on the first memory access
> and waits for pages being transferred by userfaultfd.
>
> To handle the required pages a new criu command has been added. For a
> userfaultfd supported restore the first step is to start the
> 'lazy-pages' server:
>
> criu lazy-pages -v4 -D /tmp/3/ --address /tmp/userfault.socket
>
> This waits on a unix domain socket (defined using the --address option)
> to receive a userfaultfd file descriptor from a '--lazy-pages' enabled
> 'criu restore':
>
> criu restore -D /tmp/3 -j -v4 --lazy-pages \
> --address /tmp/userfault.socket
>
> In the first step the VDSO pages are pushed from the lazy-pages server
> into the restored process. After that the lazy-pages server waits on the
> UFFD FD for a UFFD requested page. If there are no requests received
> during a period of 5 seconds the lazy-pages server switches into a mode
> where the remaining, non-transferred pages are copied into the
> destination process. After all remaining pages have been copied the
> lazy-pages server exits.
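
The request/idle behaviour described above could be sketched roughly as follows. This is not the code from uffd.c: serve_until_idle() and IDLE_TIMEOUT_MS are hypothetical names, and a one-byte read stands in for reading a real struct uffd_msg from the UFFD FD.

```c
#include <poll.h>
#include <unistd.h>

#define IDLE_TIMEOUT_MS 5000	/* the 5 second idle period from the description */

/*
 * Serve requests arriving on fd until none shows up for timeout_ms.
 * Returns the number of requests served, or -1 on error. After it
 * returns, the caller would copy all remaining pages and exit.
 */
static int serve_until_idle(int fd, int timeout_ms)
{
	int served = 0;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		char req;	/* stand-in for a struct uffd_msg */
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc < 0)
			return -1;
		if (rc == 0)
			return served;	/* idle: switch to copy mode */
		if (read(fd, &req, 1) != 1)
			return -1;
		served++;		/* a real server would UFFDIO_COPY here */
	}
}
```

A caller would pass IDLE_TIMEOUT_MS as the timeout; a shorter value is handy when trying the loop against a pipe.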
>
> The first page that is usually requested is a VDSO page. The process
> currently used for restoring has two VDSO pages, but only one is
> requested via userfaultfd. In the second part, where the remaining
> pages are copied into the process, the second VDSO page is also copied
> as it has not been requested previously. Unfortunately, even though
> this page has not been requested before, it is not accepted by
> userfaultfd: EINVAL is returned. The reason for EINVAL is not
> understood, and therefore the VDSO pages are copied into the process
> first; the server then switches to request mode and copies the pages
> which are requested via userfaultfd. To decide at which point the VDSO
> pages can be copied into the process, the lazy-pages server currently
> waits for the first page requested via userfaultfd. This is one of the
> VDSO pages. To avoid copying a page a second time, which is unnecessary
> and not possible, there is now a check to see if the page has been
> transferred previously.
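
One way such an "already transferred" check could look is a per-page bitmap. This is only an illustration; the description does not say which data structure the patch uses, and mark_transferred() is a hypothetical helper.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Mark page page_idx as transferred. Returns true if the page was
 * newly marked (so it should be copied now) and false if it had
 * already been transferred (so the copy must be skipped).
 */
static bool mark_transferred(unsigned long *bitmap, size_t page_idx)
{
	unsigned long mask = 1UL << (page_idx % BITS_PER_WORD);
	unsigned long *word = &bitmap[page_idx / BITS_PER_WORD];

	if (*word & mask)
		return false;
	*word |= mask;
	return true;
}
```

Both the request path and the final copy pass would call the helper before copying, so a page is handed to userfaultfd at most once.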
>
> Using userfaultfd with a checkpointed process on a remote machine
> will probably benefit from the current work related to image-cache
> and image-proxy.
>
> For the final implementation it would be nice to have a restore running
> in uffd mode on one system which requests the memory pages over the
> network from another system which is running 'criu checkpoint' also in
> uffd mode. This way the pages need to be copied only 'once' from the
> checkpoint process to the uffd restore process.
>
> TODO:
> * Still contains many debug outputs which need to be cleaned up.
> * Maybe transfer the dump directory FD also via unix domain sockets
> so that the 'uffd'/'lazy-pages' server can keep running without
> the need to specify the dump directory with '-D'
> * Keep the lazy-pages server running after all pages have been
> transferred and start waiting for new connections to serve.
> * Resurrect the non-cooperative patch set, as once the restored task
> fork()'s or calls mremap() the whole thing becomes broken.
> * Figure out if current VDSO handling is correct.
> * Figure out when and how zero pages need to be inserted via uffd.
>
> v2:
> * provide option '--lazy-pages' to enable uffd style restore
> * use send_fd()/recv_fd() provided by criu (instead of own
> implementation)
> * do not install the uffd as service_fd
> * use named constants for MAP_ANONYMOUS
> * do not restore memory pages and then later mark them as uffd
> handled
> * remove function find_pages() to search in pages-<id>.img;
> now using criu functions to find the necessary pages;
> for each new page search the pages-<id>.img file is opened
> * only check the UFFDIO_API once
> * trying to protect uffd code by CONFIG_UFFD;
> use make UFFD=1 to compile criu with this patch
>
> v3:
> * renamed the server mode from 'uffd' -> 'lazy-pages'
> * switched client and server roles transferring the UFFD FD
> * the criu part running in lazy-pages server mode is now
> waiting for connections
> * the criu restore process connects to the lazy-pages server
> to pass the UFFD FD
> * before UFFD copying anything else the VDSO pages are copied
> as it fails to copy unused VDSO pages once the process is running.
> this was necessary to be able to copy all pages.
> * if there are no more UFFD messages for 5 seconds the lazy-pages
> server switches into copy mode to copy all remaining pages, which
> have not been requested yet, into the restored process
> * check the UFFDIO_API at the correct place
> * close UFFD FD in the restorer to remove open UFFD FD in the
> restored process
>
> v4:
> * removed unnecessary madvise() calls; they seemed necessary when
> first running tests with uffd, but they actually are not
> * auto-detect if build-system provides linux/userfaultfd.h
> header.
> * simplify unix domain socket setup and communication.
> * use --address to specify the location of the used
> unix domain socket.
>
> v5:
> * split the userfaultfd patch in multiple smaller patches
> * introduced vma_can_be_lazy() function to check if a page
> can be handled by uffd
> * moved uffd related code from cr-restore.c to uffd.c
> * handle failure to register a memory page of the restored process
> with userfaultfd
>
> v6:
> * get PID of to be restored process from the 'criu restore' process;
> first the PID is transferred and then the UFFD
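
The "PID first, then the FD" ordering can be tried out in isolation with a sketch like the one below. It does not use criu's send_fd()/recv_fd(); the sketch_* and *_pid_then_fd helpers are hypothetical and pass the descriptor with plain SCM_RIGHTS over a SOCK_STREAM unix socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one fd, aligned for struct cmsghdr. */
union fd_ctl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one descriptor as SCM_RIGHTS ancillary data with a dummy byte. */
static int sketch_send_fd(int sock, int fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd or -1. */
static int sketch_recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}

/* Restore side: the PID goes first as a plain int, then the descriptor. */
static int send_pid_then_fd(int sock, int pid, int fd)
{
	if (send(sock, &pid, sizeof(pid), 0) != (ssize_t)sizeof(pid))
		return -1;
	return sketch_send_fd(sock, fd);
}

/* Server side: read the PID, then pick up the descriptor. */
static int recv_pid_then_fd(int sock, int *pid)
{
	if (recv(sock, pid, sizeof(*pid), MSG_WAITALL) != (ssize_t)sizeof(*pid))
		return -1;
	return sketch_recv_fd(sock);
}
```

Both sides have to agree on the order, since the receiver reads exactly sizeof(int) bytes before it looks for ancillary data.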
>
> Signed-off-by: Adrian Reber <areber at redhat.com>
> ---
> criu/cr-restore.c | 21 +++++++++++
> criu/crtools.c | 14 ++++++++
> criu/include/cr_options.h | 1 +
> criu/include/restorer.h | 2 ++
> criu/include/uffd.h | 6 ++++
> criu/include/vma.h | 9 +++++
> criu/pie/restorer.c | 77 ++++++++++++++++++++++++++++++++++++++---
> criu/uffd.c | 88 +++++++++++++++++++++++++++++++++++++++++++++--
> 8 files changed, 210 insertions(+), 8 deletions(-)
>
> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> index 9950fa1..2299e95 100644
> --- a/criu/cr-restore.c
> +++ b/criu/cr-restore.c
> @@ -19,6 +19,7 @@
> #include <sys/shm.h>
> #include <sys/mount.h>
> #include <sys/prctl.h>
> +#include <sys/syscall.h>
>
> #include <sched.h>
>
> @@ -76,6 +77,8 @@
> #include "seccomp.h"
> #include "bitmap.h"
> #include "fault-injection.h"
> +#include "uffd.h"
> +
> #include "parasite-syscall.h"
>
> #include "protobuf.h"
> @@ -415,6 +418,7 @@ static int restore_priv_vma_content(void)
> unsigned int nr_shared = 0;
> unsigned int nr_droped = 0;
> unsigned int nr_compared = 0;
> + unsigned int nr_lazy = 0;
> unsigned long va;
> struct page_read pr;
>
> @@ -469,6 +473,17 @@ static int restore_priv_vma_content(void)
> p = decode_pointer((off) * PAGE_SIZE +
> vma->premmaped_addr);
>
> + /*
> + * This means that userfaultfd is used to load the pages
> + * on demand.
> + */
> + if (opts.lazy_pages && vma_entry_can_be_lazy(vma->e)) {
> + pr_debug("Lazy restore skips %lx\n", vma->e->start);
> + pr.skip_pages(&pr, PAGE_SIZE);
> + nr_lazy++;
> + continue;
> + }
> +
> set_bit(off, vma->page_bitmap);
> if (vma->ppage_bitmap) { /* inherited vma */
> clear_bit(off, vma->ppage_bitmap);
> @@ -557,6 +572,7 @@ err_read:
> pr_info("nr_restored_pages: %d\n", nr_restored);
> pr_info("nr_shared_pages: %d\n", nr_shared);
> pr_info("nr_droped_pages: %d\n", nr_droped);
> + pr_info("nr_lazy: %d\n", nr_lazy);
>
> return 0;
>
> @@ -3231,6 +3247,11 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>
> strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>
> + if (!opts.lazy_pages)
> + task_args->uffd = -1;
> + else
> + if (setup_uffd(task_args, pid) != 0)
> + goto err;
>
> /*
> * Fill up per-thread data.
> diff --git a/criu/crtools.c b/criu/crtools.c
> index 763b5fd..2e85cb8 100644
> --- a/criu/crtools.c
> +++ b/criu/crtools.c
> @@ -319,6 +319,9 @@ int main(int argc, char *argv[], char *envp[])
> { "external", required_argument, 0, 1073 },
> { "empty-ns", required_argument, 0, 1074 },
> { "unshare", required_argument, 0, 1075 },
> +#ifdef CONFIG_HAS_UFFD
> + { "lazy-pages", no_argument, 0, 1076 },
> +#endif
> { },
> };
>
> @@ -572,6 +575,11 @@ int main(int argc, char *argv[], char *envp[])
> if (parse_unshare_arg(optarg))
> return -1;
> break;
> +#ifdef CONFIG_HAS_UFFD
> + case 1076:
> + opts.lazy_pages = true;
> + break;
> +#endif
> case 'M':
> {
> char *aux;
> @@ -813,6 +821,12 @@ usage:
> " --unshare FLAGS what namespaces to unshare when restoring\n"
> " --freeze-cgroup\n"
> " use cgroup freezer to collect processes\n"
> +#ifdef CONFIG_HAS_UFFD
> +" --lazy-pages restore pages on demand\n"
> +" this requires running a second instance of criu\n"
> +" in lazy-pages mode: 'criu lazy-pages -D DIR'\n"
> +" --lazy-pages and lazy-pages mode require userfaultfd\n"
> +#endif
> "\n"
> "* Special resources support:\n"
> " -x|--" USK_EXT_PARAM "inode,.." " allow external unix connections (optionally can be assign socket's inode that allows one-sided dump)\n"
> diff --git a/criu/include/cr_options.h b/criu/include/cr_options.h
> index c1e8fbc..918bb4d 100644
> --- a/criu/include/cr_options.h
> +++ b/criu/include/cr_options.h
> @@ -108,6 +108,7 @@ struct cr_options {
> char *lsm_profile;
> unsigned int timeout;
> unsigned int empty_ns;
> + bool lazy_pages;
> };
>
> extern struct cr_options opts;
> diff --git a/criu/include/restorer.h b/criu/include/restorer.h
> index 9896aa1..86236b0 100644
> --- a/criu/include/restorer.h
> +++ b/criu/include/restorer.h
> @@ -124,6 +124,8 @@ struct task_restore_args {
> int logfd;
> unsigned int loglevel;
>
> + int uffd;
> +
> /* threads restoration */
> int nr_threads; /* number of threads */
> thread_restore_fcall_t clone_restore_fn; /* helper address for clone() call */
> diff --git a/criu/include/uffd.h b/criu/include/uffd.h
> index d5a043b..6c931e2 100644
> --- a/criu/include/uffd.h
> +++ b/criu/include/uffd.h
> @@ -2,6 +2,7 @@
> #define __CR_UFFD_H_
>
> #include "config.h"
> +#include "restorer.h"
>
> #ifdef CONFIG_HAS_UFFD
>
> @@ -11,6 +12,11 @@
> #ifndef __NR_userfaultfd
> #error "missing __NR_userfaultfd definition"
> #endif
> +
> +extern int setup_uffd(struct task_restore_args *task_args, int pid);
> +#else
> +static inline int setup_uffd(struct task_restore_args *task_args, int pid) { return 0; }
> +
> #endif /* CONFIG_HAS_UFFD */
>
> #endif /* __CR_UFFD_H_ */
> diff --git a/criu/include/vma.h b/criu/include/vma.h
> index 247c5a3..28b77b5 100644
> --- a/criu/include/vma.h
> +++ b/criu/include/vma.h
> @@ -7,6 +7,8 @@
>
> #include "images/vma.pb-c.h"
>
> +#include <sys/mman.h>
> +
> struct vm_area_list {
> struct list_head h;
> unsigned nr;
> @@ -107,4 +109,11 @@ static inline bool vma_area_is_private(struct vma_area *vma,
> return vma_entry_is_private(vma->e, task_size);
> }
>
> +static inline bool vma_entry_can_be_lazy(VmaEntry *e)
> +{
> + return ((e->flags & MAP_ANONYMOUS) &&
> + (e->flags & MAP_PRIVATE) &&
> + !(vma_entry_is(e, VMA_AREA_VSYSCALL)));
> +}
> +
> #endif /* __CR_VMA_H__ */
> diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
> index f7bde75..4a009f4 100644
> --- a/criu/pie/restorer.c
> +++ b/criu/pie/restorer.c
> @@ -25,6 +25,7 @@
> #include "image.h"
> #include "sk-inet.h"
> #include "vma.h"
> +#include "uffd.h"
>
> #include "crtools.h"
> #include "lock.h"
> @@ -567,8 +568,50 @@ static void rst_tcp_socks_all(struct task_restore_args *ta)
> rst_tcp_repair_off(&ta->tcp_socks[i]);
> }
>
> -static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
> +
> +
> +
> +static int enable_uffd(int uffd, unsigned long addr, unsigned long len)
> {
> + /*
> + * If uffd == -1, this means that userfaultfd is not enabled
> + * or it is not available.
> + */
> + if (uffd == -1)
> + return 0;
> +#ifdef CONFIG_HAS_UFFD
> + int rc;
> + struct uffdio_register uffdio_register;
> + unsigned long expected_ioctls;
> +
> + uffdio_register.range.start = addr;
> + uffdio_register.range.len = len;
> + uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> + pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
> + pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
> + rc = sys_ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
> + pr_info("lazy-pages: ioctl UFFDIO_REGISTER rc %d\n", rc);
> + pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
> + pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
> + if (rc != 0)
> + return -1;
> +
> + expected_ioctls = (1 << _UFFDIO_WAKE) | (1 << _UFFDIO_COPY) | (1 << _UFFDIO_ZEROPAGE);
> +
> + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
> + pr_err("lazy-pages: unexpected missing uffd ioctl for anon memory\n");
> + }
> +
> +#endif
> + return 0;
> +}
> +
> +
> +static int vma_remap(VmaEntry *vma_entry, int uffd)
> +{
> + unsigned long src = vma_premmaped_start(vma_entry);
> + unsigned long dst = vma_entry->start;
> + unsigned long len = vma_entry_len(vma_entry);
> unsigned long guard = 0, tmp;
>
> pr_info("Remap %lx->%lx len %lx\n", src, dst, len);
> @@ -640,6 +683,18 @@ static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
> return -1;
> }
>
> + /*
> + * If running in userfaultfd/lazy-pages mode pages with
> + * MAP_ANONYMOUS and MAP_PRIVATE are remapped but without the
> + * real content.
> + * The function enable_uffd() marks the page(s) as userfaultfd
> + * pages, so that the processes will hang until the memory is
> + * injected via userfaultfd.
> + */
> + if (vma_entry_can_be_lazy(vma_entry))
> + if (enable_uffd(uffd, dst, len) != 0)
> + return -1;
> +
> return 0;
> }
>
> @@ -898,6 +953,10 @@ long __export_restore_task(struct task_restore_args *args)
>
> pr_info("Switched to the restorer %d\n", my_pid);
>
> + if (args->uffd > -1) {
> + pr_debug("lazy-pages: uffd %d\n", args->uffd);
> + }
> +
> if (vdso_do_park(&args->vdso_sym_rt, args->vdso_rt_parked_at, vdso_rt_size))
> goto core_restore_end;
>
> @@ -918,8 +977,7 @@ long __export_restore_task(struct task_restore_args *args)
> if (vma_entry->start > vma_entry->shmid)
> break;
>
> - if (vma_remap(vma_premmaped_start(vma_entry),
> - vma_entry->start, vma_entry_len(vma_entry)))
> + if (vma_remap(vma_entry, args->uffd))
> goto core_restore_end;
> }
>
> @@ -936,11 +994,20 @@ long __export_restore_task(struct task_restore_args *args)
> if (vma_entry->start < vma_entry->shmid)
> break;
>
> - if (vma_remap(vma_premmaped_start(vma_entry),
> - vma_entry->start, vma_entry_len(vma_entry)))
> + if (vma_remap(vma_entry, args->uffd))
> goto core_restore_end;
> }
>
> + if (args->uffd > -1) {
> + pr_debug("lazy-pages: closing uffd %d\n", args->uffd);
> + /*
> + * All userfaultfd configuration has finished at this point.
> + * Let's close the UFFD file descriptor, so that the restored
> + * process does not have an opened UFFD FD for ever.
> + */
> + sys_close(args->uffd);
> + }
> +
> /*
> * OK, lets try to map new one.
> */
> diff --git a/criu/uffd.c b/criu/uffd.c
> index 6d8b286..3beae79 100644
> --- a/criu/uffd.c
> +++ b/criu/uffd.c
> @@ -30,6 +30,90 @@
> #undef LOG_PREFIX
> #define LOG_PREFIX "lazy-pages: "
>
> +static int send_uffd(int sendfd, int pid)
> +{
> + int fd;
> + int len;
> + int ret = -1;
> + struct sockaddr_un sun;
> +
> + if (!opts.addr) {
> + pr_info("Please specify a file name for the unix domain socket\n");
> + pr_info("used to communicate between the lazy-pages server\n");
> + pr_info("and the restore process. Use the --address option like\n");
> + pr_info("criu restore --lazy-pages --address /tmp/userfault.socket\n");
> + return -1;
> + }
> +
> + if (sendfd < 0)
> + return -1;
> +
> + if (strlen(opts.addr) >= sizeof(sun.sun_path)) {
> + return -1;
> + }
> +
> + if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
> + return -1;
> +
> + memset(&sun, 0, sizeof(sun));
> + sun.sun_family = AF_UNIX;
> + strcpy(sun.sun_path, opts.addr);
> + len = offsetof(struct sockaddr_un, sun_path) + strlen(opts.addr);
> + if (connect(fd, (struct sockaddr *) &sun, len) < 0) {
> + pr_perror("connect to %s failed", opts.addr);
> + goto out;
> + }
> +
> + /* The "transfer protocol" is first the pid as int and then
> + * the FD for UFFD */
> + pr_debug("Sending PID %d\n", pid);
> + if (send(fd, &pid, sizeof(pid), 0) < 0) {
> + pr_perror("PID sending error:");
> + goto out;
> + }
> +
> + if (send_fd(fd, NULL, 0, sendfd) < 0) {
> + pr_perror("send_fd error:");
> + goto out;
> + }
> + ret = 0;
> +out:
> + close(fd);
> + return ret;
> +}
> +
> +/* This function is used by 'criu restore --lazy-pages' */
> +int setup_uffd(struct task_restore_args *task_args, int pid)
> +{
> + struct uffdio_api uffdio_api;
> + /*
> + * Open userfaultfd FD which is passed to the restorer blob and
> + * to a second process handling the userfaultfd page faults.
> + */
> + task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +
> + /*
> + * Check if the UFFD_API is the one which is expected
> + */
> + uffdio_api.api = UFFD_API;
> + uffdio_api.features = 0;
> + if (ioctl(task_args->uffd, UFFDIO_API, &uffdio_api)) {
> + pr_err("Checking for UFFDIO_API failed.\n");
> + return -1;
> + }
> + if (uffdio_api.api != UFFD_API) {
> + pr_err("Result of looking up UFFDIO_API does not match: %Lu\n", uffdio_api.api);
> + return -1;
> + }
> +
> + if (send_uffd(task_args->uffd, pid) < 0) {
> + close(task_args->uffd);
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> static int server_listen(struct sockaddr_un *saddr)
> {
> int fd;
> @@ -232,9 +316,7 @@ static int collect_uffd_pages(struct page_read *pr, struct list_head *uffd_list,
> * in the VMA list.
> */
> if (base >= vma->e->start && base < vma->e->end) {
> - if ((vma->e->flags & MAP_ANONYMOUS) &&
> - (vma->e->flags & MAP_PRIVATE) &&
> - !(vma_area_is(vma, VMA_AREA_VSYSCALL))) {
> + if (vma_entry_can_be_lazy(vma->e)) {
> uffd_page = true;
> if (vma_area_is(vma, VMA_AREA_VDSO))
> uffd_vdso = true;
> --
> 1.8.3.1
>
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu