[CRIU] [PATCH v6 3/3] Try to include userfaultfd with criu (part 2)

Andrew Vagin avagin at virtuozzo.com
Mon Apr 11 22:33:10 PDT 2016


Hi Adrian,

It can't be compiled on ppc64le. Could you take a look?
https://ci.openvz.org/job/CRIU/job/CRIU-ppc64le/branch/criu-dev/5/console

On Tue, Mar 15, 2016 at 01:21:13PM +0000, Adrian Reber wrote:
> From: Adrian Reber <areber at redhat.com>
> 
> This is a first try to include userfaultfd with criu. Right now it
> still requires a "normal" checkpoint. After checkpointing the
> application it can be restored with the help of userfaultfd.
> 
> All restored pages with MAP_ANONYMOUS and MAP_PRIVATE set are marked as
> being handled by userfaultfd.
> 
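For readers following along, the selection rule above can be sketched in isolation. This is only an illustration under assumptions: flags_can_be_lazy() is a hypothetical stand-in that takes raw mmap flags, whereas the patch's vma_entry_can_be_lazy() works on a VmaEntry and also excludes vsyscall areas.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not part of the patch: a mapping qualifies for
 * lazy restore only if it is both anonymous and private. The vsyscall
 * exclusion done by the real helper is omitted here.
 */
static bool flags_can_be_lazy(unsigned long flags)
{
	return (flags & MAP_ANONYMOUS) && (flags & MAP_PRIVATE);
}
```

A shared anonymous mapping or a private file-backed mapping fails the check and would be restored eagerly as before.
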
> As soon as the process is restored, it blocks on the first memory access
> and waits for the pages to be transferred via userfaultfd.
> 
> To handle the required pages a new criu command has been added. For a
> userfaultfd supported restore the first step is to start the
> 'lazy-pages' server:
> 
>   criu lazy-pages -v4 -D /tmp/3/ --address /tmp/userfault.socket
> 
> This waits on a unix domain socket (defined using the --address option)
> to receive a userfaultfd file descriptor from a '--lazy-pages' enabled
> 'criu restore':
> 
>   criu restore -D /tmp/3 -j -v4 --lazy-pages \
>   --address /tmp/userfault.socket
> 
> In the first step the VDSO pages are pushed from the lazy-pages server
> into the restored process. After that the lazy-pages server waits on the
> UFFD FD for a UFFD requested page. If there are no requests received
> during a period of 5 seconds the lazy-pages server switches into a mode
> where the remaining, non-transferred pages are copied into the
> destination process. After all remaining pages have been copied the
> lazy-pages server exits.
> 
> The first page that usually is requested is a VDSO page. The process
> currently used for restoring has two VDSO pages, but only one is
> requested via userfaultfd. In the second part, where the remaining pages
> are copied into the process, the second VDSO page is also copied, as it
> has not been requested previously. Unfortunately, even though this page
> has not been requested before, it is not accepted by userfaultfd:
> EINVAL is returned. The reason for EINVAL is not understood, so the
> VDSO pages are copied into the process first; the server then switches
> to request mode and copies the pages that are requested via
> userfaultfd. To decide at which point the VDSO pages can be copied into
> the process, the lazy-pages server currently waits for the first page
> requested via userfaultfd, which is one of the VDSO pages. To avoid
> copying a page a second time, which is unnecessary and not possible,
> there is now a check to see whether the page has been transferred
> previously.
> 
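The "has this page been transferred already" check can be illustrated with a small bitmap. Everything here is an assumption made for the sketch (struct sent_pages, mark_page_sent(), the fixed 4 KiB page size and the 1024-page window); the patch may track transferred pages differently.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_NR_PAGES		1024
#define SKETCH_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bookkeeping: one bit per page, starting at base. */
struct sent_pages {
	unsigned long base;
	unsigned long bits[SKETCH_NR_PAGES / (sizeof(unsigned long) * CHAR_BIT)];
};

/*
 * Returns true if the page containing addr was not yet marked (the
 * caller should copy it now) and marks it. Returns false if the page
 * was already transferred or lies outside the tracked window.
 */
static bool mark_page_sent(struct sent_pages *sp, unsigned long addr)
{
	unsigned long idx, mask;

	if (addr < sp->base)
		return false;
	idx = (addr - sp->base) / SKETCH_PAGE_SIZE;
	if (idx >= SKETCH_NR_PAGES)
		return false;

	mask = 1UL << (idx % SKETCH_WORD_BITS);
	if (sp->bits[idx / SKETCH_WORD_BITS] & mask)
		return false;

	sp->bits[idx / SKETCH_WORD_BITS] |= mask;
	return true;
}
```

Both the request-driven path and the final copy-everything pass would go through the same check, so a page served on demand is skipped later.
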
> Using userfaultfd with a process checkpointed on a remote machine will
> probably benefit from the current work related to image-cache and
> image-proxy.
> 
> For the final implementation it would be nice to have a restore running
> in uffd mode on one system which requests the memory pages over the
> network from another system which is running 'criu checkpoint' also in
> uffd mode. This way the pages need to be copied only 'once' from the
> checkpoint process to the uffd restore process.
> 
> TODO:
>     * Still contains many debug outputs which need to be cleaned up.
>     * Maybe transfer the dump directory FD also via unix domain sockets
>       so that the 'uffd'/'lazy-pages' server can keep running without
>       the need to specify the dump directory with '-D'
>     * Keep the lazy-pages server running after all pages have been
>       transferred and start waiting for new connections to serve.
>     * Resurrect the non-cooperative patch set, as once the restored task
>       fork()'s or calls mremap() the whole thing becomes broken.
>     * Figure out if current VDSO handling is correct.
>     * Figure out when and how zero pages need to be inserted via uffd.
> 
> v2:
>     * provide option '--lazy-pages' to enable uffd style restore
>     * use send_fd()/recv_fd() provided by criu (instead of own
>       implementation)
>     * do not install the uffd as service_fd
>     * use named constants for MAP_ANONYMOUS
>     * do not restore memory pages and then later mark them as uffd
>       handled
>     * remove function find_pages() to search in pages-<id>.img;
>       now using criu functions to find the necessary pages;
>       for each new page search the pages-<id>.img file is opened
>     * only check the UFFDIO_API once
>     * trying to protect uffd code by CONFIG_UFFD;
>       use make UFFD=1 to compile criu with this patch
> 
> v3:
>    * renamed the server mode from 'uffd' -> 'lazy-pages'
>    * switched client and server roles transferring the UFFD FD
>      * the criu part running in lazy-pages server mode is now
>        waiting for connections
>      * the criu restore process connects to the lazy-pages server
>        to pass the UFFD FD
>    * before copying anything else via UFFD, the VDSO pages are copied,
>      because copying unused VDSO pages fails once the process is running;
>      this was necessary to be able to copy all pages.
>    * if there are no more UFFD messages for 5 seconds the lazy-pages
>      server switches into copy mode to copy all remaining pages, which
>      have not been requested yet, into the restored process
>    * check the UFFDIO_API at the correct place
>    * close UFFD FD in the restorer to remove open UFFD FD in the
>      restored process
> 
> v4:
>     * removed unnecessary madvise() calls; they seemed necessary when
>       first running tests with uffd, but they actually are not
>     * auto-detect if build-system provides linux/userfaultfd.h
>       header.
>     * simplify unix domain socket setup and communication.
>     * use --address to specify the location of the used
>       unix domain socket.
> 
> v5:
>     * split the userfaultfd patch in multiple smaller patches
>     * introduced vma_can_be_lazy() function to check if a page
>       can be handled by uffd
>     * moved uffd related code from cr-restore.c to uffd.c
>     * handle failure to register a memory page of the restored process
>       with userfaultfd
> 
> v6:
>     * get the PID of the process to be restored from the 'criu restore'
>       process; first the PID is transferred and then the UFFD
> 
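The v6 ordering (PID first, then the descriptor) can be sketched with plain sockets. Note the assumptions: the patch itself uses criu's send_fd() helper, while send_pid_and_fd() and recv_pid_and_fd() below are hypothetical names that spell out the SCM_RIGHTS transfer by hand for a SOCK_STREAM unix socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Properly aligned buffer for one SCM_RIGHTS control message. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical sender: the PID as a plain int, then fd as ancillary data. */
static int send_pid_and_fd(int sock, int pid, int fd)
{
	char dummy = '*';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	if (send(sock, &pid, sizeof(pid), 0) != (ssize_t)sizeof(pid))
		return -1;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical receiver: reads the PID, then the passed descriptor. */
static int recv_pid_and_fd(int sock, int *pid, int *fd)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	if (recv(sock, pid, sizeof(*pid), MSG_WAITALL) != (ssize_t)sizeof(*pid))
		return -1;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
	return 0;
}
```

In the flow described here, 'criu restore' would play the sender role and the lazy-pages server the receiver role; the received descriptor refers to the same open file as the sender's UFFD.
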
> Signed-off-by: Adrian Reber <areber at redhat.com>
> ---
>  criu/cr-restore.c         | 21 +++++++++++
>  criu/crtools.c            | 14 ++++++++
>  criu/include/cr_options.h |  1 +
>  criu/include/restorer.h   |  2 ++
>  criu/include/uffd.h       |  6 ++++
>  criu/include/vma.h        |  9 +++++
>  criu/pie/restorer.c       | 77 ++++++++++++++++++++++++++++++++++++++---
>  criu/uffd.c               | 88 +++++++++++++++++++++++++++++++++++++++++++++--
>  8 files changed, 210 insertions(+), 8 deletions(-)
> 
> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> index 9950fa1..2299e95 100644
> --- a/criu/cr-restore.c
> +++ b/criu/cr-restore.c
> @@ -19,6 +19,7 @@
>  #include <sys/shm.h>
>  #include <sys/mount.h>
>  #include <sys/prctl.h>
> +#include <sys/syscall.h>
>  
>  #include <sched.h>
>  
> @@ -76,6 +77,8 @@
>  #include "seccomp.h"
>  #include "bitmap.h"
>  #include "fault-injection.h"
> +#include "uffd.h"
> +
>  #include "parasite-syscall.h"
>  
>  #include "protobuf.h"
> @@ -415,6 +418,7 @@ static int restore_priv_vma_content(void)
>  	unsigned int nr_shared = 0;
>  	unsigned int nr_droped = 0;
>  	unsigned int nr_compared = 0;
> +	unsigned int nr_lazy = 0;
>  	unsigned long va;
>  	struct page_read pr;
>  
> @@ -469,6 +473,17 @@ static int restore_priv_vma_content(void)
>  			p = decode_pointer((off) * PAGE_SIZE +
>  					vma->premmaped_addr);
>  
> +			/*
> +			 * This means that userfaultfd is used to load the pages
> +			 * on demand.
> +			 */
> +			if (opts.lazy_pages && vma_entry_can_be_lazy(vma->e)) {
> +				pr_debug("Lazy restore skips %lx\n", vma->e->start);
> +				pr.skip_pages(&pr, PAGE_SIZE);
> +				nr_lazy++;
> +				continue;
> +			}
> +
>  			set_bit(off, vma->page_bitmap);
>  			if (vma->ppage_bitmap) { /* inherited vma */
>  				clear_bit(off, vma->ppage_bitmap);
> @@ -557,6 +572,7 @@ err_read:
>  	pr_info("nr_restored_pages: %d\n", nr_restored);
>  	pr_info("nr_shared_pages:   %d\n", nr_shared);
>  	pr_info("nr_droped_pages:   %d\n", nr_droped);
> +	pr_info("nr_lazy:           %d\n", nr_lazy);
>  
>  	return 0;
>  
> @@ -3231,6 +3247,11 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>  
>  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>  
> +	if (!opts.lazy_pages)
> +		task_args->uffd = -1;
> +	else
> +		if (setup_uffd(task_args, pid) != 0)
> +			goto err;
>  
>  	/*
>  	 * Fill up per-thread data.
> diff --git a/criu/crtools.c b/criu/crtools.c
> index 763b5fd..2e85cb8 100644
> --- a/criu/crtools.c
> +++ b/criu/crtools.c
> @@ -319,6 +319,9 @@ int main(int argc, char *argv[], char *envp[])
>  		{ "external",			required_argument,	0, 1073	},
>  		{ "empty-ns",			required_argument,	0, 1074	},
>  		{ "unshare",			required_argument,	0, 1075 },
> +#ifdef CONFIG_HAS_UFFD
> +		{ "lazy-pages",			no_argument,		0, 1076 },
> +#endif
>  		{ },
>  	};
>  
> @@ -572,6 +575,11 @@ int main(int argc, char *argv[], char *envp[])
>  			if (parse_unshare_arg(optarg))
>  				return -1;
>  			break;
> +#ifdef CONFIG_HAS_UFFD
> +		case 1076:
> +			opts.lazy_pages = true;
> +			break;
> +#endif
>  		case 'M':
>  			{
>  				char *aux;
> @@ -813,6 +821,12 @@ usage:
>  "  --unshare FLAGS       what namespaces to unshare when restoring\n"
>  "  --freeze-cgroup\n"
>  "                        use cgroup freezer to collect processes\n"
> +#ifdef CONFIG_HAS_UFFD
> +"  --lazy-pages          restore pages on demand\n"
> +"                        this requires running a second instance of criu\n"
> +"                        in lazy-pages mode: 'criu lazy-pages -D DIR'\n"
> +"                        --lazy-pages and lazy-pages mode require userfaultfd\n"
> +#endif
>  "\n"
>  "* Special resources support:\n"
>  "  -x|--" USK_EXT_PARAM "inode,.." "      allow external unix connections (optionally can be assign socket's inode that allows one-sided dump)\n"
> diff --git a/criu/include/cr_options.h b/criu/include/cr_options.h
> index c1e8fbc..918bb4d 100644
> --- a/criu/include/cr_options.h
> +++ b/criu/include/cr_options.h
> @@ -108,6 +108,7 @@ struct cr_options {
>  	char			*lsm_profile;
>  	unsigned int		timeout;
>  	unsigned int		empty_ns;
> +	bool			lazy_pages;
>  };
>  
>  extern struct cr_options opts;
> diff --git a/criu/include/restorer.h b/criu/include/restorer.h
> index 9896aa1..86236b0 100644
> --- a/criu/include/restorer.h
> +++ b/criu/include/restorer.h
> @@ -124,6 +124,8 @@ struct task_restore_args {
>  	int				logfd;
>  	unsigned int			loglevel;
>  
> +	int				uffd;
> +
>  	/* threads restoration */
>  	int				nr_threads;		/* number of threads */
>  	thread_restore_fcall_t		clone_restore_fn;	/* helper address for clone() call */
> diff --git a/criu/include/uffd.h b/criu/include/uffd.h
> index d5a043b..6c931e2 100644
> --- a/criu/include/uffd.h
> +++ b/criu/include/uffd.h
> @@ -2,6 +2,7 @@
>  #define __CR_UFFD_H_
>  
>  #include "config.h"
> +#include "restorer.h"
>  
>  #ifdef CONFIG_HAS_UFFD
>  
> @@ -11,6 +12,11 @@
>  #ifndef __NR_userfaultfd
>  #error "missing __NR_userfaultfd definition"
>  #endif
> +
> +extern int setup_uffd(struct task_restore_args *task_args, int pid);
> +#else
> +static inline int setup_uffd(struct task_restore_args *task_args, int pid) { return 0; }
> +
>  #endif /* CONFIG_HAS_UFFD */
>  
>  #endif /* __CR_UFFD_H_ */
> diff --git a/criu/include/vma.h b/criu/include/vma.h
> index 247c5a3..28b77b5 100644
> --- a/criu/include/vma.h
> +++ b/criu/include/vma.h
> @@ -7,6 +7,8 @@
>  
>  #include "images/vma.pb-c.h"
>  
> +#include <sys/mman.h>
> +
>  struct vm_area_list {
>  	struct list_head	h;
>  	unsigned		nr;
> @@ -107,4 +109,11 @@ static inline bool vma_area_is_private(struct vma_area *vma,
>  	return vma_entry_is_private(vma->e, task_size);
>  }
>  
> +static inline bool vma_entry_can_be_lazy(VmaEntry *e)
> +{
> +	return ((e->flags & MAP_ANONYMOUS) &&
> +		(e->flags & MAP_PRIVATE) &&
> +		!(vma_entry_is(e, VMA_AREA_VSYSCALL)));
> +}
> +
>  #endif /* __CR_VMA_H__ */
> diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
> index f7bde75..4a009f4 100644
> --- a/criu/pie/restorer.c
> +++ b/criu/pie/restorer.c
> @@ -25,6 +25,7 @@
>  #include "image.h"
>  #include "sk-inet.h"
>  #include "vma.h"
> +#include "uffd.h"
>  
>  #include "crtools.h"
>  #include "lock.h"
> @@ -567,8 +568,50 @@ static void rst_tcp_socks_all(struct task_restore_args *ta)
>  		rst_tcp_repair_off(&ta->tcp_socks[i]);
>  }
>  
> -static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
> +
> +
> +
> +static int enable_uffd(int uffd, unsigned long addr, unsigned long len)
>  {
> +	/*
> +	 * If uffd == -1, this means that userfaultfd is not enabled
> +	 * or it is not available.
> +	 */
> +	if (uffd == -1)
> +		return 0;
> +#ifdef CONFIG_HAS_UFFD
> +	int rc;
> +	struct uffdio_register uffdio_register;
> +	unsigned long expected_ioctls;
> +
> +	uffdio_register.range.start = addr;
> +	uffdio_register.range.len = len;
> +	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> +	pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
> +	pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
> +	rc = sys_ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
> +	pr_info("lazy-pages: ioctl UFFDIO_REGISTER rc %d\n", rc);
> +	pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
> +	pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
> +	if (rc != 0)
> +		return -1;
> +
> +	expected_ioctls = (1 << _UFFDIO_WAKE) | (1 << _UFFDIO_COPY) | (1 << _UFFDIO_ZEROPAGE);
> +
> +	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
> +		pr_err("lazy-pages: unexpected missing uffd ioctl for anon memory\n");
> +	}
> +
> +#endif
> +	return 0;
> +}
> +
> +
> +static int vma_remap(VmaEntry *vma_entry, int uffd)
> +{
> +	unsigned long src = vma_premmaped_start(vma_entry);
> +	unsigned long dst = vma_entry->start;
> +	unsigned long len = vma_entry_len(vma_entry);
>  	unsigned long guard = 0, tmp;
>  
>  	pr_info("Remap %lx->%lx len %lx\n", src, dst, len);
> @@ -640,6 +683,18 @@ static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
>  		return -1;
>  	}
>  
> +	/*
> +	 * If running in userfaultfd/lazy-pages mode pages with
> +	 * MAP_ANONYMOUS and MAP_PRIVATE are remapped but without the
> +	 * real content.
> +	 * The function enable_uffd() marks the page(s) as userfaultfd
> +	 * pages, so that the processes will hang until the memory is
> +	 * injected via userfaultfd.
> +	 */
> +	if (vma_entry_can_be_lazy(vma_entry))
> +		if (enable_uffd(uffd, dst, len) != 0)
> +			return -1;
> +
>  	return 0;
>  }
>  
> @@ -898,6 +953,10 @@ long __export_restore_task(struct task_restore_args *args)
>  
>  	pr_info("Switched to the restorer %d\n", my_pid);
>  
> +	if (args->uffd > -1) {
> +		pr_debug("lazy-pages: uffd %d\n", args->uffd);
> +	}
> +
>  	if (vdso_do_park(&args->vdso_sym_rt, args->vdso_rt_parked_at, vdso_rt_size))
>  		goto core_restore_end;
>  
> @@ -918,8 +977,7 @@ long __export_restore_task(struct task_restore_args *args)
>  		if (vma_entry->start > vma_entry->shmid)
>  			break;
>  
> -		if (vma_remap(vma_premmaped_start(vma_entry),
> -				vma_entry->start, vma_entry_len(vma_entry)))
> +		if (vma_remap(vma_entry, args->uffd))
>  			goto core_restore_end;
>  	}
>  
> @@ -936,11 +994,20 @@ long __export_restore_task(struct task_restore_args *args)
>  		if (vma_entry->start < vma_entry->shmid)
>  			break;
>  
> -		if (vma_remap(vma_premmaped_start(vma_entry),
> -				vma_entry->start, vma_entry_len(vma_entry)))
> +		if (vma_remap(vma_entry, args->uffd))
>  			goto core_restore_end;
>  	}
>  
> +	if (args->uffd > -1) {
> +		pr_debug("lazy-pages: closing uffd %d\n", args->uffd);
> +		/*
> +		 * All userfaultfd configuration has finished at this point.
> +		 * Let's close the UFFD file descriptor, so that the restored
> +		 * process does not keep an open UFFD FD forever.
> +		 */
> +		sys_close(args->uffd);
> +	}
> +
>  	/*
>  	 * OK, lets try to map new one.
>  	 */
> diff --git a/criu/uffd.c b/criu/uffd.c
> index 6d8b286..3beae79 100644
> --- a/criu/uffd.c
> +++ b/criu/uffd.c
> @@ -30,6 +30,90 @@
>  #undef  LOG_PREFIX
>  #define LOG_PREFIX "lazy-pages: "
>  
> +static int send_uffd(int sendfd, int pid)
> +{
> +	int fd;
> +	int len;
> +	int ret = -1;
> +	struct sockaddr_un sun;
> +
> +	if (!opts.addr) {
> +		pr_info("Please specify a file name for the unix domain socket\n");
> +		pr_info("used to communicate between the lazy-pages server\n");
> +		pr_info("and the restore process. Use the --address option like\n");
> +		pr_info("criu restore --lazy-pages --address /tmp/userfault.socket\n");
> +		return -1;
> +	}
> +
> +	if (sendfd < 0)
> +		return -1;
> +
> +	if (strlen(opts.addr) >= sizeof(sun.sun_path)) {
> +		return -1;
> +	}
> +
> +	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
> +		return -1;
> +
> +	memset(&sun, 0, sizeof(sun));
> +	sun.sun_family = AF_UNIX;
> +	strcpy(sun.sun_path, opts.addr);
> +	len = offsetof(struct sockaddr_un, sun_path) + strlen(opts.addr);
> +	if (connect(fd, (struct sockaddr *) &sun, len) < 0) {
> +		pr_perror("connect to %s failed", opts.addr);
> +		goto out;
> +	}
> +
> +	/* The "transfer protocol" is first the pid as int and then
> +	 * the FD for UFFD */
> +	pr_debug("Sending PID %d\n", pid);
> +	if (send(fd, &pid, sizeof(pid), 0) < 0) {
> +		pr_perror("PID sending error");
> +		goto out;
> +	}
> +
> +	if (send_fd(fd, NULL, 0, sendfd) < 0) {
> +		pr_perror("send_fd error");
> +		goto out;
> +	}
> +	ret = 0;
> +out:
> +	close(fd);
> +	return ret;
> +}
> +
> +/* This function is used by 'criu restore --lazy-pages' */
> +int setup_uffd(struct task_restore_args *task_args, int pid)
> +{
> +	struct uffdio_api uffdio_api;
> +	/*
> +	 * Open userfaultfd FD which is passed to the restorer blob and
> +	 * to a second process handling the userfaultfd page faults.
> +	 */
> +	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +
> +	/*
> +	 * Check if the UFFD_API is the one which is expected
> +	 */
> +	uffdio_api.api = UFFD_API;
> +	uffdio_api.features = 0;
> +	if (ioctl(task_args->uffd, UFFDIO_API, &uffdio_api)) {
> +		pr_err("Checking for UFFDIO_API failed.\n");
> +		return -1;
> +	}
> +	if (uffdio_api.api != UFFD_API) {
> +		pr_err("Result of looking up UFFDIO_API does not match: %Lu\n", uffdio_api.api);
> +		return -1;
> +	}
> +
> +	if (send_uffd(task_args->uffd, pid) < 0) {
> +		close(task_args->uffd);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
>  static int server_listen(struct sockaddr_un *saddr)
>  {
>  	int fd;
> @@ -232,9 +316,7 @@ static int collect_uffd_pages(struct page_read *pr, struct list_head *uffd_list,
>  			 * in the VMA list.
>  			 */
>  			if (base >= vma->e->start && base < vma->e->end) {
> -				if ((vma->e->flags & MAP_ANONYMOUS) &&
> -				    (vma->e->flags & MAP_PRIVATE) &&
> -				    !(vma_area_is(vma, VMA_AREA_VSYSCALL))) {
> +				if (vma_entry_can_be_lazy(vma->e)) {
>  					uffd_page = true;
>  					if (vma_area_is(vma, VMA_AREA_VDSO))
>  						uffd_vdso = true;
> -- 
> 1.8.3.1
> 
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu