[CRIU] [RFC PATCH] Try to include userfaultfd with criu

Pavel Emelyanov xemul at parallels.com
Thu Sep 24 08:44:16 PDT 2015


On 09/24/2015 04:04 PM, adrian at lisas.de wrote:
> From: Adrian Reber <areber at redhat.com>
> 
> This is a first try to include userfaultfd with criu. Right now it
> still requires a "normal" checkpoint. After checkpointing the application
> it can be restored with the help of userfaultfd.

Thanks for looking into this :) Please, see my comments inline.

> The normal restore still copies all pages from the checkpoint to
> the right location (this still needs to be disabled) and the usage
> of userfaultfd is right now hardcoded in this patch.
> 
> All restored pages with MAP_ANONYMOUS set are marked as being handled by
> userfaultfd and also madvise()'d as MADV_DONTNEED.
> 
> If I also enable userfaultfd for pages without MAP_ANONYMOUS the restored process
> segfaults. I have not looked into more details into why it segfaults.

Hm... IIRC the uffd from Andrea only worked on anonymous private areas, so 
when you turned one one (and didn't check the ret code from register ioctl ;)
the respective vma just wasn't restored at all.

> As soon as the process is restored it blocks on the first memory access
> and waits for pages being transferred by userfaultfd.
> 
> To handle the required pages a new criu command has been added. The restore
> works now like this:
> 
>   criu restore -D /tmp/3 -j -v4
> 
> This hangs after the restored process is running and needs:
> 
>   criu uffd -v4 -D /tmp/3/

So the plan for uffd action is to start a daemon that would read
pages from images and put them into some uffd descriptor, right?

> This waits on a UFFD FD which has been passed by the 'criu restore' process over
> unix domain sockets for UFFD requests. For my current test programm following
> pages are transmitted over UFFD:
> 

[snip]

> TODO:
>  * Provide parameter for restore to not always use UFFD like described in the wiki:
>    http://criu.org/Userfaultfd : --lazy-pages
> 
>  * Do not restore pages which will be transferred via UFFD.
>    They still need to be mapped.
> 
>  * Unix domain sockets with credentials.

* Pick up the non-cooperative uffd patches, as real apps may fork() and mremap()
memory and this would screw things up.

Sigh... I will try to find time to rebase the patches onto recent tree and resend
them, but if you will get blocked by this, let me know :)

> @@ -2650,6 +2654,131 @@ out:
>  extern void __gcov_flush(void) __attribute__((weak));
>  void __gcov_flush(void) {}
>  
[snip]
> +out:
> +	close(client);
> +	free(name);
> +	return (ret);
> +}
> +
> +static int send_uffd(int fd, int send_fd)

We already have this code in criu :) The pie/util-fd.c
provides send_fds and recv_fds and in include/util-pie.h
there are helpers to send/recv a single FD using those.

> +{
> +	struct iovec iov[1];
> +	struct msghdr msg;
> +	char buf[2];
> +	struct cmsghdr *cmptr = NULL;
> +

> @@ -2931,6 +3064,31 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
>  
>  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
>  
> +	/*
> +	 * open userfaulfd FD which is to the restorer blob and
> +	 * to a second process handling the userfaultfd page faults
> +	 */
> +	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> +	pr_info("uffd %d\n", uffd);
> +
> +	if ((listen = server_listen()) < 0) {
> +		pr_info("server_listen error");
> +	}
> +
> +	/* accept new client request */
> +	if ((client = server_accept(listen)) < 0) {
> +		pr_info("server_accept error: %d", client);
> +	}
> +
> +	send_uffd(client, uffd);
> +
> +	close(listen);
> +	close(client);
> +
> +	install_service_fd(UFFD_FD_OFF, uffd);

It looks like the service descriptor is not required for uffd. Service
fds are those used all over the code and that needs to be put into some
"known" location in the code. The uffd one is only passed into restorer.

> +	close(uffd);
> +
> +	task_args->uffd = get_service_fd(UFFD_FD_OFF);
>  
>  	/*
>  	 * Fill up per-thread data.

> @@ -606,6 +639,8 @@ static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
>  		pr_err("Unable to remap %lx -> %lx\n", src, dst);
>  		return -1;
>  	}
> +	if (flags & 0x20)

Can you put named constants here instead of 0x20?

> +		enable_uffd(uffd, dst, len);
>  
>  	return 0;
>  }

> +
> +static int ud_open()
> +{
> +	int udfd;
> +	int newfd;
> +
> +	if ((udfd = client_conn()) < 0) {
> +		pr_err("unix domain socket connection error");
> +		return (-1);
> +	}
> +
> +	newfd = recv_fd(udfd);
> +	close(udfd);
> +
> +	return newfd;
> +}
> +
> +
> +static unsigned long find_page(unsigned long addr)

There's a seek_pagemap_page() call for this type of operation. One problem with
it is that it only works sequentially, i.e. when subsequent calls to it pass
constantly increasing addresses.

I'd suggest fixing this seek_pagemap_page() to overcome this limitation and use
one here. This would also be much faster than readdir() and open_page_read() for
every single page.

> +{
> +	struct dirent *ent;
> +	DIR *dirp = fdopendir(dup(criu_get_image_dir()));
> +	struct page_read pr;
> +	struct iovec iov;
> +	int pid;
> +	int ret = -7;
> +	unsigned long offset = 0;
> +
> +	while (1) {
> +		ent = readdir(dirp);
> +		if (ent == NULL) {
> +			if (errno) {
> +				pr_perror("Failed readdir, error=%d", errno);
> +				offset = -1;
> +				goto exit;
> +			}

> +int uffd_listen()
> +{
> +	struct uffd_msg msg;
> +	unsigned long offset = 0;
> +	int rc;
> +	int uffd;
> +	int uffd_flags;
> +	struct uffdio_api uffdio_api;
> +	struct uffdio_copy uffdio_copy;
> +	unsigned long ps;
> +	int pages;
> +	void *mapped_pages;
> +	__u64 flags;
> +	__u64 address;
> +
> +	if ((uffd = ud_open()) < 0)
> +		exit(0);
> +
> +	pr_info("uffd %d\n", uffd);
> +	uffd_flags = fcntl(uffd, F_GETFD, NULL);
> +	pr_info("uffd_flags 0x%x\n", uffd_flags);
> +	pr_info("UFFD_API 0x%llx\n", UFFD_API);
> +
> +	uffdio_api.api = UFFD_API;
> +	uffdio_api.features = 0;
> +
> +	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {

Why do you call UFFDIO_API for the 2nd time here? It should have been
called by restorer already.

> +		pr_err("UFFDIO_API ioctl() failed\n");
> +		return 1;
> +	}
> +
> +	if (uffdio_api.api != UFFD_API) {
> +		pr_err("UFFDIO_API error %Lu\n", uffdio_api.api);
> +		return 1;
> +	}
> +
> +	ps = page_size();
> +
> +	while (1) {

-- Pavel



More information about the CRIU mailing list