[CRIU] [RFC PATCH] Try to include userfaultfd with criu

Adrian Reber adrian at lisas.de
Mon Nov 9 05:31:54 PST 2015


On Thu, Sep 24, 2015 at 06:44:16PM +0300, Pavel Emelyanov wrote:
> On 09/24/2015 04:04 PM, adrian at lisas.de wrote:
> > From: Adrian Reber <areber at redhat.com>
> > 
> > This is a first try to include userfaultfd with criu. Right now it
> > still requires a "normal" checkpoint. After checkpointing the application
> > it can be restored with the help of userfaultfd.
> 
> Thanks for looking into this :) Please, see my comments inline.

Thanks for your comments. Took me some time to look at uffd code again.

> > The normal restore still copies all pages from the checkpoint to
> > the right location (this still needs to be disabled) and the usage
> > of userfaultfd is right now hardcoded in this patch.
> > 
> > All restored pages with MAP_ANONYMOUS set are marked as being handled by
> > userfaultfd and also madvise()'d as MADV_DONTNEED.
> > 
> > If I also enable userfaultfd for pages without MAP_ANONYMOUS the restored process
> > segfaults. I have not looked into more details into why it segfaults.
> 
> Hm... IIRC the uffd from Andrea only worked on anonymous private areas, so 
> when you turned one one (and didn't check the ret code from register ioctl ;)
> the respective vma just wasn't restored at all.

Yes, I think I also understood it now.

> > As soon as the process is restored it blocks on the first memory access
> > and waits for pages being transferred by userfaultfd.
> > 
> > To handle the required pages a new criu command has been added. The restore
> > works now like this:
> > 
> >   criu restore -D /tmp/3 -j -v4
> > 
> > This hangs after the restored process is running and needs:
> > 
> >   criu uffd -v4 -D /tmp/3/
> 
> So the plan for uffd action is to start a daemon that would read
> pages from images and put them into some uffd descriptor, right?

Yes, I have the normal criu restore process running now like this:

$ ./criu restore -D /tmp/3 -j -v4 --lazy-page

and it hangs on the uffd file descriptor until criu uffd daemon is
running:

$ ./criu uffd -v4 -D /tmp/3/

> > This waits on a UFFD FD which has been passed by the 'criu restore' process over
> > unix domain sockets for UFFD requests. For my current test programm following
> > pages are transmitted over UFFD:
> > 
> 
> [snip]
> 
> > TODO:
> >  * Provide parameter for restore to not always use UFFD like described in the wiki:
> >    http://criu.org/Userfaultfd : --lazy-pages
> > 
> >  * Do not restore pages which will be transferred via UFFD.
> >    They still need to be mapped.
> > 
> >  * Unix domain sockets with credentials.
> 
> * Pick up the non-cooperative uffd patches, as real apps may fork() and mremap()
> memory and this would screw things up.
> 
> Sigh... I will try to find time to rebase the patches onto recent tree and resend
> them, but if you will get blocked by this, let me know :)

Not necessary yet...

> > @@ -2650,6 +2654,131 @@ out:
> >  extern void __gcov_flush(void) __attribute__((weak));
> >  void __gcov_flush(void) {}
> >  
> [snip]
> > +out:
> > +	close(client);
> > +	free(name);
> > +	return (ret);
> > +}
> > +
> > +static int send_uffd(int fd, int send_fd)
> 
> We already have this code in criu :) The pie/util-fd.c
> provides send_fds and recv_fds and in include/util-pie.h
> there are helpers to send/recv a single FD using those.

I removed my own code handling the FD transfer and I am now using the
criu FD code.

> > +{
> > +	struct iovec iov[1];
> > +	struct msghdr msg;
> > +	char buf[2];
> > +	struct cmsghdr *cmptr = NULL;
> > +
> 
> > @@ -2931,6 +3064,31 @@ static int sigreturn_restore(pid_t pid, CoreEntry *core)
> >  
> >  	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
> >  
> > +	/*
> > +	 * open userfaulfd FD which is to the restorer blob and
> > +	 * to a second process handling the userfaultfd page faults
> > +	 */
> > +	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > +	pr_info("uffd %d\n", uffd);
> > +
> > +	if ((listen = server_listen()) < 0) {
> > +		pr_info("server_listen error");
> > +	}
> > +
> > +	/* accept new client request */
> > +	if ((client = server_accept(listen)) < 0) {
> > +		pr_info("server_accept error: %d", client);
> > +	}
> > +
> > +	send_uffd(client, uffd);
> > +
> > +	close(listen);
> > +	close(client);
> > +
> > +	install_service_fd(UFFD_FD_OFF, uffd);
> 
> It looks like the service descriptor is not required for uffd. Service
> fds are those used all over the code and that needs to be put into some
> "known" location in the code. The uffd one is only passed into restorer.

Okay, I am no longer installing UFFD as a service descriptor and it
still works.

> > +	close(uffd);
> > +
> > +	task_args->uffd = get_service_fd(UFFD_FD_OFF);
> >  
> >  	/*
> >  	 * Fill up per-thread data.
> 
> > @@ -606,6 +639,8 @@ static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
> >  		pr_err("Unable to remap %lx -> %lx\n", src, dst);
> >  		return -1;
> >  	}
> > +	if (flags & 0x20)
> 
> Can you put named constants here instead of 0x20?

Done.

> > +		enable_uffd(uffd, dst, len);
> >  
> >  	return 0;
> >  }
> 
> > +
> > +static int ud_open()
> > +{
> > +	int udfd;
> > +	int newfd;
> > +
> > +	if ((udfd = client_conn()) < 0) {
> > +		pr_err("unix domain socket connection error");
> > +		return (-1);
> > +	}
> > +
> > +	newfd = recv_fd(udfd);
> > +	close(udfd);
> > +
> > +	return newfd;
> > +}
> > +
> > +
> > +static unsigned long find_page(unsigned long addr)
> 
> There's a seek_pagemap_page() call for this type of operation. One problem with
> it is that it only works sequentially, i.e. when subsequent calls to it pass
> constantly increasing addresses.
> 
> I'd suggest fixing this seek_pagemap_page() to overcome this limitation and use
> one here. This would also be much faster than readdir() and open_page_read() for
> every single page.

I think I almost have this figured out. I am now doing:

pr->get_pagemap(pr, &iov);
seek_pagemap_page(pr, addr, true);
pr->read_page(pr, addr, buf);

As this only works for the first page I introduced a rewind_pagemap()
function with:

lseek(img_raw_fd(pr->pi), 0, SEEK_SET);

Unfortunately the next call to get_pagemap() now returns 0 instead of
the usual 1. So somehow I am missing something which is needed to rewind
the pagemap reading. Any idea how to do this better/correctly?

I can also post my current uffd patch if that makes discussion easier...

> > +{
> > +	struct dirent *ent;
> > +	DIR *dirp = fdopendir(dup(criu_get_image_dir()));
> > +	struct page_read pr;
> > +	struct iovec iov;
> > +	int pid;
> > +	int ret = -7;
> > +	unsigned long offset = 0;
> > +
> > +	while (1) {
> > +		ent = readdir(dirp);
> > +		if (ent == NULL) {
> > +			if (errno) {
> > +				pr_perror("Failed readdir, error=%d", errno);
> > +				offset = -1;
> > +				goto exit;
> > +			}
> 
> > +int uffd_listen()
> > +{
> > +	struct uffd_msg msg;
> > +	unsigned long offset = 0;
> > +	int rc;
> > +	int uffd;
> > +	int uffd_flags;
> > +	struct uffdio_api uffdio_api;
> > +	struct uffdio_copy uffdio_copy;
> > +	unsigned long ps;
> > +	int pages;
> > +	void *mapped_pages;
> > +	__u64 flags;
> > +	__u64 address;
> > +
> > +	if ((uffd = ud_open()) < 0)
> > +		exit(0);
> > +
> > +	pr_info("uffd %d\n", uffd);
> > +	uffd_flags = fcntl(uffd, F_GETFD, NULL);
> > +	pr_info("uffd_flags 0x%x\n", uffd_flags);
> > +	pr_info("UFFD_API 0x%llx\n", UFFD_API);
> > +
> > +	uffdio_api.api = UFFD_API;
> > +	uffdio_api.features = 0;
> > +
> > +	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
> 
> Why do you call UFFDIO_API for the 2nd time here? It should have been
> called by restorer already.

Need to look into this.

> > +		pr_err("UFFDIO_API ioctl() failed\n");
> > +		return 1;
> > +	}
> > +
> > +	if (uffdio_api.api != UFFD_API) {
> > +		pr_err("UFFDIO_API error %Lu\n", uffdio_api.api);
> > +		return 1;
> > +	}
> > +
> > +	ps = page_size();
> > +
> > +	while (1) {

		Adrian


More information about the CRIU mailing list