[CRIU] Remote lazy-restore design discussion

Mike Rapoport mike.rapoport at gmail.com
Mon Apr 4 09:20:33 PDT 2016


On Mon, Apr 04, 2016 at 04:06:50PM +0300, Pavel Emelyanov wrote:
> On 03/31/2016 05:25 PM, Adrian Reber wrote:
> > Hello Pavel,
> > 
> > after Mike asked whether there have been any design discussions, and since
> > I am not 100% sure how the page-server fits into the remote restore, it
> > seems to be a good idea to reach a common understanding of what the right
> > implementation for remote lazy-restore should look like.
> > 
> > I am using my implementation as a starting point for the discussion.
> > 
> > I think we need three different processes for remote lazy restore,
> > independent of how they are started. The 'destination system' is the system
> > the process should be migrated to and the 'source system' is the system the
> > original process was running on before the migration.
> > 
> >  1. The actual restore process (destination system):
> >      This is a 'normal' restore with the difference that memory pages
> >      (MAP_ANONYMOUS and MAP_PRIVATE) are not copied into place but
> >      are marked as being handled by userfaultfd. For this purpose
> >      a userfaultfd FD (UFFD) is opened and passed to a second process.
> > 
> >  2. The local lazy restore UFFD handler (destination system):
> >      This process listens on the UFFD for userfault requests and tries
> >      to handle them, either by reading the required pages from a local
> >      checkpoint (a rather unlikely use case) or by requesting the pages
> >      from a remote system (the source system) over the network.
> > 
> >  3. The remote lazy restore page request handler (source system):
> >      This process opens a network port and listens for page requests
> >      and reads the requested pages from a local checkpoint (or even
> >      better, directly from a stopped process).
> 
> Agreed. And the process #1 would eventually turn into the restored process(es).
> 
> I would also add that process 3 should not only listen for page requests, but
> also send other pages in the background. Probably the ideal process 3 should
> 
> 1. Have a queue of pages to be sent (struct page_server_iov-s)
> 2. Fill it with pages that were not transferred (ANON|PRIVATE)
> 3. Start sending them one by one
> 4. Receive messages from process #2 that can move some items to the
>    top of the queue (i.e. -- the pages that are needed right now)

Well, I actually thought more of a "pull" than a "push" approach. The pages
are collected into pagemap.img anyway, and it may be shared between the
source and destination. Then page-read on the restore (destination) side
works almost as it does now, except that instead of read(fd, ...) it does
recv(sock, ...). I have an ugly POC (below) that kinda demonstrates the idea.

If I understand your idea correctly, the dump side requires the addition of a
background process that will handle random page requests. Except for that, it
will work as described on the disk-less migration page of the CRIU wiki.
The restore side, however, should be able to request the faulting pages
from the remote side and also take care of the incoming stream of pages from
private anonymous mappings.

> > As this describes the solution I have implemented, it all sounds correct
> > to me. In addition to handling requests for pages (processes 2. and 3.),
> > both page handlers need to know how to push unrequested pages at some
> > point in time to make sure the migration can finish.
> > 
> > Looking at the page-server it is currently not clear to me how it fits
> > into this scenario. Currently it listens on a network port (like process
> > 3. from above) and writes the received pages to the local disk.
> 
> Not exactly. It redirects pages from the socket into a particular page_xfer.
> Right now the page server process only uses the local xfer, which results in
> pages being written to disk.
> 
> Also, the page server includes page_server_xfer which is used by criu dump
> to send the page, and this thing should be used by process 3.
 
> > To serve as process 3. mentioned above, it would need to learn all the
> > functionality as it has currently been implemented.
> 
> You mean the page server should be taught to work with uffd? Well, kinda yes.
> When I was talking about the uffd daemon using the page server, I meant that
> the uffd process (#2 in your classification) should use the page server
> protocol and a new page_xfer to transfer pages between hosts. And process #3
> should use the standard page_server_xfer to transfer pages onto the remote host.

In this case the new page_xfer should be able to put pages directly into
the memory of the process being restored, so it should be aware of the
current mappings of the private anonymous VMAs.
My gut feeling says it'll complicate synchronization between the dump and
restore sides.
 
> > Instead of receiving pages and writing them to disk, it needs to receive
> > page requests and read the pages from disk onto the network.
> 
> Why to disk? For post-copy live migration, using disk for images should
> be avoided as much as possible.
> 
> > This sounds like the opposite of what it is currently doing and,
> > from my point of view, it is either a complete separate process,
> > like my implementation, or all the functionality needs to be added.
> > Also the logic to handle unrequested pages does not seem like
> > something which the page-server can currently do or is designed to do.
> > 
> > So, from my point of view, page-server and remote page request handler
> > seem rather different in their functionality (besides being a TCP
> > server). I suppose there are some points I am not seeing so I hope to
> > understand the situation better from the answers to this email. Thanks.
> 
> Probably I was not correct when I used the word "page-server". I meant the
> components used by it, but you thought of it as a process itself :)

If I were to summarize my thoughts, I'd say I still don't see a clear picture
of the overall post-copy implementation :)
I'll do some more homework and try to somehow put the puzzle pieces together.

> -- Pavel

From 9dcc0588776c2974e698f651211472fcbb6bfc76 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt at linux.vnet.ibm.com>
Date: Mon, 4 Apr 2016 17:50:13 +0300
Subject: [UGLY PATCH] allow fetching pages from remote page-server

---
 criu/cr-restore.c         |   5 +-
 criu/crtools.c            |   4 ++
 criu/include/cr_options.h |   1 +
 criu/include/page-read.h  |   6 ++
 criu/include/page-xfer.h  |   7 ++
 criu/page-read.c          |  49 ++++++++++++--
 criu/page-xfer.c          | 163 +++++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 227 insertions(+), 8 deletions(-)

diff --git a/criu/cr-restore.c b/criu/cr-restore.c
index 2f51344..9843b46 100644
--- a/criu/cr-restore.c
+++ b/criu/cr-restore.c
@@ -453,10 +453,13 @@ static int restore_priv_vma_content(void)
 	unsigned int nr_lazy = 0;
 	unsigned long va;
 	struct page_read pr;
+	int pr_flags = PR_TASK;
 
 	vma = list_first_entry(vmas, struct vma_area, list);
 
-	ret = open_page_read(current->pid.virt, &pr, PR_TASK);
+	if (opts.use_page_client)
+		pr_flags |= PR_REMOTE;
+	ret = open_page_read(current->pid.virt, &pr, pr_flags);
 	if (ret <= 0)
 		return -1;
 
diff --git a/criu/crtools.c b/criu/crtools.c
index 6785c78..22449ee 100644
--- a/criu/crtools.c
+++ b/criu/crtools.c
@@ -325,6 +325,7 @@ int main(int argc, char *argv[], char *envp[])
 		{ "extra",			no_argument,		0, 1077	},
 		{ "experimental",		no_argument,		0, 1078	},
 		{ "all",			no_argument,		0, 1079	},
+		{ "page-client",		no_argument,		0, 1080	},
 		{ },
 	};
 
@@ -626,6 +627,9 @@ int main(int argc, char *argv[], char *envp[])
 			opts.check_extra_features = true;
 			opts.check_experimental_features = true;
 			break;
+		case 1080:
+			opts.use_page_client = true;
+			break;
 		case 'V':
 			pr_msg("Version: %s\n", CRIU_VERSION);
 			if (strcmp(CRIU_GITID, "0"))
diff --git a/criu/include/cr_options.h b/criu/include/cr_options.h
index 4853bea..07d9d4f 100644
--- a/criu/include/cr_options.h
+++ b/criu/include/cr_options.h
@@ -86,6 +86,7 @@ struct cr_options {
 	struct list_head	external;
 	char			*libdir;
 	bool			use_page_server;
+	bool			use_page_client;
 	unsigned short		port;
 	char			*addr;
 	int			ps_socket;
diff --git a/criu/include/page-read.h b/criu/include/page-read.h
index 3ba1ee9..abeb4c6 100644
--- a/criu/include/page-read.h
+++ b/criu/include/page-read.h
@@ -40,6 +40,8 @@
  * All this is implemented in read_pagemap_page.
  */
 
+struct page_xfer;
+
 struct page_read {
 	/*
 	 * gets next vaddr:len pair to work on.
@@ -57,6 +59,8 @@ struct page_read {
 	struct cr_img *pmi;
 	struct cr_img *pi;
 
+	struct page_xfer *xfer;		/* for remote page reader */
+
 	PagemapEntry *pe;		/* current pagemap we are on */
 	struct page_read *parent;	/* parent pagemap (if ->in_parent
 					   pagemap is met in image, then
@@ -75,6 +79,8 @@ struct page_read {
 #define PR_TYPE_MASK	0x3
 #define PR_MOD		0x4	/* Will need to modify */
 
+#define PR_REMOTE	0x8	/* will read pages from remote host */
+
 /*
  * -1 -- error
  *  0 -- no images
diff --git a/criu/include/page-xfer.h b/criu/include/page-xfer.h
index 8492daa..86a7c21 100644
--- a/criu/include/page-xfer.h
+++ b/criu/include/page-xfer.h
@@ -17,6 +17,10 @@ struct page_xfer {
 	int (*write_pages)(struct page_xfer *self, int pipe, unsigned long len);
 	/* transfers one hole -- vaddr:len entry w/o pages */
 	int (*write_hole)(struct page_xfer *self, struct iovec *iov);
+
+	int (*read_pages)(struct page_xfer *self, unsigned long vaddr,
+			  int nr, void *buf);
+
 	void (*close)(struct page_xfer *self);
 
 	/* private data for every page-xfer engine */
@@ -44,4 +48,7 @@ extern int disconnect_from_page_server(void);
 
 extern int check_parent_page_xfer(int fd_type, long id);
 
+extern int page_xfer_read_pages(struct page_xfer *xfer, unsigned long vaddr,
+				int nr, void *buf);
+
 #endif /* __CR_PAGE_XFER__H__ */
diff --git a/criu/page-read.c b/criu/page-read.c
index e5ec76a..8e23dc5 100644
--- a/criu/page-read.c
+++ b/criu/page-read.c
@@ -6,6 +6,7 @@
 #include "cr_options.h"
 #include "servicefd.h"
 #include "page-read.h"
+#include "page-xfer.h"
 
 #include "protobuf.h"
 #include "images/pagemap.pb-c.h"
@@ -193,6 +194,13 @@ static int read_pagemap_page(struct page_read *pr, unsigned long vaddr, int nr,
 			vaddr += p_nr * PAGE_SIZE;
 			buf += p_nr * PAGE_SIZE;
 		} while (nr);
+	} else if (pr->xfer) {
+		pr_debug("\tpr%u Read %d remote pages %lx\n", pr->id, nr, vaddr);
+		ret = page_xfer_read_pages(pr->xfer, vaddr, nr, buf);
+		if (ret) {
+			pr_err("cannot get remote pages\n");
+			return -1;
+		}
 	} else {
 		int fd = img_raw_fd(pr->pi);
 		off_t current_vaddr = lseek(fd, 0, SEEK_CUR);
@@ -237,6 +245,11 @@ static void close_page_read(struct page_read *pr)
 	close_image(pr->pmi);
 	if (pr->pi)
 		close_image(pr->pi);
+
+	if (pr->xfer) {
+		pr->xfer->close(pr->xfer);
+		free(pr->xfer);
+	}
 }
 
 static int try_open_parent(int dfd, int pid, struct page_read *pr, int pr_flags)
@@ -301,9 +314,23 @@ int open_page_read_at(int dfd, int pid, struct page_read *pr, int pr_flags)
 
 	pr->pe = NULL;
 	pr->parent = NULL;
+	pr->xfer = NULL;
 	pr->bunch.iov_len = 0;
 	pr->bunch.iov_base = NULL;
 
+	if (pr_flags & PR_REMOTE) {
+		pr->xfer = malloc(sizeof(*pr->xfer));
+		if (!pr->xfer) {
+			pr_err("failed to reserve memory for page-xfer\n");
+			return -1;
+		}
+
+		if (open_page_xfer(pr->xfer, CR_FD_PAGEMAP, pid)) {
+			pr_err("failed to open page-xfer\n");
+			return -1;
+		}
+	}
+
 	pr->pmi = open_image_at(dfd, i_typ, O_RSTR, (long)pid);
 	if (!pr->pmi)
 		return -1;
@@ -318,17 +345,29 @@ int open_page_read_at(int dfd, int pid, struct page_read *pr, int pr_flags)
 		return -1;
 	}
 
-	pr->pi = open_pages_image_at(dfd, flags, pr->pmi);
-	if (!pr->pi) {
-		close_page_read(pr);
-		return -1;
+	if (pr_flags & PR_REMOTE) {
+		PagemapHead *h;
+		if (pb_read_one(pr->pmi, &h, PB_PAGEMAP_HEAD) < 0) {
+			pr_err("%s: pb_read_one\n", __func__);
+			return -1;
+		}
+		pagemap_head__free_unpacked(h, NULL);
+
+		pr->skip_pages = NULL;
+	} else {
+		pr->pi = open_pages_image_at(dfd, flags, pr->pmi);
+		if (!pr->pi) {
+			close_page_read(pr);
+			return -1;
+		}
+
+		pr->skip_pages = skip_pagemap_pages;
 	}
 
 	pr->get_pagemap = get_pagemap;
 	pr->put_pagemap = put_pagemap;
 	pr->read_pages = read_pagemap_page;
 	pr->close = close_page_read;
-	pr->skip_pages = skip_pagemap_pages;
 	pr->id = ids++;
 
 	pr_debug("Opened page read %u (parent %u)\n",
diff --git a/criu/page-xfer.c b/criu/page-xfer.c
index 2ebe8cc..df85976 100644
--- a/criu/page-xfer.c
+++ b/criu/page-xfer.c
@@ -13,6 +13,7 @@
 #include "image.h"
 #include "page-xfer.h"
 #include "page-pipe.h"
+#include "page-read.h"
 #include "util.h"
 #include "protobuf.h"
 #include "images/pagemap.pb-c.h"
@@ -43,6 +44,8 @@ static int open_page_local_xfer(struct page_xfer *xfer, int fd_type, long id);
 #define PS_IOV_OPEN	3
 #define PS_IOV_OPEN2	4
 #define PS_IOV_PARENT	5
+#define PS_IOV_OPEN3	6
+#define PS_IOV_GET	7
 
 #define PS_IOV_FLUSH		0x1023
 #define PS_IOV_FLUSH_N_CLOSE	0x1024
@@ -112,6 +115,24 @@ static int page_server_open(int sk, struct page_server_iov *pi)
 	return 0;
 }
 
+static int page_server_open3(int sk, struct page_server_iov *pi)
+{
+	int type;
+	long id;
+	char has_parent = 23;
+
+	type = decode_pm_type(pi->dst_id);
+	id = decode_pm_id(pi->dst_id);
+	pr_debug("Opening %d/%ld\n", type, id);
+
+	if (write(sk, &has_parent, 1) != 1) {
+		pr_perror("Unable to send response");
+		return -1;
+	}
+
+	return 0;
+}
+
 static int prep_loc_xfer(struct page_server_iov *pi)
 {
 	if (cxfer.dst_id != pi->dst_id) {
@@ -176,6 +197,57 @@ static int page_server_hole(int sk, struct page_server_iov *pi)
 	return 0;
 }
 
+static int page_server_get(int sk, struct page_server_iov *pi)
+{
+	struct page_read page_read;
+	struct iovec iov;
+	unsigned long len;
+	int type, id, ret;
+	char *buf;
+
+	type = decode_pm_type(pi->dst_id);
+	id = decode_pm_id(pi->dst_id);
+	pr_debug("Get %d/%d\n", type, id);
+
+	len = pi->nr_pages * PAGE_SIZE;
+	buf = malloc(len);
+	if (!buf) {
+		pr_err("allocation failed\n");
+		return -1;
+	}
+
+	open_page_read(id, &page_read, PR_TASK);
+
+	ret = page_read.get_pagemap(&page_read, &iov);
+	pr_debug("get_pagemap ret %d\n", ret);
+	if (ret <= 0)
+		goto out;
+
+	ret = seek_pagemap_page(&page_read, pi->vaddr, true);
+	pr_debug("seek_pagemap_page ret 0x%x\n", ret);
+	if (ret <= 0)
+		goto out;
+
+	ret = page_read.read_pages(&page_read, pi->vaddr, pi->nr_pages, buf);
+	if (ret < 0) {
+		pr_err("%s: read_pages: %d\n", __func__, ret);
+		goto out;
+	}
+
+	ret = write(sk, buf, len);
+	if (ret != len) {
+		pr_err("%s: Can't send the pages:%d\n", __func__, ret);
+		ret = -1;
+		goto out;
+	}
+
+	page_read.close(&page_read);
+	ret = 0;
+out:
+	free(buf);
+	return ret;
+}
+
 static int page_server_check_parent(int sk, struct page_server_iov *pi);
 
 static int page_server_serve(int sk)
@@ -221,6 +293,9 @@ static int page_server_serve(int sk)
 		case PS_IOV_OPEN2:
 			ret = page_server_open(sk, &pi);
 			break;
+		case PS_IOV_OPEN3:
+			ret = page_server_open3(sk, &pi);
+			break;
 		case PS_IOV_PARENT:
 			ret = page_server_check_parent(sk, &pi);
 			break;
@@ -230,6 +305,9 @@ static int page_server_serve(int sk)
 		case PS_IOV_HOLE:
 			ret = page_server_hole(sk, &pi);
 			break;
+		case PS_IOV_GET:
+			ret = page_server_get(sk, &pi);
+			break;
 		case PS_IOV_FLUSH:
 		case PS_IOV_FLUSH_N_CLOSE:
 		{
@@ -322,8 +400,10 @@ static int page_server_sk = -1;
 
 int connect_to_page_server(void)
 {
-	if (!opts.use_page_server)
+	if (!(opts.use_page_server || opts.use_page_client)) {
+		pr_debug("neither page-server nor page-client mode requested\n");
 		return 0;
+	}
 
 	if (opts.ps_socket != -1) {
 		page_server_sk = opts.ps_socket;
@@ -332,8 +412,11 @@ int connect_to_page_server(void)
 	}
 
 	page_server_sk = setup_tcp_client(opts.addr);
-	if (page_server_sk == -1)
+	if (page_server_sk == -1) {
+		pr_err("Can't connect to the page server\n");
 		return -1;
+	}
+
 out:
 	/*
 	 * CORK the socket at the very beginning. As per ANK
@@ -431,6 +514,33 @@ static int write_hole_to_server(struct page_xfer *xfer, struct iovec *iov)
 	return 0;
 }
 
+static int read_pages_from_server(struct page_xfer *xfer, unsigned long vaddr,
+				  int nr, void *buf)
+{
+	struct page_server_iov pi;
+	unsigned long len;
+	int ret;
+
+	pi.cmd = PS_IOV_GET;
+	pi.dst_id = xfer->dst_id;
+	pi.nr_pages = nr;
+	pi.vaddr = vaddr;
+	len = nr * page_size();
+
+	if (write(xfer->sk, &pi, sizeof(pi)) != sizeof(pi)) {
+		pr_perror("Can't write GET cmd to server");
+		return -1;
+	}
+
+	ret = recv(xfer->sk, buf, len, MSG_WAITALL);
+	if (ret != len) {
+		pr_err("%s: recv failed: %d\n", __func__, ret);
+		return -1;
+	}
+
+	return 0;
+}
+
 static void close_server_xfer(struct page_xfer *xfer)
 {
 	xfer->sk = -1;
@@ -445,6 +555,7 @@ static int open_page_server_xfer(struct page_xfer *xfer, int fd_type, long id)
 	xfer->write_pagemap = write_pagemap_to_server;
 	xfer->write_pages = write_pages_to_server;
 	xfer->write_hole = write_hole_to_server;
+	xfer->read_pages = read_pages_from_server;
 	xfer->close = close_server_xfer;
 	xfer->dst_id = encode_pm_id(fd_type, id);
 	xfer->parent = NULL;
@@ -473,6 +584,46 @@ static int open_page_server_xfer(struct page_xfer *xfer, int fd_type, long id)
 	return 0;
 }
 
+static void close_client_xfer(struct page_xfer *xfer)
+{
+	close(xfer->sk);
+}
+
+static int open_page_client_xfer(struct page_xfer *xfer, int fd_type, long id)
+{
+	struct page_server_iov pi;
+	char has_parent;
+
+	connect_to_page_server();
+
+	xfer->sk = page_server_sk;
+	xfer->read_pages = read_pages_from_server;
+	xfer->close = close_client_xfer;
+	xfer->dst_id = encode_pm_id(fd_type, id);
+	xfer->parent = NULL;
+
+	pi.cmd = PS_IOV_OPEN3;
+	pi.dst_id = xfer->dst_id;
+	pi.vaddr = 0;
+	pi.nr_pages = 0;
+
+	if (write(xfer->sk, &pi, sizeof(pi)) != sizeof(pi)) {
+		pr_perror("Can't write to page server");
+		return -1;
+	}
+
+	/* Push the command NOW */
+	tcp_nodelay(xfer->sk, true);
+
+	if (read(xfer->sk, &has_parent, 1) != 1) {
+		pr_perror("The page server doesn't answer");
+		return -1;
+	}
+	pr_debug("has_parent=%d\n", has_parent);
+
+	return 0;
+}
+
 static int write_pagemap_loc(struct page_xfer *xfer,
 		struct iovec *iov)
 {
@@ -703,6 +854,8 @@ int open_page_xfer(struct page_xfer *xfer, int fd_type, long id)
 {
 	if (opts.use_page_server)
 		return open_page_server_xfer(xfer, fd_type, id);
+	else if (opts.use_page_client)
+		return open_page_client_xfer(xfer, fd_type, id);
 	else
 		return open_page_local_xfer(xfer, fd_type, id);
 }
@@ -785,3 +938,9 @@ int check_parent_page_xfer(int fd_type, long id)
 	else
 		return check_parent_local_xfer(fd_type, id);
 }
+
+int page_xfer_read_pages(struct page_xfer *xfer, unsigned long vaddr,
+			 int nr, void *buf)
+{
+	return xfer->read_pages ? xfer->read_pages(xfer, vaddr, nr, buf) : -1;
+}
-- 
1.9.1


