[CRIU] Remote lazy-restore design discussion
Pavel Emelyanov
xemul at virtuozzo.com
Tue Apr 5 09:04:45 PDT 2016
On 04/04/2016 07:20 PM, Mike Rapoport wrote:
> On Mon, Apr 04, 2016 at 04:06:50PM +0300, Pavel Emelyanov wrote:
>> On 03/31/2016 05:25 PM, Adrian Reber wrote:
>>> Hello Pavel,
>>>
>>> after Mike asked if there have been any design discussions, and since I
>>> am not 100% sure how the page-server fits into the remote restore, it
>>> seems to be a good idea to get a common understanding of what the right
>>> implementation for remote lazy-restore should look like.
>>>
>>> I am using my implementation as a starting point for the discussion.
>>>
>>> I think we need three different processes for remote lazy restore,
>>> independent of how they are started. The 'destination system' is the
>>> system the process should be migrated to and the 'source system' is the
>>> system the original process was running on before the migration.
>>>
>>> 1. The actual restore process (destination system):
>>> This is a 'normal' restore with the difference that memory pages
>>> (MAP_ANONYMOUS and MAP_PRIVATE) are not copied to their place
>>> but are marked as being handled by userfaultfd. Therefore
>>> a userfaultfd FD (UFFD) is opened and passed to a second process.
>>>
>>> 2. The local lazy restore UFFD handler (destination system):
>>> This process listens on the UFFD for userfault requests and tries to
>>> handle them, either by reading the required pages from a local
>>> checkpoint (a rather unlikely use case) or by requesting the pages
>>> from a remote system (the source system) via the network.
>>>
>>> 3. The remote lazy restore page request handler (source system):
>>> This process opens a network port and listens for page requests
>>> and reads the requested pages from a local checkpoint (or even
>>> better, directly from a stopped process).
>>
>> Agreed. And the process #1 would eventually turn into the restored process(es).
>>
>> I would also add that process 3 should not only listen for page requests, but
>> also send other pages in the background. Probably the ideal process 3 should
>>
>> 1. Have a queue of pages to be sent (struct page_server_iov-s)
>> 2. Fill it with pages that were not transferred (ANON|PRIVATE)
>> 3. Start sending them one by one
>> 4. Receive messages from process #2 that can move some items to
>> the top of the queue (i.e. -- the pages that are needed right now)
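The queueing in steps 1-4 above can be sketched in C. This is only an illustration: struct ps_iov is a simplified stand-in for CRIU's struct page_server_iov, and the hand-rolled singly-linked queue stands in for CRIU's list helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for CRIU's struct page_server_iov. */
struct ps_iov {
	unsigned long vaddr;
	unsigned int nr_pages;
	struct ps_iov *next;
};

struct send_queue {
	struct ps_iov *head;
	struct ps_iov **tail;
};

static void sq_init(struct send_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Step 2: queue a pagemap that was not transferred at dump time. */
static void sq_push(struct send_queue *q, struct ps_iov *iov)
{
	iov->next = NULL;
	*q->tail = iov;
	q->tail = &iov->next;
}

/* Step 3: pop the next pagemap to feed into ->writepage. */
static struct ps_iov *sq_pop(struct send_queue *q)
{
	struct ps_iov *iov = q->head;

	if (iov) {
		q->head = iov->next;
		if (!q->head)
			q->tail = &q->head;
	}
	return iov;
}

/* Step 4: a request from the uffd daemon moves the matching entry
 * to the top of the queue so it is sent next. Returns 1 if found. */
static int sq_promote(struct send_queue *q, unsigned long vaddr)
{
	struct ps_iov **p, *iov;

	for (p = &q->head; (iov = *p); p = &iov->next) {
		if (vaddr < iov->vaddr ||
		    vaddr >= iov->vaddr + iov->nr_pages * 4096UL)
			continue;
		/* Unlink from its current position... */
		*p = iov->next;
		if (q->tail == &iov->next)
			q->tail = p;
		/* ...and relink at the head. */
		iov->next = q->head;
		if (q->tail == &q->head)
			q->tail = &iov->next;
		q->head = iov;
		return 1;
	}
	return 0;
}
```

Promotion only reorders entries, it never drops them, so the background traffic still drains the whole queue even when faults keep arriving.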
>
> Well, I actually thought more about a "pull" than a "push" approach. The
> pages are anyway collected into pagemap.img and it may be shared between
> source and destination. Then page-read on the restore (destination) side
> works almost as now, just instead of read(fd, ...) it does recv(sock, ...).
> I have an ugly POC (below) that kinda demonstrates the idea.
>
> If I understand your idea correctly, the dump side requires the addition
> of a background process that will handle random page requests.
Well, yes. Think of it this way -- the restore side daemon will have to
poll many uffds and handle incoming data from the socket. So it will be
a state machine that keeps track of pids, associated uffds and the reasons
for which they are hung. So adding background traffic there shouldn't be
a big deal.
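A minimal sketch of the bookkeeping such a state machine needs, keyed by the uffd the poll loop wakes up on. The field names here are illustrative guesses, not CRIU code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record the restore-side daemon tracks:
 * which restored pid owns which uffd, and how many faults are
 * still waiting for pages from the dump side. */
struct lazy_data {
	int pid;         /* restored task */
	int uffd;        /* its userfaultfd */
	int nr_pending;  /* faults not yet served */
};

#define MAX_TASKS 64

static struct lazy_data tasks[MAX_TASKS];
static int nr_tasks;

/* Register a (pid, uffd) pair when restore hands the uffd over. */
static struct lazy_data *ld_add(int pid, int uffd)
{
	struct lazy_data *ld;

	if (nr_tasks >= MAX_TASKS)
		return NULL;
	ld = &tasks[nr_tasks++];
	ld->pid = pid;
	ld->uffd = uffd;
	ld->nr_pending = 0;
	return ld;
}

/* The poll loop wakes up with a ready fd; map it back to the task
 * whose fault has to be handled. */
static struct lazy_data *ld_by_uffd(int uffd)
{
	int i;

	for (i = 0; i < nr_tasks; i++)
		if (tasks[i].uffd == uffd)
			return &tasks[i];
	return NULL;
}
```

A real implementation would use a hash rather than a linear scan, but the point is the same: every readable fd resolves to one task's state.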
> Except for that, it
> will work as described on the disk-less migration page on the criu wiki.
> The restore side, however, should be able to request faulting random pages
> from the remote and also take care of the incoming stream of pages from
> private anonymous mappings.
>
>>> As this describes the solution I have implemented it all sounds correct
>>> to me. In addition to handling requests for pages (processes 2. and 3.)
>>> both page handlers need to know how to push unrequested pages at some
>>> point in time to make sure the migration can finish.
>>>
>>> Looking at the page-server it is currently not clear to me how it fits
>>> into this scenario. Currently it listens on a network port (like process
>>> 3. from above) and writes the received pages to the local disk.
>>
>> Not exactly. It redirects pages from the socket into a particular page_xfer.
>> Right now the page server process only uses the local xfer, which results
>> in pages being written to disk.
>>
>> Also, the page server includes page_server_xfer which is used by criu dump
>> to send the page, and this thing should be used by process 3.
>
>>> To serve as the process mentioned as process 3. it would need to learn
>>> all the functionality that has currently been implemented.
>>
>> You mean the page server should be taught to work with uffd? Well, kinda yes;
>> when I was talking about the uffd daemon using the page server, I meant that
>> the uffd process (#2 in your classification) should use the page server
>> protocol and a new page_xfer to transfer pages between hosts. And process #3
>> should use the standard page_server_xfer to transfer pages onto the remote host.
>
> In this case the new page_xfer should be able to put pages directly into
> the memory of the process being restored, so it should be aware of the
> current mappings for the private anonymous VMAs.
> My gut feeling says it'll complicate synchronization between the dump and
> restore sides.
Hm... Maybe...
>>> Instead of receiving pages and writing them to disk it needs to
>>> receive page requests and read them from disk to the network.
>>
>> Why write to disk? For post-copy live migration, using disk for images
>> should be avoided as much as possible.
>>
>>> This sounds like the opposite of what it is currently doing and,
>>> from my point of view, it is either a completely separate process,
>>> like my implementation, or all the functionality needs to be added.
>>> Also the logic to handle unrequested pages does not seem like
>>> something which the page-server can currently do or is designed to do.
>>>
>>> So, from my point of view, page-server and remote page request handler
>>> seem rather different in their functionality (besides being a TCP
>>> server). I suppose there are some points I am not seeing so I hope to
>>> understand the situation better from the answers to this email. Thanks.
>>
>> Probably I was not correct when I used the word "page-server". I meant the
>> components used by it, but you thought of it as a process itself :)
>
> If I'd summarize my thoughts, I'd say I still don't see a clear picture of
> the overall post-copy implementation :)
> I'll do some more homework and try to somehow put the puzzle
> pieces together.
Well, how about this:
I. Dump side.
The criu dump process dumps everything but the lazy pagemaps; the lazy
pagemaps are skipped and queued.
Then criu dump spawns a daemon that opens a connection to the remote host,
creates a page_server_xfer, takes the queue of pagemaps that are to be sent
to it and starts polling the xfer socket.
When the socket is available for read, the daemon gets a request for a
particular pagemap and moves it to the top of the queue.
When it is available for write, the daemon takes the next pagemap from the
queue and ->writepage-s it to the page_xfer.
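The daemon's wait step can be a plain poll() on the xfer socket for both directions; a sketch (xfer_wait is a hypothetical helper name, not CRIU code):

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* One iteration of the dump-side daemon loop: wait until the xfer
 * socket is readable (a prioritised page request came in) or
 * writable (the next queued pagemap can be pushed). Returns the
 * revents mask, 0 on timeout, -1 on error. */
static int xfer_wait(int sk, int timeout_ms)
{
	struct pollfd pfd = {
		.fd = sk,
		.events = POLLIN | POLLOUT,
	};
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return ret ? pfd.revents : 0;
}
```

On POLLIN the daemon would read a page_server_iov request and promote the matching queue entry; on POLLOUT it would pop the queue head and ->writepage it.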
II. Restore side
The uffd daemon is spawned; it opens a port (to which dump will connect, or
uses the connection provided via opts.ps_socket -- connect_to_page_server()
knows about this), creates a hash with (pid, uffd, pagemaps) structures
(called lazy_data below) and listens.
Restore prepares the processes and mappings (Adrian's code already does
this), sending the uffd-s to the uffd daemon (also already there).
The uffd daemon starts polling all uffds it has and the connection from the
dump side.
When a uffd is available for read, the daemon gets the #PF info, then goes
to the new page_read that sends the page_server_iov request for the
out-of-order page (note that in the case of lazy restore from images the
regular page_read is used).
Most of this code is already in criu-dev from Adrian and you, but we need to
add multi-uffd polling, the lazy_data thing and the ability to handle a "page
will be available later" response from the page_read.
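For illustration, the #PF-to-request translation amounts to page-aligning the faulting address and filling a page_server_iov-like record. A sketch with a simplified stand-in struct, using the PS_IOV_GET value from the POC patch below:

```c
#include <assert.h>
#include <stdint.h>

#define PS_IOV_GET 7          /* as in the POC patch */
#define PAGE_SIZE_ 4096UL

/* Hypothetical wire request, modelled on struct page_server_iov. */
struct ps_request {
	uint32_t cmd;
	uint32_t nr_pages;
	uint64_t vaddr;
	uint64_t dst_id;
};

/* Turn the fault address reported by the uffd into an out-of-order
 * page request: the kernel reports the exact faulting address, while
 * the page server works at page granularity. */
static void fill_page_request(struct ps_request *req, uint64_t dst_id,
			      uint64_t fault_addr)
{
	req->cmd = PS_IOV_GET;
	req->nr_pages = 1;
	req->vaddr = fault_addr & ~(PAGE_SIZE_ - 1);
	req->dst_id = dst_id;
}
```

The daemon would then write this record to the dump-side socket and mark the task as waiting in its lazy_data entry.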
When the dump side connection is available for reading, the daemon calls the
core part of the page_server_serve() routine that reads from the socket and
handles the PS_IOV_FOO commands. The page_xfer used in _this_ case is the one
that finds the appropriate lazy_data and calls the map + wakeup ioctls.
This part is not ready, and this is what I meant when I was talking about
re-using the page-server code with a new page_xfer and page_read.
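The map + wakeup step would be built around the UFFDIO_COPY ioctl; here is a sketch of preparing its argument. The struct is a simplified stand-in for the kernel's struct uffdio_copy (see linux/userfaultfd.h); the real code would follow with ioctl(uffd, UFFDIO_COPY, &args), which both maps the page and wakes the faulting thread:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096UL

/* Simplified stand-in for the kernel's struct uffdio_copy; only the
 * fields the map + wakeup step needs. */
struct uffdio_copy_args {
	uint64_t dst;   /* faulting address in the restored task */
	uint64_t src;   /* buffer holding the received page data */
	uint64_t len;
	uint64_t mode;  /* 0: copy and wake the faulting thread */
};

/* Prepare the UFFDIO_COPY argument for pages that just arrived from
 * the dump side. */
static void fill_uffdio_copy(struct uffdio_copy_args *args,
			     uint64_t vaddr, const void *buf,
			     unsigned int nr_pages)
{
	args->dst = vaddr & ~(PAGE_SZ - 1);
	args->src = (uint64_t)(uintptr_t)buf;
	args->len = nr_pages * PAGE_SZ;
	args->mode = 0;
}
```

With mode 0 a single ioctl performs both the "map" and the "wakeup", so the per-lazy_data page_xfer needs no separate UFFDIO_WAKE call on this path.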
Does this make sense?
-- Pavel
>> -- Pavel
>
> From 9dcc0588776c2974e698f651211472fcbb6bfc76 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt at linux.vnet.ibm.com>
> Date: Mon, 4 Apr 2016 17:50:13 +0300
> Subject: [UGLY PATCH] allow fetching pages from remote page-server
>
> ---
> criu/cr-restore.c | 5 +-
> criu/crtools.c | 4 ++
> criu/include/cr_options.h | 1 +
> criu/include/page-read.h | 6 ++
> criu/include/page-xfer.h | 7 ++
> criu/page-read.c | 49 ++++++++++++--
> criu/page-xfer.c | 163 +++++++++++++++++++++++++++++++++++++++++++++-
> 7 files changed, 227 insertions(+), 8 deletions(-)
>
> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> index 2f51344..9843b46 100644
> --- a/criu/cr-restore.c
> +++ b/criu/cr-restore.c
> @@ -453,10 +453,13 @@ static int restore_priv_vma_content(void)
> unsigned int nr_lazy = 0;
> unsigned long va;
> struct page_read pr;
> + int pr_flags = PR_TASK;
>
> vma = list_first_entry(vmas, struct vma_area, list);
>
> - ret = open_page_read(current->pid.virt, &pr, PR_TASK);
> + if (opts.use_page_client)
> + pr_flags |= PR_REMOTE;
> + ret = open_page_read(current->pid.virt, &pr, pr_flags);
> if (ret <= 0)
> return -1;
>
> diff --git a/criu/crtools.c b/criu/crtools.c
> index 6785c78..22449ee 100644
> --- a/criu/crtools.c
> +++ b/criu/crtools.c
> @@ -325,6 +325,7 @@ int main(int argc, char *argv[], char *envp[])
> { "extra", no_argument, 0, 1077 },
> { "experimental", no_argument, 0, 1078 },
> { "all", no_argument, 0, 1079 },
> + { "page-client", no_argument, 0, 1080 },
> { },
> };
>
> @@ -626,6 +627,9 @@ int main(int argc, char *argv[], char *envp[])
> opts.check_extra_features = true;
> opts.check_experimental_features = true;
> break;
> + case 1080:
> + opts.use_page_client = true;
> + break;
> case 'V':
> pr_msg("Version: %s\n", CRIU_VERSION);
> if (strcmp(CRIU_GITID, "0"))
> diff --git a/criu/include/cr_options.h b/criu/include/cr_options.h
> index 4853bea..07d9d4f 100644
> --- a/criu/include/cr_options.h
> +++ b/criu/include/cr_options.h
> @@ -86,6 +86,7 @@ struct cr_options {
> struct list_head external;
> char *libdir;
> bool use_page_server;
> + bool use_page_client;
> unsigned short port;
> char *addr;
> int ps_socket;
> diff --git a/criu/include/page-read.h b/criu/include/page-read.h
> index 3ba1ee9..abeb4c6 100644
> --- a/criu/include/page-read.h
> +++ b/criu/include/page-read.h
> @@ -40,6 +40,8 @@
> * All this is implemented in read_pagemap_page.
> */
>
> +struct page_xfer;
> +
> struct page_read {
> /*
> * gets next vaddr:len pair to work on.
> @@ -57,6 +59,8 @@ struct page_read {
> struct cr_img *pmi;
> struct cr_img *pi;
>
> + struct page_xfer *xfer; /* for remote page reader */
> +
> PagemapEntry *pe; /* current pagemap we are on */
> struct page_read *parent; /* parent pagemap (if ->in_parent
> pagemap is met in image, then
> @@ -75,6 +79,8 @@ struct page_read {
> #define PR_TYPE_MASK 0x3
> #define PR_MOD 0x4 /* Will need to modify */
>
> +#define PR_REMOTE 0x8 /* will read pages from remote host */
> +
> /*
> * -1 -- error
> * 0 -- no images
> diff --git a/criu/include/page-xfer.h b/criu/include/page-xfer.h
> index 8492daa..86a7c21 100644
> --- a/criu/include/page-xfer.h
> +++ b/criu/include/page-xfer.h
> @@ -17,6 +17,10 @@ struct page_xfer {
> int (*write_pages)(struct page_xfer *self, int pipe, unsigned long len);
> /* transfers one hole -- vaddr:len entry w/o pages */
> int (*write_hole)(struct page_xfer *self, struct iovec *iov);
> +
> + int (*read_pages)(struct page_xfer *self, unsigned long vaddr,
> + int nr, void *buf);
> +
> void (*close)(struct page_xfer *self);
>
> /* private data for every page-xfer engine */
> @@ -44,4 +48,7 @@ extern int disconnect_from_page_server(void);
>
> extern int check_parent_page_xfer(int fd_type, long id);
>
> +extern int page_xfer_read_pages(struct page_xfer *xfer, unsigned long vaddr,
> + int nr, void *buf);
> +
> #endif /* __CR_PAGE_XFER__H__ */
> diff --git a/criu/page-read.c b/criu/page-read.c
> index e5ec76a..8e23dc5 100644
> --- a/criu/page-read.c
> +++ b/criu/page-read.c
> @@ -6,6 +6,7 @@
> #include "cr_options.h"
> #include "servicefd.h"
> #include "page-read.h"
> +#include "page-xfer.h"
>
> #include "protobuf.h"
> #include "images/pagemap.pb-c.h"
> @@ -193,6 +194,13 @@ static int read_pagemap_page(struct page_read *pr, unsigned long vaddr, int nr,
> vaddr += p_nr * PAGE_SIZE;
> buf += p_nr * PAGE_SIZE;
> } while (nr);
> + } else if (pr->xfer) {
> + pr_debug("\tpr%u Read %d remote pages %lx\n", pr->id, nr, vaddr);
> + ret = page_xfer_read_pages(pr->xfer, vaddr, nr, buf);
> + if (ret) {
> + pr_err("cannot get remote pages\n");
> + return -1;
> + }
> } else {
> int fd = img_raw_fd(pr->pi);
> off_t current_vaddr = lseek(fd, 0, SEEK_CUR);
> @@ -237,6 +245,11 @@ static void close_page_read(struct page_read *pr)
> close_image(pr->pmi);
> if (pr->pi)
> close_image(pr->pi);
> +
> + if (pr->xfer) {
> + pr->xfer->close(pr->xfer);
> + free(pr->xfer);
> + }
> }
>
> static int try_open_parent(int dfd, int pid, struct page_read *pr, int pr_flags)
> @@ -301,9 +314,23 @@ int open_page_read_at(int dfd, int pid, struct page_read *pr, int pr_flags)
>
> pr->pe = NULL;
> pr->parent = NULL;
> + pr->xfer = NULL;
> pr->bunch.iov_len = 0;
> pr->bunch.iov_base = NULL;
>
> + if (pr_flags & PR_REMOTE) {
> + pr->xfer = malloc(sizeof(*pr->xfer));
> + if (!pr->xfer) {
> + pr_err("failed to reserve memory for page-xfer\n");
> + return -1;
> + }
> +
> + if (open_page_xfer(pr->xfer, CR_FD_PAGEMAP, pid)) {
> + pr_err("failed to open page-xfer\n");
> + return -1;
> + }
> + }
> +
> pr->pmi = open_image_at(dfd, i_typ, O_RSTR, (long)pid);
> if (!pr->pmi)
> return -1;
> @@ -318,17 +345,29 @@ int open_page_read_at(int dfd, int pid, struct page_read *pr, int pr_flags)
> return -1;
> }
>
> - pr->pi = open_pages_image_at(dfd, flags, pr->pmi);
> - if (!pr->pi) {
> - close_page_read(pr);
> - return -1;
> + if (pr_flags & PR_REMOTE) {
> + PagemapHead *h;
> + if (pb_read_one(pr->pmi, &h, PB_PAGEMAP_HEAD) < 0) {
> + pr_err("%s: pb_read_one\n", __func__);
> + return -1;
> + }
> + pagemap_head__free_unpacked(h, NULL);
> +
> + pr->skip_pages = NULL;
> + } else {
> + pr->pi = open_pages_image_at(dfd, flags, pr->pmi);
> + if (!pr->pi) {
> + close_page_read(pr);
> + return -1;
> + }
> +
> + pr->skip_pages = skip_pagemap_pages;
> }
>
> pr->get_pagemap = get_pagemap;
> pr->put_pagemap = put_pagemap;
> pr->read_pages = read_pagemap_page;
> pr->close = close_page_read;
> - pr->skip_pages = skip_pagemap_pages;
> pr->id = ids++;
>
> pr_debug("Opened page read %u (parent %u)\n",
> diff --git a/criu/page-xfer.c b/criu/page-xfer.c
> index 2ebe8cc..df85976 100644
> --- a/criu/page-xfer.c
> +++ b/criu/page-xfer.c
> @@ -13,6 +13,7 @@
> #include "image.h"
> #include "page-xfer.h"
> #include "page-pipe.h"
> +#include "page-read.h"
> #include "util.h"
> #include "protobuf.h"
> #include "images/pagemap.pb-c.h"
> @@ -43,6 +44,8 @@ static int open_page_local_xfer(struct page_xfer *xfer, int fd_type, long id);
> #define PS_IOV_OPEN 3
> #define PS_IOV_OPEN2 4
> #define PS_IOV_PARENT 5
> +#define PS_IOV_OPEN3 6
> +#define PS_IOV_GET 7
>
> #define PS_IOV_FLUSH 0x1023
> #define PS_IOV_FLUSH_N_CLOSE 0x1024
> @@ -112,6 +115,24 @@ static int page_server_open(int sk, struct page_server_iov *pi)
> return 0;
> }
>
> +static int page_server_open3(int sk, struct page_server_iov *pi)
> +{
> + int type;
> + long id;
> + char has_parent = 23;
> +
> + type = decode_pm_type(pi->dst_id);
> + id = decode_pm_id(pi->dst_id);
> + pr_debug("Opening %d/%ld\n", type, id);
> +
> + if (write(sk, &has_parent, 1) != 1) {
> + pr_perror("Unable to send response");
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> static int prep_loc_xfer(struct page_server_iov *pi)
> {
> if (cxfer.dst_id != pi->dst_id) {
> @@ -176,6 +197,57 @@ static int page_server_hole(int sk, struct page_server_iov *pi)
> return 0;
> }
>
> +static int page_server_get(int sk, struct page_server_iov *pi)
> +{
> + struct page_read page_read;
> + struct iovec iov;
> + unsigned long len;
> + int type, id, ret;
> + char *buf;
> +
> + type = decode_pm_type(pi->dst_id);
> + id = decode_pm_id(pi->dst_id);
> + pr_debug("Get %d/%d\n", type, id);
> +
> + len = pi->nr_pages * PAGE_SIZE;
> + buf = malloc(len);
> + if (!buf) {
> + pr_err("allocation failed\n");
> + return -1;
> + }
> +
> + open_page_read(id, &page_read, PR_TASK);
> +
> + ret = page_read.get_pagemap(&page_read, &iov);
> + pr_debug("get_pagemap ret %d\n", ret);
> + if (ret <= 0)
> + return ret;
> +
> + ret = seek_pagemap_page(&page_read, pi->vaddr, true);
> + pr_debug("seek_pagemap_page ret 0x%x\n", ret);
> + if (ret <= 0)
> + return ret;
> +
> + ret = page_read.read_pages(&page_read, pi->vaddr, pi->nr_pages, buf);
> + if (ret < 0) {
> + pr_err("%s: read_pages: %d\n", __func__, ret);
> + goto out;
> + }
> +
> + ret = write(sk, buf, len);
> + if (ret != len) {
> + pr_err("%s: Can't send the pages:%d\n", __func__, ret);
> + ret = -1;
> + goto out;
> + }
> +
> + page_read.close(&page_read);
> + ret = 0;
> +out:
> + free(buf);
> + return ret;
> +}
> +
> static int page_server_check_parent(int sk, struct page_server_iov *pi);
>
> static int page_server_serve(int sk)
> @@ -221,6 +293,9 @@ static int page_server_serve(int sk)
> case PS_IOV_OPEN2:
> ret = page_server_open(sk, &pi);
> break;
> + case PS_IOV_OPEN3:
> + ret = page_server_open3(sk, &pi);
> + break;
> case PS_IOV_PARENT:
> ret = page_server_check_parent(sk, &pi);
> break;
> @@ -230,6 +305,9 @@ static int page_server_serve(int sk)
> case PS_IOV_HOLE:
> ret = page_server_hole(sk, &pi);
> break;
> + case PS_IOV_GET:
> + ret = page_server_get(sk, &pi);
> + break;
> case PS_IOV_FLUSH:
> case PS_IOV_FLUSH_N_CLOSE:
> {
> @@ -322,8 +400,10 @@ static int page_server_sk = -1;
>
> int connect_to_page_server(void)
> {
> - if (!opts.use_page_server)
> + if (!(opts.use_page_server || opts.use_page_client)) {
> + pr_err("mutually exclusive page-server and page-client options\n");
> return 0;
> + }
>
> if (opts.ps_socket != -1) {
> page_server_sk = opts.ps_socket;
> @@ -332,8 +412,11 @@ int connect_to_page_server(void)
> }
>
> page_server_sk = setup_tcp_client(opts.addr);
> - if (page_server_sk == -1)
> + if (page_server_sk == -1) {
> + pr_err("setup_tcp_client\n");
> return -1;
> + }
> +
> out:
> /*
> * CORK the socket at the very beginning. As per ANK
> @@ -431,6 +514,33 @@ static int write_hole_to_server(struct page_xfer *xfer, struct iovec *iov)
> return 0;
> }
>
> +static int read_pages_from_server(struct page_xfer *xfer, unsigned long vaddr,
> + int nr, void *buf)
> +{
> + struct page_server_iov pi;
> + unsigned long len;
> + int ret;
> +
> + pi.cmd = PS_IOV_GET;
> + pi.dst_id = xfer->dst_id;
> + pi.nr_pages = nr;
> + pi.vaddr = vaddr;
> + len = nr * page_size();
> +
> + if (write(xfer->sk, &pi, sizeof(pi)) != sizeof(pi)) {
> + pr_perror("Can't write GET cmd to server");
> + return -1;
> + }
> +
> + ret = recv(xfer->sk, buf, len, MSG_WAITALL);
> + if (ret != len) {
> + pr_err("%s: recv failed: %d\n", __func__, ret);
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> static void close_server_xfer(struct page_xfer *xfer)
> {
> xfer->sk = -1;
> @@ -445,6 +555,7 @@ static int open_page_server_xfer(struct page_xfer *xfer, int fd_type, long id)
> xfer->write_pagemap = write_pagemap_to_server;
> xfer->write_pages = write_pages_to_server;
> xfer->write_hole = write_hole_to_server;
> + xfer->read_pages = read_pages_from_server;
> xfer->close = close_server_xfer;
> xfer->dst_id = encode_pm_id(fd_type, id);
> xfer->parent = NULL;
> @@ -473,6 +584,46 @@ static int open_page_server_xfer(struct page_xfer *xfer, int fd_type, long id)
> return 0;
> }
>
> +static void close_client_xfer(struct page_xfer *xfer)
> +{
> + close(xfer->sk);
> +}
> +
> +static int open_page_client_xfer(struct page_xfer *xfer, int fd_type, long id)
> +{
> + struct page_server_iov pi;
> + char has_parent;
> +
> + connect_to_page_server();
> +
> + xfer->sk = page_server_sk;
> + xfer->read_pages = read_pages_from_server;
> + xfer->close = close_client_xfer;
> + xfer->dst_id = encode_pm_id(fd_type, id);
> + xfer->parent = NULL;
> +
> + pi.cmd = PS_IOV_OPEN3;
> + pi.dst_id = xfer->dst_id;
> + pi.vaddr = 0;
> + pi.nr_pages = 0;
> +
> + if (write(xfer->sk, &pi, sizeof(pi)) != sizeof(pi)) {
> + pr_perror("Can't write to page server");
> + return -1;
> + }
> +
> + /* Push the command NOW */
> + tcp_nodelay(xfer->sk, true);
> +
> + if (read(xfer->sk, &has_parent, 1) != 1) {
> + pr_perror("The page server doesn't answer");
> + return -1;
> + }
> + pr_debug("has_parent=%d\n", has_parent);
> +
> + return 0;
> +}
> +
> static int write_pagemap_loc(struct page_xfer *xfer,
> struct iovec *iov)
> {
> @@ -703,6 +854,8 @@ int open_page_xfer(struct page_xfer *xfer, int fd_type, long id)
> {
> if (opts.use_page_server)
> return open_page_server_xfer(xfer, fd_type, id);
> + else if (opts.use_page_client)
> + return open_page_client_xfer(xfer, fd_type, id);
> else
> return open_page_local_xfer(xfer, fd_type, id);
> }
> @@ -785,3 +938,9 @@ int check_parent_page_xfer(int fd_type, long id)
> else
> return check_parent_local_xfer(fd_type, id);
> }
> +
> +int page_xfer_read_pages(struct page_xfer *xfer, unsigned long vaddr,
> + int nr, void *buf)
> +{
> + return xfer->read_pages ? xfer->read_pages(xfer, vaddr, nr, buf) : -1;
> +}
>