[CRIU] Process Migration using Sockets v2 - Patch 1/2

Rodrigo Bruno rbruno at gsd.inesc-id.pt
Tue Oct 13 07:06:31 PDT 2015


On Mon, 12 Oct 2015 14:44:00 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:

> On 10/12/2015 01:42 PM, Rodrigo Bruno wrote:
> 
> >>> Protocol details:
> >>>
> >>> 1. all data channels (i.e., sockets used to send/receive images) are unix sockets
> >>> except for the image-cache channel used to receive images from the image-proxy.
> >>>
> >>> CRIU Dump <-- UNIX SOCKET --> image-proxy ---> TCP SOCKET ---> image-cache ---> UNIX SOCKET ---> CRIU Restore
> >>> (notice the two-sided arrow between dump and image-proxy. This is necessary because 
> >>> predumps read image files from the image-proxy)
> >>>
> >>> 2. read (or get) and write (or put) connections use different server sockets, one for 
> >>> receiving read requests and another for write requests
> >>
> >> Why different? All sockets in Linux are full-duplex, you can use one connection in
> >> two directions safely.
> > 
> > No real reason. It is possible to do it with one (I have to specify in the header if I
> > want to read or write the image). I can do that.
> 
> Thanks :)
> 
> >>
> >>> 3. Read connections are initiated as follows:
> >>> 	a) the requester writes an image header
> >>> 	b) the requester reads an image header
> >>> 	c) the requester checks if the received image header matches.
> >>
> >> What image headers are exchanged?
> > 
> > In step a) goes the name and namespace of the image that needs to be read. In step b)
> > goes the same information or a special name (DUMP_FINISH) meaning that the image does
> > not exist. For example, the CRIU restore might check for files that will never exist 
> > for a specific process.
> > 
> > I send an image header back (step b) ) instead of closing the connection right away 
> > because the restore side could not decide if the connection went down because of an
> > error or if the image does not exist.
> 
> Ah, I see. Let's better introduce a "response" message to the protocol containing the
> "errno" field, this will allow to better understand the reasons of failures.

Okey. I can do that.

> 
> >> And why should receiver check for any match? What can happen in the remote side,
> >> that the requestor gets different object than it asked for?
> >>
> >>> 4. Write connections only write the image header before sending the actual content.
> >>>
> >>> 5. the image header is a protobuf object that contains two strings: the image name
> >>> and the image namespace (the namespace identifies the process that created the image).
> >>>
> >>> 6. only connections between the image-proxy and the image-cache (which are TCP) check
> >>> for file boundaries. Imagine the image-proxy starts to forward an image to image-cache:
> >>> 	a) write image-header
> >>> 	b) write image size (uint64_t)
> >>
> >> Why isn't the size in the header?
> > 
> > Good point. The image-header object is used in all connections (dump, proxy, cache, restore).
> > This image size is only used between the proxy and cache (because I know the size of the image).
> > 
> > Maybe I can have a different image header with the size?
> 
> I haven't yet fully understood the difference between criu-proxy/cache and proxy-cache
> protocols. But from what I have :) I think that it's worth having two different headers,
> one for "local" communications (criu-proxy/cache) and the other one for remote (cache-proxy).

Yes, one header with size for proxy->cache communication, and a header without size for
all others (dump<->proxy, and restore<-cache).

Both protocols are very similar. The only difference is that in cache<-proxy communication
I send the size of the file before the atual file.

Example:
open connection to write
write header (name + namespace) (protobuf object)
write size uint64_t
write image
close

The new version (discussed in these emails) will result in:
get the already open connection (inherited from the user)
write header (name + namespace + size) (protobuf object)
write image
write closing header (to replace the close)


> 
> >>
> >>> 	c) the actual content, then close
> >>
> >> So you use one connection per-image? That's too expensive, in our scenarios live
> >> migration will occur via pre-established connection and criu will not be allowed
> >> to open more.
> > 
> > Okey. Between dump and proxy, and restore and cache, is it okey to have 
> > multiple connections?
> 
> Absolutely.
> 
> > This is the most difficult point. If this is a problem,
> > then my solution is not enough since I cannot use a different fd for each file.
> 
> Yes, I agree, having multiple connections between criu and proxy/cache would
> be painful.
> 
> > Have only one connection between proxy and cache (the only connection that is not
> > local) is relatively easy. I can do that.
> 
> Thanks!
> 
> >>
> >>> When the connection is closed, the image-cache checks if it received the correct number of
> >>> bytes.
> >>>
> >>> In this patch, two new files are added:
> >>> - include/image-remote.h
> >>> 	This file presents the interface (set of functions) that are used by CRIU
> >>> to migrate processes using sockets. They provide ways for CRIU to open read and write
> >>> image connections, check is a particular file exists, etc.
> >>
> >> These connections are to proxy and/or cache, right?
> > 
> > Yes.
> > 
> >>
> >>> - image-remote.c
> >>> 	This file implements the functions described in include/image-remote.h
> >>
> >> OK, code comments are inline too.
> >>
> >>> Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
> >>> >From 1793ba84942e03900755378a6fc2642b8734c13d Mon Sep 17 00:00:00 2001
> >>> From: rodrigo-bruno <rbruno at gsd.inesc-id.pt>
> >>> Date: Fri, 9 Oct 2015 14:37:31 +0100
> >>> Subject: [PATCH] Process migration using sockets (1/2)
> >>>
> >>> ---
> >>>  Makefile.crtools            |   4 +
> >>>  cr-dedup.c                  |   1 +
> >>>  cr-dump.c                   |  16 +++
> >>>  crtools.c                   |  23 +++-
> >>>  image-remote.c              | 258 ++++++++++++++++++++++++++++++++++++++++++++
> >>>  image.c                     |  56 ++++++++--
> >>>  include/cr_options.h        |   1 +
> >>>  include/image-remote.h      |  87 +++++++++++++++
> >>>  include/protobuf-desc.h     |   2 +
> >>>  page-read.c                 |  50 +++++++--
> >>>  page-xfer.c                 |  31 +++++-
> >>>  protobuf-desc.c             |   1 +
> >>>  protobuf/Makefile           |   1 +
> >>>  protobuf/remote-image.proto |   8 ++
> >>>  14 files changed, 516 insertions(+), 23 deletions(-)
> >>>  create mode 100644 image-remote.c
> >>>  create mode 100644 include/image-remote.h
> >>>  create mode 100644 protobuf/remote-image.proto
> >>>
> >>> diff --git a/Makefile.crtools b/Makefile.crtools
> >>> index 80f704f..4a74fa8 100644
> >>> --- a/Makefile.crtools
> >>> +++ b/Makefile.crtools
> >>> @@ -6,6 +6,10 @@ obj-y	+= crtools.o
> >>>  obj-y	+= security.o
> >>>  obj-y	+= image.o
> >>>  obj-y	+= image-desc.o
> >>> +obj-y	+= image-remote.o
> >>> +obj-y	+= image-proxy.o
> >>> +obj-y	+= image-cache.o
> >>> +obj-y	+= image-remote-pvt.o
> >>>  obj-y	+= net.o
> >>>  obj-y	+= tun.o
> >>>  obj-y	+= proc_parse.o
> >>> diff --git a/cr-dedup.c b/cr-dedup.c
> >>> index b453c3e..77f0b39 100644
> >>> --- a/cr-dedup.c
> >>> +++ b/cr-dedup.c
> >>> @@ -9,6 +9,7 @@
> >>>  
> >>>  #define MAX_BUNCH_SIZE 256
> >>>  
> >>> +/* TODO - patch this for using remote migration using sockets */
> >>>  static int cr_dedup_one_pagemap(int pid);
> >>>  
> >>>  int cr_dedup(void)
> >>> diff --git a/cr-dump.c b/cr-dump.c
> >>> index 3af077b..f8d931a 100644
> >>> --- a/cr-dump.c
> >>> +++ b/cr-dump.c
> >>> @@ -83,6 +83,8 @@
> >>>  
> >>>  #include "asm/dump.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  static char loc_buf[PAGE_SIZE];
> >>>  
> >>>  static void close_vma_file(struct vma_area *vma)
> >>> @@ -1343,6 +1345,11 @@ int cr_pre_dump_tasks(pid_t pid)
> >>>  	LIST_HEAD(ctls);
> >>>  	struct parasite_ctl *ctl, *n;
> >>>  
> >>> +	if (opts.remote && push_namespace() < 0) {
> >>> +		pr_err("Failed to push image namespace.\n");
> >>> +		goto err;
> >>> +	}
> >>> +
> >>>  	if (!opts.track_mem) {
> >>>  		pr_info("Enforcing memory tracking for pre-dump.\n");
> >>>  		opts.track_mem = true;
> >>> @@ -1448,6 +1455,11 @@ int cr_dump_tasks(pid_t pid)
> >>>  	pr_info("Dumping processes (pid: %d)\n", pid);
> >>>  	pr_info("========================================\n");
> >>>  
> >>> +	if (opts.remote && push_namespace() < 0) {
> >>> +		pr_err("Failed to push image namepsace.\n");
> >>> +		goto err;
> >>> +	}
> >>> +
> >>>  	if (init_stats(DUMP_STATS))
> >>>  		goto err;
> >>>  
> >>> @@ -1550,6 +1562,10 @@ err:
> >>>  	if (disconnect_from_page_server())
> >>>  		ret = -1;
> >>>  
> >>> +	if (opts.remote) {
> >>> +		finish_remote_dump();
> >>
> >> Should the same thing be in cr_pre_dump_tasks()?
> > 
> > No. This call alerts the proxy and cache that no new files will be sent 
> > regarding this process. In pre-dumps we want the restore side to wait for 
> > all the dump files.
> 
> OK
> 
> >>
> >>> +	}
> >>> +
> >>>  	close_cr_imgset(&glob_imgset);
> >>>  
> >>>  	if (bfd_flush_images())
> >>> diff --git a/crtools.c b/crtools.c
> >>> index ea8b889..4f1d2ed 100644
> >>> --- a/crtools.c
> >>> +++ b/crtools.c
> >>> @@ -43,6 +43,8 @@
> >>>  
> >>>  #include "setproctitle.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  struct cr_options opts;
> >>>  
> >>>  void init_opts(void)
> >>> @@ -62,6 +64,8 @@ void init_opts(void)
> >>>  	opts.cpu_cap = CPU_CAP_DEFAULT;
> >>>  	opts.manage_cgroups = CG_MODE_DEFAULT;
> >>>  	opts.ps_socket = -1;
> >>> +	opts.addr = PROXY_FWD_HOST;
> >>> +	opts.ps_port = CACHE_PUT;
> >>>  	opts.ghost_limit = DEFAULT_GHOST_LIMIT;
> >>>  }
> >>>  
> >>> @@ -252,6 +256,9 @@ int main(int argc, char *argv[], char *envp[])
> >>>  		{ "freeze-cgroup",		required_argument,	0, 1068 },
> >>>  		{ "ghost-limit",		required_argument,	0, 1069 },
> >>>  		{ "irmap-scan-path",		required_argument,	0, 1070 },
> >>> +		{ "remote",			no_argument,		0, 1071 },
> >>> +		{ "image-cache",		no_argument,		0, 1072 },
> >>> +		{ "image-proxy",		required_argument,	0, 1073 },
> >>
> >> These two are not handled in the switch() below.
> > 
> > Sorry, I need help here. I copied the way page-server option is used. What am I 
> > missing?
> 
> This list is where the optional arguments are stored. Those in the command line
> are preceded with the "--". If you don't have --image-cache _option_ (I guess
> you don't it's an action :) ), then these two should simply not be here.

Okey. Thanks :)

> 
> >>
> >>>  		{ },
> >>>  	};
> >>>  
> >>> @@ -494,6 +501,9 @@ int main(int argc, char *argv[], char *envp[])
> >>>  			if (irmap_scan_path_add(optarg))
> >>>  				return -1;
> >>>  			break;
> >>> +		case 1071:
> >>> +			opts.remote = true;
> >>> +			break;
> >>>  		case 'M':
> >>>  			{
> >>>  				char *aux;
> >>> @@ -642,6 +652,12 @@ int main(int argc, char *argv[], char *envp[])
> >>>  	if (!strcmp(argv[optind], "page-server"))
> >>>  		return cr_page_server(opts.daemon_mode, -1) > 0 ? 0 : 1;
> >>>  
> >>> +	if (!strcmp(argv[optind], "image-cache"))
> >>> +		return image_cache(opts.ps_port);
> >>> +
> >>> +	if (!strcmp(argv[optind], "image-proxy"))
> >>> +		return image_proxy(opts.addr, opts.ps_port);
> >>> +
> >>>  	if (!strcmp(argv[optind], "service"))
> >>>  		return cr_service(opts.daemon_mode);
> >>>  
> >>> @@ -668,6 +684,8 @@ usage:
> >>>  "  criu page-server\n"
> >>>  "  criu service [<options>]\n"
> >>>  "  criu dedup\n"
> >>> +"  criu image-cache [<options>]\n"
> >>> +"  criu image-proxy [<options>]\n"
> >>>  "\n"
> >>>  "Commands:\n"
> >>>  "  dump           checkpoint a process/tree identified by pid\n"
> >>> @@ -680,6 +698,8 @@ usage:
> >>>  "  dedup          remove duplicates in memory dump\n"
> >>>  "  cpuinfo dump   writes cpu information into image file\n"
> >>>  "  cpuinfo check  validates cpu information read from image file\n"
> >>> +"  image-cache    launch image-cache, used for process live migration\n"
> >>> +"  image-proxy    launch image-proxy, used for process live migration\n"
> >>>  	);
> >>>  
> >>>  	if (usage_error) {
> >>> @@ -706,6 +726,7 @@ usage:
> >>>  "                        restore making it the parent of the restored process\n"
> >>>  "  --freeze-cgroup\n"
> >>>  "                        use cgroup freezer to collect processes\n"
> >>> +"  --remote              dump images directly to remote node\n"
> >>>  "\n"
> >>>  "* Special resources support:\n"
> >>>  "  -x|--" USK_EXT_PARAM "inode,.." "      allow external unix connections (optionally can be assign socket's inode that allows one-sided dump)\n"
> >>> @@ -762,7 +783,7 @@ usage:
> >>>  "                        when used on restore, as soon as page is restored, it\n"
> >>>  "                        will be punched from the image.\n"
> >>>  "\n"
> >>> -"Page/Service server options:\n"
> >>> +"Page/Service/image-cache/image-proxy server options:\n"
> >>>  "  --address ADDR        address of server or service\n"
> >>>  "  --port PORT           port of page server\n"
> >>>  "  -d|--daemon           run in the background after creating socket\n"
> >>> diff --git a/image-remote.c b/image-remote.c
> >>> new file mode 100644
> >>> index 0000000..79220f3
> >>> --- /dev/null
> >>> +++ b/image-remote.c
> >>> @@ -0,0 +1,258 @@
> >>> +#include <unistd.h>
> >>> +#include <stdlib.h>
> >>> +#include <sys/types.h>
> >>> +#include <sys/socket.h>
> >>> +#include <netinet/in.h>
> >>> +#include <netdb.h>
> >>> +#include "xmalloc.h"
> >>> +#include "criu-log.h"
> >>> +#include "image-remote.h"
> >>> +#include "image-remote-pvt.h"
> >>> +#include "protobuf/remote-image.pb-c.h"
> >>> +#include "protobuf-desc.h"
> >>> +
> >>> +#define PB_REMOTE_IMAGE_SIZE PATHLEN
> >>> +
> >>> +static char** parents = NULL;
> >>> +static int  parents_occ = 0;
> >>> +static char* namespace = NULL;
> >>
> >> This local variable conflicts (not to gcc, but to code reader) with the
> >> function arguments below. Please, give it a better name. The current_namespace
> >> one maybe?
> > 
> > Okey.
> > 
> >>
> >>> +
> >>> +int read_remote_image_connection(char* namespace, char* path)
> >>> +{
> >>> +	int sockfd;
> >>> +	char path_buf[PATHLEN], ns_buf[PATHLEN];
> >>> +
> >>> +	sockfd = setup_UNIX_client_socket(GET_IMG_PATH);
> >>> +	if (sockfd < 0) {
> >>> +	       return -1;
> >>> +	}
> >>> +
> >>> +	if (write_header(sockfd, namespace, path) < 0) {
> >>> +		pr_perror("Error writing header for %s:%s", path, namespace);
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	if (read_header(sockfd, ns_buf, path_buf) < 0) {
> >>> +		pr_perror("Error reading header for %s:%s", path, namespace);
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	if (!strncmp(path_buf, path, PATHLEN) && !strncmp(ns_buf, namespace, PATHLEN)) {
> >>> +		pr_info("Image cache does have %s:%s\n", path, namespace);
> >>> +		return sockfd;
> >>> +	}
> >>> +	else if (!strncmp(path_buf, DUMP_FINISH, PATHLEN)) {
> >>> +		pr_info("Image cache does not have %s:%s\n", path, namespace);
> >>> +		close(sockfd);
> >>> +		return -1;
> >>> +	}
> >>> +	else {
> >>> +		pr_perror("Image cache returned erroneous name %s\n", path);
> >>> +		close(sockfd);
> >>> +		return -1;
> >>> +	}
> >>> +}
> >>> +
> >>> +int write_remote_image_connection(char* namespace, char* path)
> >>> +{
> >>> +	int sockfd = setup_UNIX_client_socket(PUT_IMG_PATH);
> >>> +	if (sockfd < 0)
> >>> +		return -1;
> >>> +
> >>> +	if (write_header(sockfd, namespace, path) < 0) {
> >>> +		pr_perror("Error writing header for %s:%s", path, namespace);
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	pr_info("[write_remote_image_connection] fd=%d\n", sockfd);
> >>> +
> >>> +	return sockfd;
> >>> +}
> >>> +
> >>> +int finish_remote_dump()
> >>> +{
> >>> +	pr_info("Dump side is calling finish\n");
> >>> +	int fd = write_remote_image_connection(NULL_NAMESPACE, DUMP_FINISH);
> >>> +	if (fd == -1) {
> >>> +		pr_perror("Unable to open finish dump connection");
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	close(fd);
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +int skip_remote_bytes(int fd, unsigned long len)
> >>> +{
> >>
> >> Indentation got broken in this routine.
> >>
> >>> +    static char buf[4096];
> >>> +    int n = 0;
> >>> +    unsigned long curr = 0;
> >>> +
> >>> +    for(; curr < len; ) {
> >>> +	    n = read(fd, buf, MIN(len - curr, 4096));
> >>
> >> This bytes skipping looks incorrect. The skip_img_bytes() is called for pages.img
> >> on low-level snapshots to correctly forward the file position. Thus, if we get here
> >> with criu restore --remote, this means that the low-level images should be already
> >> here, on the node, and there's no need in reading the data in vain.
> > 
> > Sorry, I couldn't get your idea. This function replaces an lseek inside 
> > skip_pagemap_pages. Since lseek does not work in sockets (as far as I know),
> > I use this function.
> 
> My point is -- you shouldn't get to this place in case of remote connection at all.
> The skip_img_bytes is called in the situation when we have a stack of images and
> we read data from the top-most one and want to skip the duplicate data from the
> lower one(s).
> 
> Next, the restore happens when the whole stack is already obtained and if we get to
> the skip_img_bytes routine, this means that we want to skip bytes from some image
> namespace, excluding the top-most. But the non-top images cannot sit behind the socket,
> they have already been transferred.

Well, I added this method because it was not working and I found this could be a potential
bug of my solution (since lseek does not work for sockets).

I don't see much difference here between files and sockets. In remote mode, we have two
sockets (for example, one from a predump pages-1.img and other from a dump pages-1.img).

These two connections are retrieving data from the local image-cache. If you need to skip
bytes from one pages-1.img, you have to consume them from the socket.

Right?

> 
> >>
> >>> +	    if (n == 0) {
> >>> +		pr_perror("Unexpected end of stream (skipping %lx/%lx bytes)",
> >>> +			curr, len);
> >>> +		return -1;
> >>> +	    }
> >>> +	    else if (n > 0) {
> >>> +		    curr += n;
> >>> +	    }
> >>> +	    else {
> >>> +		pr_perror("Error while skipping bytes from stream (%lx/%lx)",
> >>> +			curr, len);
> >>> +		return -1;
> >>> +	    }
> >>> +    }
> >>> +    if ( curr != len) {
> >>> +	    pr_perror("Unable to skip the current number of bytes: %lx instead of %lx",
> >>> +		    curr, len);
> >>> +	    return -1;
> >>> +    }
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int write_namespaces()
> >>> +{
> >>> +	int n;
> >>> +	RemoteNamespaceEntry rn = REMOTE_NAMESPACE_ENTRY__INIT;
> >>> +
> >>> +	int sockfd = write_remote_image_connection(NULL_NAMESPACE, PARENT_IMG);
> >>> +	if (sockfd < 0) {
> >>> +		pr_perror("Unable to open namespace push connection");
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	rn.n_namespace_ = parents_occ;
> >>> +	rn.namespace_ = xmalloc(sizeof(char*) * rn.n_namespace_);
> >>> +	if (!rn.namespace_) {
> >>> +		pr_perror("Unable to allocate namespace array");
> >>> +		close(sockfd);
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	for (n = 0; n < rn.n_namespace_; n++) {
> >>> +		rn.namespace_[n] = xmalloc(sizeof(char) * PATHLEN);
> >>> +		if (!rn.namespace_[n]) {
> >>> +			pr_perror("Unable to allocate namespace");
> >>> +			close(sockfd);
> >>> +			return -1;
> >>> +		}
> >>> +		strncpy(rn.namespace_[n], parents[n], PATHLEN);
> >>> +	}
> >>> +
> >>> +	n = pb_write_obj(sockfd, &rn, PB_REMOTE_NAMESPACE);
> >>> +
> >>> +	for (n = 0; n < rn.n_namespace_; n++)
> >>> +		xfree(rn.namespace_[n]);
> >>> +	xfree(rn.namespace_);
> >>> +
> >>> +	close(sockfd);
> >>> +	return n;
> >>> +}
> >>> +
> >>> +static int read_namespaces()
> >>> +{
> >>> +	int n, sockfd;
> >>> +	RemoteNamespaceEntry* rn;
> >>> +	size_t namespaces = 0;
> >>> +	parents_occ = 0;
> >>> +
> >>> +	sockfd = read_remote_image_connection(NULL_NAMESPACE, PARENT_IMG);
> >>> +	if (sockfd < 0) {
> >>> +		pr_perror("Unable to open namespace get connection");
> >>> +		return -1;
> >>> +	}
> >>> +
> >>> +	n = pb_read_obj(sockfd, (void**)&rn, PB_REMOTE_NAMESPACE);
> >>> +	if (n)
> >>> +		namespaces = rn->n_namespace_;
> >>> +	else if (n < 0) {
> >>> +	    pr_perror("Unable to read remote namepsaces");
> >>> +	    close(sockfd);
> >>> +	    return n;
> >>> +	}
> >>> +
> >>> +	/* The extra char pointer is for the current namespace that is added
> >>> +	 if we are doing a dump or pre-dump operation. */
> >>> +	parents = malloc((namespaces + 1) * sizeof(char*));
> >>> +	if (!parents) {
> >>> +	    pr_perror("Unable to allocate parents");
> >>> +	    close(sockfd);
> >>> +	    return -1;
> >>> +	}
> >>> +
> >>> +	while (parents_occ < namespaces) {
> >>> +	    parents[parents_occ] = malloc(sizeof(char) * PATHLEN);
> >>> +	    if (!parents[parents_occ]) {
> >>> +		    pr_perror("Unable to allocate parent buffer");
> >>> +		    close(sockfd);
> >>> +		    return -1;
> >>> +	    }
> >>> +	    strncpy(parents[parents_occ], rn->namespace_[parents_occ], PATHLEN);
> >>> +	    parents_occ++;
> >>> +	}
> >>> +
> >>> +	free(rn);
> >>> +	close(sockfd);
> >>> +	return parents_occ;
> >>> +}
> >>> +
> >>> +int push_namespace()
> >>> +{
> >>> +    if (read_namespaces() < 0) {
> >>> +	    pr_perror("Failed to push namespace");
> >>> +	    return -1;
> >>> +    }
> >>
> >> You always exchange the full stack of namespace names. What for?
> > 
> > Because I think CRIU dump might need the previous predump images, and
> > the one before the it, and so on. Besides, CRIU restore will need to know 
> > all the history of predumps and dumps. Am I wrong?
> 
> Ah, I see. You need to get the whole stack here, indeed. So you do the
> read stack from proxy/cache, then add one namespace on top, then write
> stack back to proxy/cache. I doubt the latter step is required, the
> proxy/cache already know the whole stack but the last (new) item, why
> not just add a single entry into it?

I can do that. It will complicate a little bit the image-proxy/cache betcause these
components only work in two modes: read and write, not append. :) 

But I will do it.

> 
> And, if we do it, the whole push_namespace() would look like
> 
>    add_namespace(n); /* sends new namespace to proxy/cache */
>    read_namespaces(); /* get back the whole list including the newly added one */
> 
> Does this make sense?

Yes.

> 
> >>
> >>> +
> >>> +    parents[parents_occ] = malloc(sizeof(char) * PATHLEN);
> >>> +    if (!parents[parents_occ]) {
> >>> +	    pr_perror("Unable to allocate parent buffer");
> >>> +	    return -1;
> >>> +    }
> >>> +    strncpy(parents[parents_occ++], namespace, PATHLEN);
> >>> +
> >>> +    if (write_namespaces() < 0) {
> >>> +	    pr_perror("Failed to push namespaces");
> >>> +	    return -1;
> >>> +    }
> >>> +
> >>> +    return parents_occ;
> >>> +}
> >>> +
> >>> +void init_namespace(char* ns)
> >>> +{
> >>> +	namespace = ns;
> >>> +}
> >>> +
> >>> +char* get_current_namespace()
> >>> +{
> >>> +	return namespace;
> >>> +}
> >>> +
> >>> +int get_current_namespace_fd()
> >>
> >> Well, this is not an fd, right? It's an index used for parents[] array.
> >> How about renaming the fd to idx or nr?
> > 
> > Okey.
> > 
> >>
> >> Other than this, this routine is always used before calling the get_namespace()
> >> (except one place, I've commented it below). Why not merge these two into one?
> > 
> > Maybe I can use something like get_current_parent() since this is only used to
> > get or check if the parent namespace exists?
> 
> Yup, looks good.
> 
> >>
> >>> +{
> >>> +	int i = 0;
> >>> +
> >>> +	if (parents_occ == 0 && read_namespaces() < 0)
> >>> +		return -1;
> >>> +
> >>> +	for (; i < parents_occ; i++) {
> >>> +	    if(!strncmp(parents[i], namespace, PATHLEN))
> >>> +		return i;
> >>> +	}
> >>> +	pr_perror("Error, could not find current namespace fd");
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +char* get_namespace(int dfd)
> >>> +{
> >>> +	if (parents_occ == 0 && read_namespaces() < 0) {
> >>> +		pr_perror("No namespace in parent hierarchy");
> >>> +		return NULL;
> >>> +	}
> >>> +	if (dfd >= parents_occ || dfd < 0)
> >>> +		return NULL;
> >>> +	else
> >>> +		return parents[dfd];
> >>> +}
> >>> diff --git a/image.c b/image.c
> >>> index dc9d6a1..9fff629 100644
> >>> --- a/image.c
> >>> +++ b/image.c
> >>> @@ -12,6 +12,7 @@
> >>>  #include "protobuf.h"
> >>>  #include "protobuf/inventory.pb-c.h"
> >>>  #include "protobuf/pagemap.pb-c.h"
> >>> +#include "image-remote.h"
> >>>  
> >>>  bool fdinfo_per_id = false;
> >>>  bool ns_per_id = false;
> >>> @@ -308,16 +309,52 @@ static int do_open_image(struct cr_img *img, int dfd, int type, unsigned long of
> >>>  
> >>>  	flags = oflags & ~(O_NOBUF | O_SERVICE);
> >>>  
> >>> -	ret = openat(dfd, path, flags, CR_FD_PERM);
> >>> -	if (ret < 0) {
> >>> -		if (!(flags & O_CREAT) && (errno == ENOENT)) {
> >>> -			pr_info("No %s image\n", path);
> >>> +	if(opts.remote && (imgset_template[type].magic != STATS_MAGIC)) {
> >>
> >> Please, introduce the O_FORCE_LOCAL flag and put in into template in proper
> >> places. I also have a feeling that not only stats should be such, but the
> >> irmap cache too.
> > 
> > Right. Okey.
> > 
> >>
> >>> +		char* namespace = NULL;
> >>> +
> >>> +		/* if dfd is the service fd or is unset then we need the current
> >>> +		 * namesapce. Else we need a parent namespace. */
> >>
> >> I don't get this. The dfd is always a directory with images, why can it ever
> >> be -1?
> > 
> > Maybe I got it wrong, but I had the idea that the dfd could get here unset (-1).
> 
> Well, the -1 can happen into open_image(), but the latter would replace
> it with service fd. The do_open_iamge() always work on some valid fd.
> 
> >>
> >> Also you seem to mix the directory file descriptors with indices in remote
> >> namespaces stack. What's the guarantee that those will never intersect?
> > 
> > The service_fd is always a very high number, far away from application fds.
> > Namespace indexes start from zero. Therefore, I think it is very unlikely that
> > they will interset.
> 
> Oh... This should be documented as code comment in the place where you install
> the index instead of a real file descriptor into service fd.

Okey.

> 
> >>
> >>> +		if (dfd == get_service_fd(IMG_FD_OFF) || dfd == -1)
> >>> +			namespace = get_current_namespace();
> >>> +		else
> >>> +			namespace = get_namespace(dfd);
> >>> +
> >>> +		/* TODO - Find out what is the best solution for this file. */
> >>> +		if (imgset_template[type].magic == IRMAP_CACHE_MAGIC) {
> >>> +			ret = -1;
> >>> +		}
> >>> +		else if(namespace == NULL) {
> >>> +			ret = -1;
> >>> +		}
> >>> +		else if (flags == O_RDONLY) {
> >>> +			pr_info("do_open_remote_image RDONLY path=%s namespace=%s\n",
> >>> +			  path, namespace);
> >>> +			ret = read_remote_image_connection(namespace, path);
> >>> +		}
> >>> +		else {
> >>> +			pr_info("do_open_remote_image WDONLY path=%s namespace=%s\n",
> >>> +			  path, namespace);
> >>> +			ret = write_remote_image_connection(namespace, path);
> >>> +		}
> >>> +
> >>> +		if (ret < 0) {
> >>> +			pr_info("No %s (dfd=%d) image\n", path, dfd);
> >>>  			img->_x.fd = EMPTY_IMG_FD;
> >>>  			goto skip_magic;
> >>>  		}
> >>> -
> >>> -		pr_perror("Unable to open %s", path);
> >>> -		goto err;
> >>> +	}
> >>> +	else {
> >>> +		ret = openat(dfd, path, flags, CR_FD_PERM);
> >>> +		if (ret < 0) {
> >>> +			if (!(flags & O_CREAT) && (errno == ENOENT)) {
> >>> +				pr_info("No %s image\n", path);
> >>> +				img->_x.fd = EMPTY_IMG_FD;
> >>> +				goto skip_magic;
> >>> +			}
> >>> +
> >>> +			pr_perror("Unable to open %s", path);
> >>> +			goto err;
> >>> +		}
> >>>  	}
> >>>  
> >>>  	img->_x.fd = ret;
> >>> @@ -410,7 +447,10 @@ int open_image_dir(char *dir)
> >>>  	close(fd);
> >>>  	fd = ret;
> >>>  
> >>> -	if (opts.img_parent) {
> >>> +	if (opts.remote) {
> >>> +		init_namespace(dir);
> >>> +	}
> >>> +	else if (opts.img_parent) {
> >>>  		ret = symlinkat(opts.img_parent, fd, CR_PARENT_LINK);
> >>>  		if (ret < 0 && errno != EEXIST) {
> >>>  			pr_perror("Can't link parent snapshot");
> >>> diff --git a/include/cr_options.h b/include/cr_options.h
> >>> index af130dd..4c611ee 100644
> >>> --- a/include/cr_options.h
> >>> +++ b/include/cr_options.h
> >>> @@ -91,6 +91,7 @@ struct cr_options {
> >>>  	bool			enable_external_masters;
> >>>  	bool			aufs;		/* auto-deteced, not via cli */
> >>>  	bool			overlayfs;
> >>> +	bool			remote;
> >>>  	size_t			ghost_limit;
> >>>  	struct list_head	irmap_scan_paths;
> >>>  };
> >>> diff --git a/include/image-remote.h b/include/image-remote.h
> >>> new file mode 100644
> >>> index 0000000..20e0d36
> >>> --- /dev/null
> >>> +++ b/include/image-remote.h
> >>> @@ -0,0 +1,87 @@
> >>> +#include <limits.h>
> >>> +
> >>> +#ifndef IMAGE_REMOTE_H
> >>> +#define	IMAGE_REMOTE_H
> >>> +
> >>> +#define PATHLEN PATH_MAX
> >>> +#define DUMP_FINISH "DUMP_FINISH"
> >>> +#define PARENT_IMG "parent"
> >>> +#define NULL_NAMESPACE "null"
> >>> +
> >>> +/* This flag is used to enable local debugging (dump + proxy + cache + restore)
> >>> + * on the same machine. The idea is that both dump and restore processes are
> >>> + * orthogonal to this. */
> >>> +#define LOCAL_DEVEL 1
> >>> +
> >>> +/* Read image connections are sent to GET_IMG_PATH (local UNIX socket). Write
> >>> + * image connections are sent to PUT_IMG_PATH. Both image-proxy and image-cache
> >>> + * need to accept both read and write connections. All connetions are local (i.e.
> >>> + * using UNIX sockets) except for write connections to image-cache. These are
> >>> + * used to migrate image files from the image-proxy to the image-cache. These
> >>> + * connections (one per image file) use TCP. */
> >>> +
> >>> +#define GET_IMG_PATH "/tmp/criu-remote-get.sock"
> >>> +#define PUT_IMG_PATH "/tmp/criu-remote-put.sock"
> >>> +
> >>> +#define PROXY_GET LOCAL_DEVEL ? "/tmp/criu-proxy-get.sock" : GET_IMG_PATH
> >>> +#define PROXY_PUT PUT_IMG_PATH
> >>> +
> >>> +#define CACHE_GET GET_IMG_PATH
> >>> +/* Only the this channel is TCP.*/
> >>> +#define CACHE_PUT 9996
> >>> +
> >>> +#define PROXY_FWD_PORT CACHE_PUT
> >>> +#define PROXY_FWD_HOST "localhost"
> >>> +
> >>> +/* Warning: This may be problematic because of double evaluation... */
> >>> +#define MIN(x, y) (((x) < (y)) ? (x) : (y))
> >>> +
> >>> +/* Called by restore to get the fd correspondent to a particular path. This call
> >>> + * will block until the connection is received. */
> >>> +extern int read_remote_image_connection(char* namespace, char* path);
> >>> +
> >>> +/* Called by dump to create a socket connection to the restore side. The socket
> >>> + * fd is returned for further writing operations. */
> >>> +extern int write_remote_image_connection(char* namespace, char* path );
> >>> +
> >>> +/* Called by dump when everything is dumped. This function creates a new
> >>> + * connection with a special control name. The recover side uses it to ack that
> >>> + * no more files are coming. */
> >>> +extern int finish_remote_dump();
> >>> +
> >>> +/* Starts an image proxy daemon (dump side). It receives image files through
> >>> + * socket connections and forwards them to the image cache (restore side). */
> >>> +extern int image_proxy(char* cache_host, unsigned short cache_port);
> >>> +
> >>> +/* Starts an image cache daemon (restore side). It receives image files through
> >>> + * socket connections and caches them until they are requested by the restore
> >>> + * process. */
> >>> +extern int image_cache(unsigned short cache_port);
> >>> +
> >>> +/* Reads (discards) 'len' bytes from fd. This is used to emulate the function
> >>> + * lseek, which is used to advance the file needle. */
> >>> +int skip_remote_bytes(int fd, unsigned long len);
> >>> +
> >>> +/* To support iterative migration (multiple pre-dumps before the final dump
> >>> + * and subsequent restore, the concept of namespace is introduced. Each image
> >>> + * is tagged with one namespace and we build a hierarchy of namespaces to
> >>> + * represent the dependency between pagemaps. Currently, the images dir is
> >>> + * used as namespace when the operation is marked as remote. */
> >>> +
> >>> +/* Sets the current namespace */
> >>> +void init_namespace(char* ns);
> >>> +
> >>> +/* Returns a pointer to the char array containing the current namesapce. */
> >>> +char* get_current_namespace();
> >>> +
> >>> +/* Returns an integer (virtual fd) representing the current namespace. */
> >>> +int get_current_namespace_fd();
> >>> +
> >>> +/* Returns the namespace associated with the virtual fd (given as argument). */
> >>> +char* get_namespace(int dfd);
> >>> +
> >>> +/* Pushes the current namespace into the namespace hierarchy. The hierarchy is
> >>> + * read, modified, and written. */
> >>> +int push_namespace();
> >>> +
> >>> +#endif	/* IMAGE_REMOTE_H */
> >>> diff --git a/include/protobuf-desc.h b/include/protobuf-desc.h
> >>> index ab7e4f2..232b814 100644
> >>> --- a/include/protobuf-desc.h
> >>> +++ b/include/protobuf-desc.h
> >>> @@ -55,6 +55,8 @@ enum {
> >>>  	PB_CPUINFO,
> >>>  	PB_USERNS,
> >>>  	PB_NETNS,
> >>> +	PB_REMOTE_IMAGE,
> >>> +	PB_REMOTE_NAMESPACE,
> >>>  
> >>>  	/* PB_AUTOGEN_STOP */
> >>>  
> >>> diff --git a/page-read.c b/page-read.c
> >>> index 832c057..e690354 100644
> >>> --- a/page-read.c
> >>> +++ b/page-read.c
> >>> @@ -10,6 +10,8 @@
> >>>  #include "protobuf.h"
> >>>  #include "protobuf/pagemap.pb-c.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  #ifndef SEEK_DATA
> >>>  #define SEEK_DATA	3
> >>>  #define SEEK_HOLE	4
> >>> @@ -90,7 +92,12 @@ static void skip_pagemap_pages(struct page_read *pr, unsigned long len)
> >>>  		return;
> >>>  
> >>>  	pr_debug("\tpr%u Skip %lx bytes from page-dump\n", pr->id, len);
> >>> -	if (!pr->pe->in_parent)
> >>> +	if (!pr->pe->in_parent && opts.remote) {
> >>> +		if (skip_remote_bytes(img_raw_fd(pr->pi), len) < 0) {
> >>> +			pr_perror("Unable to skip remote bytes");
> >>> +		}
> >>> +	}
> >>> +	else if (!pr->pe->in_parent)
> >>>  		lseek(img_raw_fd(pr->pi), len, SEEK_CUR);
> >>>  	pr->cvaddr += len;
> >>>  }
> >>> @@ -132,6 +139,20 @@ new_pagemap:
> >>>  	}
> >>>  }
> >>>  
> >>> +/* This read_ is a stronger way to read something from a fd. */
> >>> +static size_t read_(int fd, char* buff, size_t size)
> >>
> >> Such things should go to util.c file.
> > 
> > Okey.
> > 
> >>
> >>> +{
> >>> +	size_t n = 0;
> >>> +	size_t curr = 0;
> >>> +	while(1) {
> >>> +		n  = read(fd, buff + curr, size - curr);
> >>> +		if(n < 1)
> >>> +			return n;
> >>> +		curr += n;
> >>> +		if(curr == size)
> >>> +			return size;
> >>> +	}
> >>> +}
> >>>  static int read_pagemap_page(struct page_read *pr, unsigned long vaddr, void *buf)
> >>>  {
> >>>  	int ret;
> >>> @@ -146,10 +167,11 @@ static int read_pagemap_page(struct page_read *pr, unsigned long vaddr, void *bu
> >>>  			return ret;
> >>>  	} else {
> >>>  		int fd = img_raw_fd(pr->pi);
> >>> -		off_t current_vaddr = lseek(fd, 0, SEEK_CUR);
> >>> +		/* TODO - lseek is not possible to sockets. Need to find a solution. */
> >>> +		off_t current_vaddr = opts.remote ? 0 : lseek(fd, 0, SEEK_CUR);
> >>>  		pr_debug("\tpr%u Read page %lx from self %lx/%"PRIx64"\n", pr->id,
> >>>  				vaddr, pr->cvaddr, current_vaddr);
> >>> -		ret = read(fd, buf, PAGE_SIZE);
> >>> +		ret = read_(fd, buf, PAGE_SIZE);
> >>>  		if (ret != PAGE_SIZE) {
> >>>  			pr_perror("Can't read mapping page %d", ret);
> >>>  			return -1;
> >>> @@ -195,9 +217,18 @@ static int try_open_parent(int dfd, int pid, struct page_read *pr, int pr_flags)
> >>>  	int pfd, ret;
> >>>  	struct page_read *parent = NULL;
> >>>  
> >>> -	pfd = openat(dfd, CR_PARENT_LINK, O_RDONLY);
> >>> -	if (pfd < 0 && errno == ENOENT)
> >>> -		goto out;
> >>> +	if(opts.remote) {
> >>> +		/* dfd is either the service fd or an image namespace */
> >>> +		pfd = dfd == get_service_fd(IMG_FD_OFF) ? get_current_namespace_fd() : dfd;
> >>> +		pfd -= 1;
> >>> +		if(get_namespace(pfd) == NULL)
> >>
> >> This is the place where get_current_namespace_fd() can be used independently from
> >> get_namespace(). Can you eplain this place, please?
> > 
> > This call to get_namespace(pfd) checks if the namespace pdf exists. In other words, 
> > this is checking if there is any parent of the current namespace. The lower the pfd
> > is, the older snapshot we are looking for.
> > 
> > This place is different from the other places where this function is called because
> > we have dfd, which can already point to a parent namespace. Therefore I cannot simply 
> > use something like get_current_parent.
> 
> Hmm, why not, just for the lower-most lavel you can report "parent is OK", since the
> restult of this function is not used as index or anything like that.

Okey, I will try to simplify this.

> 
> >>
> >>> +			goto out;
> >>> +	}
> >>> +	else {
> >>> +		pfd = openat(dfd, CR_PARENT_LINK, O_RDONLY);
> >>> +		if (pfd < 0 && errno == ENOENT)
> >>> +			goto out;
> >>> +	}
> >>>  
> >>>  	parent = xmalloc(sizeof(*parent));
> >>>  	if (!parent)
> >>> @@ -211,8 +242,8 @@ static int try_open_parent(int dfd, int pid, struct page_read *pr, int pr_flags)
> >>>  		xfree(parent);
> >>>  		parent = NULL;
> >>>  	}
> >>> -
> >>> -	close(pfd);
> >>> +	if(!opts.remote)
> >>> +		close(pfd);
> >>>  out:
> >>>  	pr->parent = parent;
> >>>  	return 0;
> >>> @@ -220,7 +251,8 @@ out:
> >>>  err_free:
> >>>  	xfree(parent);
> >>>  err_cl:
> >>> -	close(pfd);
> >>> +	if(!opts.remote)
> >>> +		close(pfd);
> >>>  	return -1;
> >>>  }
> >>>  
> >>> diff --git a/page-xfer.c b/page-xfer.c
> >>> index 7465ed8..d2a49de 100644
> >>> --- a/page-xfer.c
> >>> +++ b/page-xfer.c
> >>> @@ -17,6 +17,8 @@
> >>>  #include "protobuf.h"
> >>>  #include "protobuf/pagemap.pb-c.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  struct page_server_iov {
> >>>  	u32	cmd;
> >>>  	u32	nr_pages;
> >>> @@ -742,12 +744,22 @@ static int open_page_local_xfer(struct page_xfer *xfer, int fd_type, long id)
> >>>  		int ret;
> >>>  		int pfd;
> >>>  
> >>> -		pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> >>> -		if (pfd < 0 && errno == ENOENT)
> >>> -			goto out;
> >>> +		if (opts.remote) {
> >>> +			pfd = get_current_namespace_fd() - 1;
> >>> +			if (get_namespace(pfd) == NULL)
> >>> +				goto out;
> >>> +		}
> >>> +		else {
> >>> +			pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> >>> +			if (pfd < 0 && errno == ENOENT)
> >>> +				goto out;
> >>> +		}
> >>>  
> >>>  		xfer->parent = xmalloc(sizeof(*xfer->parent));
> >>> -		if (!xfer->parent) {
> >>> +		if (!xfer->parent && opts.remote) {
> >>> +			return -1;
> >>> +		}
> >>> +		else if (!xfer->parent) {
> >>>  			close(pfd);
> >>>  			return -1;
> >>>  		}
> >>> @@ -757,7 +769,8 @@ static int open_page_local_xfer(struct page_xfer *xfer, int fd_type, long id)
> >>>  			pr_perror("No parent image found, though parent directory is set");
> >>>  			xfree(xfer->parent);
> >>>  			xfer->parent = NULL;
> >>> -			close(pfd);
> >>> +			if(!opts.remote)
> >>> +				close(pfd);
> >>>  			goto out;
> >>>  		}
> >>>  		close(pfd);
> >>> @@ -807,6 +820,12 @@ int check_parent_local_xfer(int fd_type, int id)
> >>>  }
> >>>  
> >>> +int check_parent_remote_xfer()
> >>> +{
> >>> +	int pfd = get_current_namespace_fd() - 1;
> >>> +	return get_namespace(pfd) == NULL ? 0 : 1;
> >>> +}
> >>> +
> >>>  static int page_server_check_parent(int sk, struct page_server_iov *pi)
> >>>  {
> >>>  	int type, ret;
> >>> @@ -852,6 +871,8 @@ int check_parent_page_xfer(int fd_type, long id)
> >>>  {
> >>>  	if (opts.use_page_server)
> >>>  		return check_parent_server_xfer(fd_type, id);
> >>> +	else if (opts.remote)
> >>> +		return check_parent_remote_xfer();
> >>>  	else
> >>>  		return check_parent_local_xfer(fd_type, id);
> >>>  }
> >>> diff --git a/protobuf-desc.c b/protobuf-desc.c
> >>> index 873fd3b..2b58aab 100644
> >>> --- a/protobuf-desc.c
> >>> +++ b/protobuf-desc.c
> >>> @@ -61,6 +61,7 @@
> >>>  #include "protobuf/timerfd.pb-c.h"
> >>>  #include "protobuf/cpuinfo.pb-c.h"
> >>>  #include "protobuf/userns.pb-c.h"
> >>> +#include "protobuf/remote-image.pb-c.h"
> >>>  
> >>>  struct cr_pb_message_desc cr_pb_descs[PB_MAX];
> >>>  
> >>> diff --git a/protobuf/Makefile b/protobuf/Makefile
> >>> index 0b11852..f685fc6 100644
> >>> --- a/protobuf/Makefile
> >>> +++ b/protobuf/Makefile
> >>> @@ -48,6 +48,7 @@ proto-obj-y	+= tty.o
> >>>  proto-obj-y	+= file-lock.o
> >>>  proto-obj-y	+= rlimit.o
> >>>  proto-obj-y	+= pagemap.o
> >>> +proto-obj-y	+= remote-image.o
> >>>  proto-obj-y	+= siginfo.o
> >>>  proto-obj-y	+= rpc.o
> >>>  proto-obj-y	+= ext-file.o
> >>> diff --git a/protobuf/remote-image.proto b/protobuf/remote-image.proto
> >>> new file mode 100644
> >>> index 0000000..d2fad2e
> >>> --- /dev/null
> >>> +++ b/protobuf/remote-image.proto
> >>> @@ -0,0 +1,8 @@
> >>> +message remote_image_entry {
> >>> +	required string name		= 1;
> >>> +	required string namespace	= 2;
> >>> +}
> >>> +
> >>> +message remote_namespace_entry {
> >>> +	repeated string namespace	= 1;
> >>> +}
> >>>
> >>
> >> -- Pavel
> > 
> > 
> 


-- 
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>


More information about the CRIU mailing list