[CRIU] Process Migration Using Sockets - PATCH

Tue Sep 8 17:30:41 PDT 2015

On Tue, 8 Sep 2015 17:55:21 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:

> On 09/08/2015 01:47 AM, Rodrigo Bruno wrote:
> > Hi, 
> > 
> > I will answer inline, its easier.
> > 
> > On Mon, 7 Sep 2015 13:34:54 +0300
> > Pavel Emelyanov <xemul at parallels.com> wrote:
> > 
> >> On 09/02/2015 11:24 PM, Rodrigo Bruno wrote:
> >>
> >> Ridrigo, thanks for the patch. Sorry for the late response, I was busy with the
> >> 1.7 release. Now it's finished and we have some time for new cool features :)
> >>
> >> Find my comments inline.
> >>
> >>> The patch is listed below. The idea is to migrate processes without using disk-backed
> >>> images. Files used by these processes still need to be shared (NFS for example) to 
> >>> enable full live migration. In future these files could also be transferred using 
> >>> sockets.
> >>>
> >>> Two new entities are introduced: the image-proxy, and the image-cache. The image-proxy
> >>> receives the image files from the dump process and forwards them to the image-cache. 
> >>> The image-cache waits for requests from  the restore process.
> >>
> >> Can you shed a little bit more light on this: what's the way image-proxy gets images
> >> from criu dump and what's the way criu restore gets images from image-cache? Are these
> >> just unix sockets?
> > 
> > Yes. I patch (introduce an if) right before the image open. If the "opts.remote" is set
> > (this is a command line arg), I redirect the open to my new functions. Then I simply open
> > a socket and return the file descriptor (which is then used by the existing code). 
> 
> OK, so image-proxy is a multiplexer that merges several image flows into one
> network flow. This is clear.
> 
> The image-cache as I see it a page-server++ that accepts not only pages images,
> but arbitrary ones. Then it feeds the images via socket into criu restore. So
> my question is -- why not make image-cache put images into tmpfs mount and then
> use regular criu restore?

It could be done. However, 
1) the modifications introduced in "open_image_at" (image.c) to open remote images 
instead of disk-backed also work for the restore side, so no more modifications needed; 
2) the image-cache and image-proxy share most of the code, only a couple of lines differ. 
The two components (cache and proxy) are very similar,being the only difference the fact 
that the proxy forwards the images. Both components have one port for receiving and 
caching images and one port for sending the cached images (the proxy needs to cache 
images so that dumps/pre-dumps that need to open previous pagemaps and pages are able 
to do so).

> 
> >>
> >>> Example:
> >>>
> >>> Target Node:
> >>> criu image-cache -vvv -o /tmp/image-cache.log --port <cache port> < /dev/null &
> >>> sudo criu restore -D /tmp/dump -d -vvvv -o /tmp/restore.log  --remote && echo OK
> >>>
> >>> Source Node:
> >>> criu image-proxy -vvv -o /tmp/image-proxy.log --port <cache port> --address <target node> < /dev/null &
> >>> sudo criu pre-dump -D /tmp/pre-dump -d -vvvv -o /tmp/pre-dump.log -t $pid --remote
> >>> sudo criu dump -D /tmp/dump -d -vvvv -o /tmp/dump.log -t $pid --remote  --prev-images-dir /tmp/pre-dump --track-mem
> >>>
> >>> The code is also available at https://github.com/rodrigo-bruno/criu (forked from CRIU).
> >>>
> >>> You can also test it locally. I have been using this to migrate OpenJDK processes.
> >>> If you ever decide to use this code, I would be glad to help, provide bug fixes, etc.
> >>
> >> I'd also appreciate if you split the patch into set. First there must go changes in
> >> the existing criu code that prepare one for easier further patches, then the component
> >> by component new stuff. E.g. first goes image-cache, then image-proxy, then changes
> >> in the dump code to support proxy, then changes in the restorer code to support cache.
> > 
> > Okey, I can do that.
> 
> Awesome!
> 
> >>>  int cr_dedup(void)
> >>>  {
> >>>  	int close_ret, ret = 0;
> >>> diff -uprN criu-source/cr-dump.c criu-patch/cr-dump.c
> >>> --- criu-source/cr-dump.c	2015-09-01 20:34:37.050773528 +0100
> >>> +++ criu-patch/cr-dump.c	2015-09-02 02:37:15.993970004 +0100
> >>> @@ -1550,6 +1552,10 @@ err:
> >>>  	if (disconnect_from_page_server())
> >>>  		ret = -1;
> >>>  
> >>> +        if (opts.remote) {
> >>
> >> Something has happened with tab indentation.
> >>
> >>> +            finish_remote_dump();
> >>> +        }
> >>> +
> >>>  	close_cr_imgset(&glob_imgset);
> >>>  
> >>>  	if (bfd_flush_images())
> >>
> >>> diff: criu-source/crtools: No such file or directory
> >>> diff: criu-patch/crtools: No such file or directory
> >>> diff -uprN criu-source/crtools.c criu-patch/crtools.c
> >>> --- criu-source/crtools.c	2015-09-01 20:34:37.054773617 +0100
> >>> +++ criu-patch/crtools.c	2015-09-02 03:05:47.229581153 +0100
> >>> @@ -42,6 +42,8 @@
> >>>  
> >>>  #include "setproctitle.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  struct cr_options opts;
> >>>  
> >>>  void init_opts(void)
> >>> @@ -60,6 +62,8 @@ void init_opts(void)
> >>>  	opts.cpu_cap = CPU_CAP_DEFAULT;
> >>>  	opts.manage_cgroups = CG_MODE_DEFAULT;
> >>>  	opts.ps_socket = -1;
> >>> +	opts.addr = PROXY_FWD_HOST;
> >>> +	opts.ps_port = CACHE_PUT_PORT;
> >>
> >> You reuse the existing opts fields. How would this correlate with the --page-server
> >> code?
> > 
> > Well, I didn't use the page server. These options are reused just for simplicity. 
> > When I use a remote dump or remote restore, I assume that the page server is not
> > being used, otherwise this would be a problem because they both use the same opts
> > fields.
> 
> OK, then if we go forward with new options we should first make sure that 
> page server cannot be used together and start deprecating it. This would 
> imply updating at least p.haul and docs.
> 
> >>
> >>>  	opts.ghost_limit = DEFAULT_GHOST_LIMIT;
> >>>  }
> >>>  
> >>
> >>> diff -uprN criu-source/image.c criu-patch/image.c
> >>> --- criu-source/image.c	2015-09-01 20:34:37.058773708 +0100
> >>> +++ criu-patch/image.c	2015-09-02 02:57:48.502419478 +0100
> >>> @@ -336,6 +347,72 @@ static int do_open_image(struct cr_img *
> >>>  	if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> >>>  		goto skip_magic;
> >>>  
> >>> +	if (flags == O_RDONLY) {
> >>> +		ret = img_check_magic(img, oflags, type, path);
> >>> +        }
> >>> +	else {
> >>> +		ret = img_write_magic(img, oflags, type);
> >>> +        }
> >>> +	if (ret)
> >>> +		goto err;
> >>> +
> >>> +skip_magic:
> >>> +	return 0;
> >>> +
> >>> +err:
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +static int do_open_remote_image(struct cr_img *img, int dfd, int type, unsigned long oflags, char *path)
> >>> +{
> >>> +	int ret, flags;
> >>> +
> >>> +	flags = oflags & ~(O_NOBUF | O_SERVICE);
> >>> +        
> >>> +        if(dfd == get_service_fd(IMG_FD_OFF) || dfd == -1)
> >>> +            dfd = get_current_namespace_fd();
> >>
> >> I didn't quite get the idea of namespaces. Can you descibe it in more details, please?
> > 
> > Namespaces are used to tag images according to their hierarchy. When you use pre-dumps, you
> > create a symlink ("parent") pointing to the directory where the pre-dump was stored.
> 
> Where previous pre-dumps were stored, yes. Also note, that parent can point to full
> dumps as well.
> 
> > This
> > creates a relation between the current dump/pre-dump and the previous one. You then open this
> > link and use the file descriptor to continue operating (reading the pagemap of the previous
> > pre-dump for example).
> > 
> > I can't do this because images are stored in the image-proxy (memory cache). I solve this in 
> > a kind of naive way. When a dump/pre-dump starts, I take the image working dir (given by the 
> > user) and tag all images produced with the namespace (the working dir). When I see that a 
> > dump/pre-dump is reffering a previous image dir I inform the image proxy that the current 
> > working dir is a child of the one given as "--prev-images-dir". 
> 
> Ah, so the namespace is just a image directory name used to tag images in cache, right?

Yes.

> 
> > Then, when criu tries to open the "parent" link, it will (only when it is in remote mode) an 
> > identifier (not a real descriptor) that represents the parent namespace. This identifier is 
> > used to indicate the desired namespace whe the image is openned.
> > 
> >>
> >>> +        
> >>> +        // TODO - fix this. Find out what is the purpose of this file.
> >>> +        if(!strcmp("irmap-cache", path)) {
> >>> +            ret = -1;
> >>> +        }
> >>> +        else if(get_namespace(dfd) == NULL) {
> >>> +            ret = -1;
> >>> +        }
> >>> +        else if (flags == O_RDONLY) {
> >>> +            pr_info("do_open_remote_image RDONLY path=%s namespace=%s\n", 
> >>> +                    path, get_namespace(dfd));
> >>> +            ret = get_remote_image_connection(get_namespace(dfd), path);
> >>> +        }
> >>> +        else {
> >>> +            pr_info("do_open_remote_image WDONLY path=%s namespace=%s\n", 
> >>> +                    path, get_namespace(dfd));
> >>> +            ret = open_remote_image_connection(get_namespace(dfd), path);
> >>> +        }
> >>> +        
> >>> +        if (ret < 0) {
> >>> +            pr_info("No %s (dfd=%d) image\n", path, dfd);
> >>> +            img->_x.fd = EMPTY_IMG_FD;
> >>> +            goto skip_magic;
> >>> +	}
> >>> +        
> >>> +
> >>> +	img->_x.fd = ret;
> >>> +	if (oflags & O_NOBUF)
> >>> +		bfd_setraw(&img->_x);
> >>> +	else {
> >>> +		if (flags == O_RDONLY)
> >>> +			ret = bfdopenr(&img->_x);
> >>> +		else
> >>> +			ret = bfdopenw(&img->_x);
> >>> +
> >>> +		if (ret)
> >>> +			goto err;
> >>> +	}
> >>> +
> >>> +	if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> >>> +		goto skip_magic;
> >>> +
> >>>  	if (flags == O_RDONLY)
> >>>  		ret = img_check_magic(img, oflags, type, path);
> >>>  	else
> >>
> >>> diff -uprN criu-source/image-remote.c criu-patch/image-remote.c
> >>> --- criu-source/image-remote.c	1970-01-01 01:00:00.000000000 +0100
> >>> +++ criu-patch/image-remote.c	2015-09-02 02:18:33.548099686 +0100
> >>> @@ -0,0 +1,281 @@
> >>> +#include <unistd.h>
> >>> +#include <stdlib.h>
> >>> +#include <sys/types.h> 
> >>> +#include <sys/socket.h>
> >>> +#include <netinet/in.h>
> >>> +#include <netdb.h>
> >>> +
> >>> +#include <pthread.h>
> >>> +#include <semaphore.h>
> >>> +
> >>> +#include "criu-log.h"
> >>> +#include "image-remote.h"
> >>> +
> >>> +// TODO - fix space limitation
> >>> +static char parents[PATHLEN][PATHLEN]; 
> >>> +static int  parents_occ = 0;
> >>> +static char* namespace = NULL;
> >>> +// TODO - not used for now. It will be used if we implement a shared cache and proxy.
> >>> +static char* parent = NULL; 
> >>> +
> >>> +int setup_local_client_connection(int port) 
> >>> +{
> >>> +        int sockfd;
> >>> +        struct sockaddr_in serv_addr;
> >>> +        struct hostent *server;
> >>> +
> >>> +        sockfd = socket(AF_INET, SOCK_STREAM, 0);
> >>> +        if (sockfd < 0) {
> >>> +                pr_perror("Unable to open remote image socket to img cache");
> >>> +                return -1;
> >>> +        }
> >>> +
> >>> +        server = gethostbyname(DEFAULT_HOST);
> >>> +        if (server == NULL) {
> >>> +                pr_perror("Unable to get host by name (%s)", DEFAULT_HOST);
> >>> +                return -1;
> >>> +        }
> >>> +
> >>> +        bzero((char *) &serv_addr, sizeof (serv_addr));
> >>> +        serv_addr.sin_family = AF_INET;
> >>> +        bcopy((char *) server->h_addr,
> >>> +              (char *) &serv_addr.sin_addr.s_addr,
> >>> +              server->h_length);
> >>> +        serv_addr.sin_port = htons(port);
> >>> +
> >>> +        if (connect(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
> >>> +                pr_perror("Unable to connect to remote restore host %s", DEFAULT_HOST);
> >>> +                return -1;
> >>> +        }
> >>> +
> >>> +        return sockfd;
> >>> +}
> >>> +
> >>> +int write_header(int fd, char* namespace, char* path)
> >>
> >> You seem to be using text-based protocol for images transfers, don't you?
> > 
> > Yes... When I read or write an image I always write the name and namespace of the 
> > image before its actual content. This is used to identify the correct images when
> > they are requested later on.
> 
> Can you describe your protocol then in more details. Why I don't quite understand
> is how you mix text info with binary data (images and pages) and how you define
> borders between objects.

Yes. Two communications can exist: writing and reading remote images. They both use the
same protocol:

1. open socket on read or write port (both cache and proxy have one of each)
2. write image pathname (32 bytes)
3. write image namespace (32 bytes)
4. read image pathname (32 bytes)
5. read image namesapce (32 bytes)

The 32 bytes is a constant. It is a limitation on the size of the names and namespaces. It 
could be solved by adding a first field (size).

Steps 4 and 5 are used to check if the image exists (if it doesn't, an error is read from the 
socket).

Then, I simply return the socket file descriptor and let criu use it for writing (dump) or 
reading (restore).

Note that I do not unpack the objects being sent (pagemap entries for example). Neither I 
check if the file was correctly sent. When the socket FD is closed I assume that the 
operation (reading or writing) is complete and I let the other side (restore for example)
to check if the image is correct.

> 
> Another thing about this protocol -- in page-server protocol we use hand-made binary
> headers (packed structs) that precede individual extents (sets of pages). Now I think
> it was mistake to use fully hand-made protocol. Why not stick with some more standard
> one, at least use protobuf-compiled headers. This would help us to at least solve
> the extendability problem.

I agree that hand-made protocols are usually a bad idea compared to using standard 
approaches. Including the name and namespace of the images inside the actual
content of the images would be enough to avoid this hand-made protocol. Then I would
only need to read the header.

> 
> >>
> >>> +{
> >>> +        if (write(fd, path, PATHLEN) < 1) {
> >>> +                pr_perror("Unable to send path to remote image connection");
> >>> +                return -1;
> >>> +        }
> >>> +
> >>> +        if (write(fd, namespace, PATHLEN) < 1) {
> >>> +                pr_perror("Unable to send namespace to remote image connection");
> >>> +                return -1;
> >>> +        } 
> >>> +        return 0;
> >>> +}
> >>> +
> >>
> >>> diff -uprN criu-source/page-read.c criu-patch/page-read.c
> >>> --- criu-source/page-read.c	2015-09-01 20:34:37.082774260 +0100
> >>> +++ criu-patch/page-read.c	2015-09-02 02:21:29.616164017 +0100
> >>> @@ -10,6 +10,8 @@
> >>>  #include "protobuf.h"
> >>>  #include "protobuf/pagemap.pb-c.h"
> >>>  
> >>> +#include "image-remote.h"
> >>> +
> >>>  #ifndef SEEK_DATA
> >>>  #define SEEK_DATA	3
> >>>  #define SEEK_HOLE	4
> >>> @@ -90,8 +92,17 @@ static void skip_pagemap_pages(struct pa
> >>>  		return;
> >>>  
> >>>  	pr_debug("\tpr%u Skip %lx bytes from page-dump\n", pr->id, len);
> >>> -	if (!pr->pe->in_parent)
> >>> -		lseek(img_raw_fd(pr->pi), len, SEEK_CUR);
> >>> +	if (!pr->pe->in_parent) {
> >>> +            if(opts.remote) {
> >>> +                    if(skip_remote_bytes(img_raw_fd(pr->pi), len) < 0)
> >>> +                            pr_perror("Unable to seek remote bytes");
> >>> +            }
> >>> +            else {
> >>> +                    if(lseek(img_raw_fd(pr->pi), len, SEEK_CUR) < 0)
> >>> +                            pr_perror("Unable to lseek");
> >>> +            }
> >>> +            	
> >>> +        }
> >>
> >> The page-read engine is already modularized. Don't introduce if()-s in the
> >> existing code, just add new set of options. The open_page_read() selects
> >> one of them.
> > 
> > Okay, I will do that.
> > 
> >>
> >>>  	pr->cvaddr += len;
> >>>  }
> >>>  
> >>
> >>> diff -uprN criu-source/page-xfer.c criu-patch/page-xfer.c
> >>> --- criu-source/page-xfer.c	2015-09-01 20:34:37.082774260 +0100
> >>> +++ criu-patch/page-xfer.c	2015-09-02 02:21:44.968518366 +0100
> >>> @@ -728,13 +730,21 @@ static int open_page_local_xfer(struct p
> >>>  		int ret;
> >>>  		int pfd;
> >>>  
> >>> -		pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> >>> -		if (pfd < 0 && errno == ENOENT)
> >>> -			goto out;
> >>> +		if(opts.remote) {
> >>> +                        pfd = get_current_namespace_fd() - 1;
> >>> +                        if(get_namespace(pfd) == NULL)
> >>> +                                goto out;
> >>> +                }
> >>> +                else {
> >>> +                        pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> >>> +                        if (pfd < 0 && errno == ENOENT)
> >>> +                                goto out;
> >>> +                }
> >>
> >> We already have network transfer for pages data. How does this correlate with
> >> the new mode you introduce?
> > 
> > Currently it doesn't correlate at all. I didn't use the page server because it 
> > only works with pages (as far as I know). Extending it to support all types of
> > images seemed more difficult than extending the disk-backed images.
> > 
> > Most of the code that bridges the existing code with the new code (basically the 
> > functions in image-remote.h) is inserted inside if conditions. In order to improve
> > this, perhaps it would be good to abstact the image backend: files, sockets.
> 
> OK, so as I told above -- I'm OK to deprecate the page-server with newer image
> proxy and cache.

Since this is more generic, in future, if you need/want, it will be easier to include 
application files migration (to avoid NFS shares for example).

> 
> -- Pavel

-- 
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>