[CRIU] Process Migration Using Sockets - PATCH

Rodrigo Bruno rbruno at gsd.inesc-id.pt
Mon Sep 7 15:47:05 PDT 2015


Hi, 

I will answer inline; it's easier.

On Mon, 7 Sep 2015 13:34:54 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:

> On 09/02/2015 11:24 PM, Rodrigo Bruno wrote:
> 
> Rodrigo, thanks for the patch. Sorry for the late response, I was busy with the
> 1.7 release. Now it's finished and we have some time for new cool features :)
> 
> Find my comments inline.
> 
> > The patch is listed below. The idea is to migrate processes without using disk-backed
> > images. Files used by these processes still need to be shared (NFS for example) to 
> > enable full live migration. In future these files could also be transferred using 
> > sockets.
> > 
> > Two new entities are introduced: the image-proxy, and the image-cache. The image-proxy
> > receives the image files from the dump process and forwards them to the image-cache. 
> > The image-cache waits for requests from  the restore process.
> 
> Can you shed a little bit more light on this: what's the way image-proxy gets images
> from criu dump and what's the way criu restore gets images from image-cache? Are these
> just unix sockets?

Yes. I add an if right before the image is opened. If "opts.remote" is set
(this is a command-line argument), I redirect the open to my new functions. These simply
open a socket and return its file descriptor, which the existing code then uses.
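
To make the redirect concrete, here is a minimal sketch of the idea. It is not the
code from the patch: sketch_opts, sketch_connect_local() and sketch_open_image_fd()
are hypothetical names, and the loopback address and port handling are assumptions
made only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant part of criu's options. */
struct sketch_opts {
	bool remote;         /* set by --remote */
	unsigned short port; /* port of the local image-proxy or image-cache */
};

/* Connect to a service on the loopback interface; returns a socket fd or -1. */
int sketch_connect_local(unsigned short port)
{
	struct sockaddr_in addr;
	int sk;

	sk = socket(AF_INET, SOCK_STREAM, 0);
	if (sk < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(port);

	if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(sk);
		return -1;
	}

	return sk;
}

/*
 * The "if" in front of the image open: the caller gets a descriptor
 * either way and does not need to know what is behind it.
 */
int sketch_open_image_fd(const struct sketch_opts *o, int dfd,
			 const char *path, int flags)
{
	assert(o != NULL && path != NULL);

	if (o->remote)
		return sketch_connect_local(o->port);

	return openat(dfd, path, flags, 0600);
}
```

In the real patch the image name and namespace are also sent over the socket
before any data; the sketch leaves that out.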

> 
> > Example:
> > 
> > Target Node:
> > criu image-cache -vvv -o /tmp/image-cache.log --port <cache port> < /dev/null &
> > sudo criu restore -D /tmp/dump -d -vvvv -o /tmp/restore.log  --remote && echo OK
> > 
> > Source Node:
> > criu image-proxy -vvv -o /tmp/image-proxy.log --port <cache port> --address <target node> < /dev/null &
> > sudo criu pre-dump -D /tmp/pre-dump -d -vvvv -o /tmp/pre-dump.log -t $pid --remote
> > sudo criu dump -D /tmp/dump -d -vvvv -o /tmp/dump.log -t $pid --remote  --prev-images-dir /tmp/pre-dump --track-mem
> > 
> > The code is also available at https://github.com/rodrigo-bruno/criu (forked from CRIU).
> > 
> > You can also test it locally. I have been using this to migrate OpenJDK processes.
> > If you ever decide to use this code, I would be glad to help, provide bug fixes, etc.
> 
> I'd also appreciate it if you split the patch into a set. First there must go changes in
> the existing criu code that prepare one for easier further patches, then the component
> by component new stuff. E.g. first goes image-cache, then image-proxy, then changes
> in the dump code to support proxy, then changes in the restorer code to support cache.

Okay, I can do that.

> 
> > Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
> > 
> > diff -uprN criu-source/cr-dedup.c criu-patch/cr-dedup.c
> > --- criu-source/cr-dedup.c	2015-09-01 20:34:37.042773339 +0100
> > +++ criu-patch/cr-dedup.c	2015-09-02 02:22:45.725920125 +0100
> > @@ -11,6 +11,7 @@
> >  
> >  static int cr_dedup_one_pagemap(int pid);
> >  
> > +// TODO - Eventually patch this for remote usage?
> 
> Please, use /* */ style comments.

Okay, I will fix that.

> 
> >  int cr_dedup(void)
> >  {
> >  	int close_ret, ret = 0;
> > diff -uprN criu-source/cr-dump.c criu-patch/cr-dump.c
> > --- criu-source/cr-dump.c	2015-09-01 20:34:37.050773528 +0100
> > +++ criu-patch/cr-dump.c	2015-09-02 02:37:15.993970004 +0100
> > @@ -1550,6 +1552,10 @@ err:
> >  	if (disconnect_from_page_server())
> >  		ret = -1;
> >  
> > +        if (opts.remote) {
> 
> Something has happened with tab indentation.
> 
> > +            finish_remote_dump();
> > +        }
> > +
> >  	close_cr_imgset(&glob_imgset);
> >  
> >  	if (bfd_flush_images())
> 
> > diff: criu-source/crtools: No such file or directory
> > diff: criu-patch/crtools: No such file or directory
> > diff -uprN criu-source/crtools.c criu-patch/crtools.c
> > --- criu-source/crtools.c	2015-09-01 20:34:37.054773617 +0100
> > +++ criu-patch/crtools.c	2015-09-02 03:05:47.229581153 +0100
> > @@ -42,6 +42,8 @@
> >  
> >  #include "setproctitle.h"
> >  
> > +#include "image-remote.h"
> > +
> >  struct cr_options opts;
> >  
> >  void init_opts(void)
> > @@ -60,6 +62,8 @@ void init_opts(void)
> >  	opts.cpu_cap = CPU_CAP_DEFAULT;
> >  	opts.manage_cgroups = CG_MODE_DEFAULT;
> >  	opts.ps_socket = -1;
> > +	opts.addr = PROXY_FWD_HOST;
> > +	opts.ps_port = CACHE_PUT_PORT;
> 
> You reuse the existing opts fields. How would this correlate with the --page-server
> code?

Well, I didn't use the page server. These options are reused just for simplicity. 
When I use a remote dump or remote restore, I assume that the page server is not
being used, otherwise this would be a problem because they both use the same opts
fields.
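
A cheap guard against that clash would be to refuse the combination up front. A
minimal sketch, assuming a reduced and hypothetical option struct (sketch_opts,
its field names and sketch_check_opts() are not criu's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, reduced option block where both modes share addr/port. */
struct sketch_opts {
	bool remote;          /* --remote was given */
	bool use_page_server; /* --page-server was given */
	const char *addr;     /* shared by both modes */
	unsigned short port;  /* shared by both modes */
};

/* Returns 0 when the combination is usable, -1 otherwise. */
int sketch_check_opts(const struct sketch_opts *o)
{
	assert(o != NULL);

	if (o->remote && o->use_page_server) {
		fprintf(stderr, "--remote and --page-server cannot be combined\n");
		return -1;
	}

	return 0;
}
```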

> 
> >  	opts.ghost_limit = DEFAULT_GHOST_LIMIT;
> >  }
> >  
> 
> > diff -uprN criu-source/image.c criu-patch/image.c
> > --- criu-source/image.c	2015-09-01 20:34:37.058773708 +0100
> > +++ criu-patch/image.c	2015-09-02 02:57:48.502419478 +0100
> > @@ -336,6 +347,72 @@ static int do_open_image(struct cr_img *
> >  	if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> >  		goto skip_magic;
> >  
> > +	if (flags == O_RDONLY) {
> > +		ret = img_check_magic(img, oflags, type, path);
> > +        }
> > +	else {
> > +		ret = img_write_magic(img, oflags, type);
> > +        }
> > +	if (ret)
> > +		goto err;
> > +
> > +skip_magic:
> > +	return 0;
> > +
> > +err:
> > +	return -1;
> > +}
> > +
> > +static int do_open_remote_image(struct cr_img *img, int dfd, int type, unsigned long oflags, char *path)
> > +{
> > +	int ret, flags;
> > +
> > +	flags = oflags & ~(O_NOBUF | O_SERVICE);
> > +        
> > +        if(dfd == get_service_fd(IMG_FD_OFF) || dfd == -1)
> > +            dfd = get_current_namespace_fd();
> 
> I didn't quite get the idea of namespaces. Can you describe it in more detail, please?

Namespaces are used to tag images according to their hierarchy. When you use pre-dumps, you
create a symlink ("parent") pointing to the directory where the pre-dump was stored. This
creates a relation between the current dump/pre-dump and the previous one. You then open this
link and use the file descriptor to continue operating (reading the pagemap of the previous
pre-dump for example).

I can't do this because images are stored in the image-proxy (memory cache). I solve this in
a somewhat naive way. When a dump/pre-dump starts, I take the image working dir (given by the
user) and tag all images produced with that namespace (the working dir). When I see that a
dump/pre-dump refers to a previous image dir, I inform the image-proxy that the current
working dir is a child of the one given as "--prev-images-dir".

Then, when criu tries to open the "parent" link, it gets (only in remote mode) an
identifier (not a real descriptor) that represents the parent namespace. This identifier is
used to select the desired namespace when the image is opened.
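
The bookkeeping described above can be sketched as a small table that maps an
identifier to a working dir and to its parent. This is only an illustration under
assumptions: the sketch_ns_* names, the table size and the identifier numbering
are hypothetical and differ from what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NS  16
#define SKETCH_PATHLEN 256

struct sketch_ns {
	char name[SKETCH_PATHLEN]; /* image working dir, used as the tag */
	int parent;                /* identifier of the parent namespace, or -1 */
};

static struct sketch_ns ns_table[SKETCH_MAX_NS];
static int ns_count;

/* Look up an identifier by working dir; returns -1 if unknown. */
int sketch_ns_find(const char *name)
{
	for (int i = 0; i < ns_count; i++)
		if (strcmp(ns_table[i].name, name) == 0)
			return i;
	return -1;
}

/*
 * Register a working dir. prev_dir plays the role of --prev-images-dir
 * and may be NULL. Returns the new identifier or -1 on error.
 */
int sketch_ns_add(const char *name, const char *prev_dir)
{
	int parent = -1;

	assert(name != NULL);

	if (ns_count == SKETCH_MAX_NS || strlen(name) >= SKETCH_PATHLEN)
		return -1;
	if (sketch_ns_find(name) >= 0)
		return -1; /* already registered */
	if (prev_dir != NULL) {
		parent = sketch_ns_find(prev_dir);
		if (parent < 0)
			return -1; /* the parent must be known first */
	}

	strcpy(ns_table[ns_count].name, name);
	ns_table[ns_count].parent = parent;
	return ns_count++;
}

/* What opening the "parent" link resolves to in remote mode. */
int sketch_ns_parent(int id)
{
	if (id < 0 || id >= ns_count)
		return -1;
	return ns_table[id].parent;
}

/* Translate an identifier back to its tag; NULL if it is not registered. */
const char *sketch_ns_name(int id)
{
	if (id < 0 || id >= ns_count)
		return NULL;
	return ns_table[id].name;
}
```

With the directories from the example, registering "/tmp/pre-dump" first and then
"/tmp/dump" with "/tmp/pre-dump" as its previous dir lets a pagemap lookup walk
from the dump to the pre-dump without any symlink on disk.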

> 
> > +        
> > +        // TODO - fix this. Find out what is the purpose of this file.
> > +        if(!strcmp("irmap-cache", path)) {
> > +            ret = -1;
> > +        }
> > +        else if(get_namespace(dfd) == NULL) {
> > +            ret = -1;
> > +        }
> > +        else if (flags == O_RDONLY) {
> > +            pr_info("do_open_remote_image RDONLY path=%s namespace=%s\n", 
> > +                    path, get_namespace(dfd));
> > +            ret = get_remote_image_connection(get_namespace(dfd), path);
> > +        }
> > +        else {
> > +            pr_info("do_open_remote_image WDONLY path=%s namespace=%s\n", 
> > +                    path, get_namespace(dfd));
> > +            ret = open_remote_image_connection(get_namespace(dfd), path);
> > +        }
> > +        
> > +        if (ret < 0) {
> > +            pr_info("No %s (dfd=%d) image\n", path, dfd);
> > +            img->_x.fd = EMPTY_IMG_FD;
> > +            goto skip_magic;
> > +	}
> > +        
> > +
> > +	img->_x.fd = ret;
> > +	if (oflags & O_NOBUF)
> > +		bfd_setraw(&img->_x);
> > +	else {
> > +		if (flags == O_RDONLY)
> > +			ret = bfdopenr(&img->_x);
> > +		else
> > +			ret = bfdopenw(&img->_x);
> > +
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> > +		goto skip_magic;
> > +
> >  	if (flags == O_RDONLY)
> >  		ret = img_check_magic(img, oflags, type, path);
> >  	else
> 
> > diff -uprN criu-source/image-remote.c criu-patch/image-remote.c
> > --- criu-source/image-remote.c	1970-01-01 01:00:00.000000000 +0100
> > +++ criu-patch/image-remote.c	2015-09-02 02:18:33.548099686 +0100
> > @@ -0,0 +1,281 @@
> > +#include <unistd.h>
> > +#include <stdlib.h>
> > +#include <sys/types.h> 
> > +#include <sys/socket.h>
> > +#include <netinet/in.h>
> > +#include <netdb.h>
> > +
> > +#include <pthread.h>
> > +#include <semaphore.h>
> > +
> > +#include "criu-log.h"
> > +#include "image-remote.h"
> > +
> > +// TODO - fix space limitation
> > +static char parents[PATHLEN][PATHLEN]; 
> > +static int  parents_occ = 0;
> > +static char* namespace = NULL;
> > +// TODO - not used for now. It will be used if we implement a shared cache and proxy.
> > +static char* parent = NULL; 
> > +
> > +int setup_local_client_connection(int port) 
> > +{
> > +        int sockfd;
> > +        struct sockaddr_in serv_addr;
> > +        struct hostent *server;
> > +
> > +        sockfd = socket(AF_INET, SOCK_STREAM, 0);
> > +        if (sockfd < 0) {
> > +                pr_perror("Unable to open remote image socket to img cache");
> > +                return -1;
> > +        }
> > +
> > +        server = gethostbyname(DEFAULT_HOST);
> > +        if (server == NULL) {
> > +                pr_perror("Unable to get host by name (%s)", DEFAULT_HOST);
> > +                return -1;
> > +        }
> > +
> > +        bzero((char *) &serv_addr, sizeof (serv_addr));
> > +        serv_addr.sin_family = AF_INET;
> > +        bcopy((char *) server->h_addr,
> > +              (char *) &serv_addr.sin_addr.s_addr,
> > +              server->h_length);
> > +        serv_addr.sin_port = htons(port);
> > +
> > +        if (connect(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
> > +                pr_perror("Unable to connect to remote restore host %s", DEFAULT_HOST);
> > +                return -1;
> > +        }
> > +
> > +        return sockfd;
> > +}
> > +
> > +int write_header(int fd, char* namespace, char* path)
> 
> You seem to be using text-based protocol for images transfers, don't you?

Yes... When I read or write an image I always write the name and namespace of the 
image before its actual content. This is used to identify the correct images when
they are requested later on.
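
A minimal sketch of that framing, with two fixed-size fields (path first, then
namespace) sent ahead of the image data. The sketch_* names and the value of
SKETCH_PATHLEN are assumptions. Unlike a plain write(fd, path, PATHLEN), it copies
each string into a zero-padded buffer first, so a short string never causes a read
past its end, and it loops on short reads and writes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PATHLEN 256 /* assumed; stands in for the patch's PATHLEN */

/* Write exactly len bytes, retrying on short writes and EINTR. */
int sketch_write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Read exactly len bytes; fails on error or if the stream ends early. */
int sketch_read_all(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Send one string as a zero-padded field of SKETCH_PATHLEN bytes. */
int sketch_send_field(int fd, const char *s)
{
	char field[SKETCH_PATHLEN];

	if (strlen(s) >= sizeof(field))
		return -1;
	memset(field, 0, sizeof(field));
	strcpy(field, s);
	return sketch_write_all(fd, field, sizeof(field));
}

/* Header layout: path field, then namespace field. */
int sketch_write_header(int fd, const char *ns, const char *path)
{
	assert(ns != NULL && path != NULL);

	if (sketch_send_field(fd, path) < 0)
		return -1;
	return sketch_send_field(fd, ns);
}

/* ns and path must each point to at least SKETCH_PATHLEN bytes. */
int sketch_read_header(int fd, char *ns, char *path)
{
	if (sketch_read_all(fd, path, SKETCH_PATHLEN) < 0)
		return -1;
	if (sketch_read_all(fd, ns, SKETCH_PATHLEN) < 0)
		return -1;
	/* Reject fields that are not NUL-terminated. */
	if (path[SKETCH_PATHLEN - 1] != '\0' || ns[SKETCH_PATHLEN - 1] != '\0')
		return -1;
	return 0;
}
```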

> 
> > +{
> > +        if (write(fd, path, PATHLEN) < 1) {
> > +                pr_perror("Unable to send path to remote image connection");
> > +                return -1;
> > +        }
> > +
> > +        if (write(fd, namespace, PATHLEN) < 1) {
> > +                pr_perror("Unable to send namespace to remote image connection");
> > +                return -1;
> > +        } 
> > +        return 0;
> > +}
> > +
> 
> > diff -uprN criu-source/page-read.c criu-patch/page-read.c
> > --- criu-source/page-read.c	2015-09-01 20:34:37.082774260 +0100
> > +++ criu-patch/page-read.c	2015-09-02 02:21:29.616164017 +0100
> > @@ -10,6 +10,8 @@
> >  #include "protobuf.h"
> >  #include "protobuf/pagemap.pb-c.h"
> >  
> > +#include "image-remote.h"
> > +
> >  #ifndef SEEK_DATA
> >  #define SEEK_DATA	3
> >  #define SEEK_HOLE	4
> > @@ -90,8 +92,17 @@ static void skip_pagemap_pages(struct pa
> >  		return;
> >  
> >  	pr_debug("\tpr%u Skip %lx bytes from page-dump\n", pr->id, len);
> > -	if (!pr->pe->in_parent)
> > -		lseek(img_raw_fd(pr->pi), len, SEEK_CUR);
> > +	if (!pr->pe->in_parent) {
> > +            if(opts.remote) {
> > +                    if(skip_remote_bytes(img_raw_fd(pr->pi), len) < 0)
> > +                            pr_perror("Unable to seek remote bytes");
> > +            }
> > +            else {
> > +                    if(lseek(img_raw_fd(pr->pi), len, SEEK_CUR) < 0)
> > +                            pr_perror("Unable to lseek");
> > +            }
> > +            	
> > +        }
> 
> The page-read engine is already modularized. Don't introduce if()-s in the
> existing code, just add new set of options. The open_page_read() selects
> one of them.

Okay, I will do that.
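
Roughly, I understand the suggestion as follows: the skip operation is picked once
when the page reader is opened, instead of being tested on every call. The sketch
below is not criu's struct page_read; sketch_page_read and the sketch_* functions
are hypothetical. The remote variant reads and discards bytes, because lseek() on
a socket fails with ESPIPE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, reduced stand-in for a page reader. */
struct sketch_page_read {
	int fd;
	int (*skip)(struct sketch_page_read *pr, unsigned long len);
};

/* File-backed images can simply seek forward. */
int sketch_skip_local(struct sketch_page_read *pr, unsigned long len)
{
	return lseek(pr->fd, (off_t)len, SEEK_CUR) < 0 ? -1 : 0;
}

/* Sockets cannot seek, so consume len bytes and throw them away. */
int sketch_skip_remote(struct sketch_page_read *pr, unsigned long len)
{
	char scratch[4096];

	while (len > 0) {
		size_t chunk = len < sizeof(scratch) ? len : sizeof(scratch);
		ssize_t n = read(pr->fd, scratch, chunk);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1; /* stream ended before len bytes arrived */
		len -= (unsigned long)n;
	}
	return 0;
}

/* The backend is chosen here; callers only ever use pr->skip(). */
void sketch_open_page_read(struct sketch_page_read *pr, int fd, bool remote)
{
	assert(pr != NULL);

	pr->fd = fd;
	pr->skip = remote ? sketch_skip_remote : sketch_skip_local;
}
```

skip_pagemap_pages() would then call pr->skip(pr, len) and check the result,
with no reference to opts.remote.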

> 
> >  	pr->cvaddr += len;
> >  }
> >  
> 
> > diff -uprN criu-source/page-xfer.c criu-patch/page-xfer.c
> > --- criu-source/page-xfer.c	2015-09-01 20:34:37.082774260 +0100
> > +++ criu-patch/page-xfer.c	2015-09-02 02:21:44.968518366 +0100
> > @@ -728,13 +730,21 @@ static int open_page_local_xfer(struct p
> >  		int ret;
> >  		int pfd;
> >  
> > -		pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> > -		if (pfd < 0 && errno == ENOENT)
> > -			goto out;
> > +		if(opts.remote) {
> > +                        pfd = get_current_namespace_fd() - 1;
> > +                        if(get_namespace(pfd) == NULL)
> > +                                goto out;
> > +                }
> > +                else {
> > +                        pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> > +                        if (pfd < 0 && errno == ENOENT)
> > +                                goto out;
> > +                }
> 
> We already have network transfer for pages data. How does this correlate with
> the new mode you introduce?

Currently it doesn't correlate at all. I didn't use the page server because it 
only works with pages (as far as I know). Extending it to support all types of
images seemed more difficult than extending the disk-backed images.

Most of the code that bridges the existing code with the new code (basically the 
functions in image-remote.h) is inserted inside if conditions. To improve
this, perhaps it would be good to abstract the image backend: files, sockets.

> 
> >  
> >  		xfer->parent = xmalloc(sizeof(*xfer->parent));
> >  		if (!xfer->parent) {
> > -			close(pfd);
> > +			if(!opts.remote)
> > +				close(pfd);
> >  			return -1;
> >  		}
> >  
> 
> -- Pavel
> 


-- 
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>

