[CRIU] Process Migration Using Sockets - PATCH
Rodrigo Bruno
rbruno at gsd.inesc-id.pt
Mon Sep 7 15:47:05 PDT 2015
Hi,
I will answer inline; it's easier.
On Mon, 7 Sep 2015 13:34:54 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:
> On 09/02/2015 11:24 PM, Rodrigo Bruno wrote:
>
> Rodrigo, thanks for the patch. Sorry for the late response, I was busy with the
> 1.7 release. Now it's finished and we have some time for new cool features :)
>
> Find my comments inline.
>
> > The patch is listed below. The idea is to migrate processes without using disk-backed
> > images. Files used by these processes still need to be shared (NFS for example) to
> > enable full live migration. In future these files could also be transferred using
> > sockets.
> >
> > Two new entities are introduced: the image-proxy, and the image-cache. The image-proxy
> > receives the image files from the dump process and forwards them to the image-cache.
> > The image-cache waits for requests from the restore process.
>
> Can you shed a little bit more light on this: what's the way image-proxy gets images
> from criu dump and what's the way criu restore gets images from image-cache? Are these
> just unix sockets?
Yes. I patch (introduce an if) right before the image open. If the "opts.remote" is set
(this is a command line arg), I redirect the open to my new functions. Then I simply open
a socket and return the file descriptor (which is then used by the existing code).
>
> > Example:
> >
> > Target Node:
> > criu image-cache -vvv -o /tmp/image-cache.log --port <cache port> < /dev/null &
> > sudo criu restore -D /tmp/dump -d -vvvv -o /tmp/restore.log --remote && echo OK
> >
> > Source Node:
> > criu image-proxy -vvv -o /tmp/image-proxy.log --port <cache port> --address <target node> < /dev/null &
> > sudo criu pre-dump -D /tmp/pre-dump -d -vvvv -o /tmp/pre-dump.log -t $pid --remote
> > sudo criu dump -D /tmp/dump -d -vvvv -o /tmp/dump.log -t $pid --remote --prev-images-dir /tmp/pre-dump --track-mem
> >
> > The code is also available at https://github.com/rodrigo-bruno/criu (forked from CRIU).
> >
> > You can also test it locally. I have been using this to migrate OpenJDK processes.
> > If you ever decide to use this code, I would be glad to help, provide bug fixes, etc.
>
> I'd also appreciate it if you split the patch into a set. First there must go changes in
> the existing criu code that prepare one for easier further patches, then the component
> by component new stuff. E.g. first goes image-cache, then image-proxy, then changes
> in the dump code to support proxy, then changes in the restorer code to support cache.
Okay, I can do that.
>
> > Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
> >
> > diff -uprN criu-source/cr-dedup.c criu-patch/cr-dedup.c
> > --- criu-source/cr-dedup.c 2015-09-01 20:34:37.042773339 +0100
> > +++ criu-patch/cr-dedup.c 2015-09-02 02:22:45.725920125 +0100
> > @@ -11,6 +11,7 @@
> >
> > static int cr_dedup_one_pagemap(int pid);
> >
> > +// TODO - Eventually patch this for remote usage?
>
> Please, use /* */ style comments.
Okay, I will fix that.
>
> > int cr_dedup(void)
> > {
> > int close_ret, ret = 0;
> > diff -uprN criu-source/cr-dump.c criu-patch/cr-dump.c
> > --- criu-source/cr-dump.c 2015-09-01 20:34:37.050773528 +0100
> > +++ criu-patch/cr-dump.c 2015-09-02 02:37:15.993970004 +0100
> > @@ -1550,6 +1552,10 @@ err:
> > if (disconnect_from_page_server())
> > ret = -1;
> >
> > + if (opts.remote) {
>
> Something has happened with tab indentation.
>
> > + finish_remote_dump();
> > + }
> > +
> > close_cr_imgset(&glob_imgset);
> >
> > if (bfd_flush_images())
>
> > diff: criu-source/crtools: No such file or directory
> > diff: criu-patch/crtools: No such file or directory
> > diff -uprN criu-source/crtools.c criu-patch/crtools.c
> > --- criu-source/crtools.c 2015-09-01 20:34:37.054773617 +0100
> > +++ criu-patch/crtools.c 2015-09-02 03:05:47.229581153 +0100
> > @@ -42,6 +42,8 @@
> >
> > #include "setproctitle.h"
> >
> > +#include "image-remote.h"
> > +
> > struct cr_options opts;
> >
> > void init_opts(void)
> > @@ -60,6 +62,8 @@ void init_opts(void)
> > opts.cpu_cap = CPU_CAP_DEFAULT;
> > opts.manage_cgroups = CG_MODE_DEFAULT;
> > opts.ps_socket = -1;
> > + opts.addr = PROXY_FWD_HOST;
> > + opts.ps_port = CACHE_PUT_PORT;
>
> You reuse the existing opts fields. How would this correlate with the --page-server
> code?
Well, I didn't use the page server. These options are reused just for simplicity.
When I use a remote dump or remote restore, I assume that the page server is not
being used, otherwise this would be a problem because they both use the same opts
fields.
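A minimal sketch of how that assumption could be enforced instead of left implicit. The struct and function names here are hypothetical stand-ins, not the real cr_options layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the two relevant option flags. */
struct sketch_opts {
	bool remote;          /* set by --remote */
	bool use_page_server; /* set when the page server is requested */
};

/*
 * Both modes store their endpoint in the same address/port fields,
 * so reject the combination up front. Returns 0 if the options are
 * compatible, -1 otherwise.
 */
static int sketch_check_opts(const struct sketch_opts *o)
{
	assert(o != NULL);

	if (o->remote && o->use_page_server) {
		fprintf(stderr, "remote images and page server cannot be combined\n");
		return -1;
	}
	return 0;
}
```

Giving the remote mode its own address and port fields would remove the need for such a check.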
>
> > opts.ghost_limit = DEFAULT_GHOST_LIMIT;
> > }
> >
>
> > diff -uprN criu-source/image.c criu-patch/image.c
> > --- criu-source/image.c 2015-09-01 20:34:37.058773708 +0100
> > +++ criu-patch/image.c 2015-09-02 02:57:48.502419478 +0100
> > @@ -336,6 +347,72 @@ static int do_open_image(struct cr_img *
> > if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> > goto skip_magic;
> >
> > + if (flags == O_RDONLY) {
> > + ret = img_check_magic(img, oflags, type, path);
> > + }
> > + else {
> > + ret = img_write_magic(img, oflags, type);
> > + }
> > + if (ret)
> > + goto err;
> > +
> > +skip_magic:
> > + return 0;
> > +
> > +err:
> > + return -1;
> > +}
> > +
> > +static int do_open_remote_image(struct cr_img *img, int dfd, int type, unsigned long oflags, char *path)
> > +{
> > + int ret, flags;
> > +
> > + flags = oflags & ~(O_NOBUF | O_SERVICE);
> > +
> > + if(dfd == get_service_fd(IMG_FD_OFF) || dfd == -1)
> > + dfd = get_current_namespace_fd();
>
> I didn't quite get the idea of namespaces. Can you describe it in more detail, please?
Namespaces are used to tag images according to their hierarchy. When you use pre-dumps, you
create a symlink ("parent") pointing to the directory where the pre-dump was stored. This
creates a relation between the current dump/pre-dump and the previous one. You then open this
link and use the file descriptor to continue operating (reading the pagemap of the previous
pre-dump for example).
I can't do this because images are stored in the image-proxy (memory cache). I solve this in
a kind of naive way. When a dump/pre-dump starts, I take the image working dir (given by the
user) and tag all images produced with the namespace (the working dir). When I see that a
dump/pre-dump is referring to a previous image dir, I inform the image proxy that the current
working dir is a child of the one given as "--prev-images-dir".
Then, when criu tries to open the "parent" link, it will (only when it is in remote mode) receive
an identifier (not a real descriptor) that represents the parent namespace. This identifier is
used to indicate the desired namespace when the image is opened.
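To make the idea concrete, here is a small self-contained sketch of such a namespace table. It is an illustration only, not the code from the patch: the names (ns_table, ns_register, ns_set_parent, ns_parent_dir) and the size limits are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NS_MAX     16  /* hypothetical capacity */
#define NS_PATHLEN 256 /* hypothetical maximum length of a working dir */

struct ns_table {
	char dirs[NS_MAX][NS_PATHLEN]; /* namespace = image working dir */
	int parent[NS_MAX];            /* id of the parent namespace, -1 if none */
	int count;
};

static void ns_init(struct ns_table *t)
{
	assert(t != NULL);
	t->count = 0;
}

/* Return the id of dir, registering it first if it is unknown. -1 on error. */
static int ns_register(struct ns_table *t, const char *dir)
{
	int i;

	for (i = 0; i < t->count; i++)
		if (strcmp(t->dirs[i], dir) == 0)
			return i;

	if (t->count == NS_MAX || strlen(dir) >= NS_PATHLEN)
		return -1;

	strcpy(t->dirs[t->count], dir);
	t->parent[t->count] = -1;
	return t->count++;
}

/* Record that child was dumped with --prev-images-dir pointing at parent. */
static int ns_set_parent(struct ns_table *t, const char *child, const char *parent)
{
	int p = ns_register(t, parent);
	int c = ns_register(t, child);

	if (p < 0 || c < 0)
		return -1;
	t->parent[c] = p;
	return 0;
}

/* What opening the "parent" link resolves to in remote mode; NULL if none. */
static const char *ns_parent_dir(const struct ns_table *t, int id)
{
	if (id < 0 || id >= t->count || t->parent[id] < 0)
		return NULL;
	return t->dirs[t->parent[id]];
}
```

With the example from the beginning of the mail, registering /tmp/dump as a child of /tmp/pre-dump makes a lookup on the id of /tmp/dump return "/tmp/pre-dump", and a lookup on the id of /tmp/pre-dump return NULL.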
>
> > +
> > + // TODO - fix this. Find out what is the purpose of this file.
> > + if(!strcmp("irmap-cache", path)) {
> > + ret = -1;
> > + }
> > + else if(get_namespace(dfd) == NULL) {
> > + ret = -1;
> > + }
> > + else if (flags == O_RDONLY) {
> > + pr_info("do_open_remote_image RDONLY path=%s namespace=%s\n",
> > + path, get_namespace(dfd));
> > + ret = get_remote_image_connection(get_namespace(dfd), path);
> > + }
> > + else {
> > + pr_info("do_open_remote_image WDONLY path=%s namespace=%s\n",
> > + path, get_namespace(dfd));
> > + ret = open_remote_image_connection(get_namespace(dfd), path);
> > + }
> > +
> > + if (ret < 0) {
> > + pr_info("No %s (dfd=%d) image\n", path, dfd);
> > + img->_x.fd = EMPTY_IMG_FD;
> > + goto skip_magic;
> > + }
> > +
> > +
> > + img->_x.fd = ret;
> > + if (oflags & O_NOBUF)
> > + bfd_setraw(&img->_x);
> > + else {
> > + if (flags == O_RDONLY)
> > + ret = bfdopenr(&img->_x);
> > + else
> > + ret = bfdopenw(&img->_x);
> > +
> > + if (ret)
> > + goto err;
> > + }
> > +
> > + if (imgset_template[type].magic == RAW_IMAGE_MAGIC)
> > + goto skip_magic;
> > +
> > if (flags == O_RDONLY)
> > ret = img_check_magic(img, oflags, type, path);
> > else
>
> > diff -uprN criu-source/image-remote.c criu-patch/image-remote.c
> > --- criu-source/image-remote.c 1970-01-01 01:00:00.000000000 +0100
> > +++ criu-patch/image-remote.c 2015-09-02 02:18:33.548099686 +0100
> > @@ -0,0 +1,281 @@
> > +#include <unistd.h>
> > +#include <stdlib.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <netinet/in.h>
> > +#include <netdb.h>
> > +
> > +#include <pthread.h>
> > +#include <semaphore.h>
> > +
> > +#include "criu-log.h"
> > +#include "image-remote.h"
> > +
> > +// TODO - fix space limitation
> > +static char parents[PATHLEN][PATHLEN];
> > +static int parents_occ = 0;
> > +static char* namespace = NULL;
> > +// TODO - not used for now. It will be used if we implement a shared cache and proxy.
> > +static char* parent = NULL;
> > +
> > +int setup_local_client_connection(int port)
> > +{
> > + int sockfd;
> > + struct sockaddr_in serv_addr;
> > + struct hostent *server;
> > +
> > + sockfd = socket(AF_INET, SOCK_STREAM, 0);
> > + if (sockfd < 0) {
> > + pr_perror("Unable to open remote image socket to img cache");
> > + return -1;
> > + }
> > +
> > + server = gethostbyname(DEFAULT_HOST);
> > + if (server == NULL) {
> > + pr_perror("Unable to get host by name (%s)", DEFAULT_HOST);
> > + return -1;
> > + }
> > +
> > + bzero((char *) &serv_addr, sizeof (serv_addr));
> > + serv_addr.sin_family = AF_INET;
> > + bcopy((char *) server->h_addr,
> > + (char *) &serv_addr.sin_addr.s_addr,
> > + server->h_length);
> > + serv_addr.sin_port = htons(port);
> > +
> > + if (connect(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
> > + pr_perror("Unable to connect to remote restore host %s", DEFAULT_HOST);
> > + return -1;
> > + }
> > +
> > + return sockfd;
> > +}
> > +
> > +int write_header(int fd, char* namespace, char* path)
>
> You seem to be using text-based protocol for images transfers, don't you?
Yes... When I read or write an image I always write the name and namespace of the
image before its actual content. This is used to identify the correct images when
they are requested later on.
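A sketch of this framing, under stated assumptions: two fixed-size, zero-padded fields (path first, then namespace, as in write_header) precede the image content. SKETCH_PATHLEN and the sketch_* and hdr_* names are hypothetical; the real PATHLEN value is not shown here. Unlike the patch, this version pads the fields and loops on short reads and writes.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PATHLEN 64 /* assumed field size */

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int hdr_write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Read exactly len bytes; an early EOF counts as an error. */
static int hdr_read_all(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Send path, then namespace, each as a zero-padded fixed-size field. */
static int sketch_write_header(int fd, const char *ns, const char *path)
{
	char field[SKETCH_PATHLEN];

	assert(ns != NULL && path != NULL);
	if (strlen(ns) >= SKETCH_PATHLEN || strlen(path) >= SKETCH_PATHLEN)
		return -1;

	memset(field, 0, sizeof(field));
	strcpy(field, path);
	if (hdr_write_all(fd, field, sizeof(field)) < 0)
		return -1;

	memset(field, 0, sizeof(field));
	strcpy(field, ns);
	return hdr_write_all(fd, field, sizeof(field));
}

/* Receive the two fields in the same order they were sent. */
static int sketch_read_header(int fd, char ns[SKETCH_PATHLEN], char path[SKETCH_PATHLEN])
{
	if (hdr_read_all(fd, path, SKETCH_PATHLEN) < 0)
		return -1;
	return hdr_read_all(fd, ns, SKETCH_PATHLEN);
}
```

Padding through a local buffer avoids reading past the end of a short string when a full field is written.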
>
> > +{
> > + if (write(fd, path, PATHLEN) < 1) {
> > + pr_perror("Unable to send path to remote image connection");
> > + return -1;
> > + }
> > +
> > + if (write(fd, namespace, PATHLEN) < 1) {
> > + pr_perror("Unable to send namespace to remote image connection");
> > + return -1;
> > + }
> > + return 0;
> > +}
> > +
>
> > diff -uprN criu-source/page-read.c criu-patch/page-read.c
> > --- criu-source/page-read.c 2015-09-01 20:34:37.082774260 +0100
> > +++ criu-patch/page-read.c 2015-09-02 02:21:29.616164017 +0100
> > @@ -10,6 +10,8 @@
> > #include "protobuf.h"
> > #include "protobuf/pagemap.pb-c.h"
> >
> > +#include "image-remote.h"
> > +
> > #ifndef SEEK_DATA
> > #define SEEK_DATA 3
> > #define SEEK_HOLE 4
> > @@ -90,8 +92,17 @@ static void skip_pagemap_pages(struct pa
> > return;
> >
> > pr_debug("\tpr%u Skip %lx bytes from page-dump\n", pr->id, len);
> > - if (!pr->pe->in_parent)
> > - lseek(img_raw_fd(pr->pi), len, SEEK_CUR);
> > + if (!pr->pe->in_parent) {
> > + if(opts.remote) {
> > + if(skip_remote_bytes(img_raw_fd(pr->pi), len) < 0)
> > + pr_perror("Unable to seek remote bytes");
> > + }
> > + else {
> > + if(lseek(img_raw_fd(pr->pi), len, SEEK_CUR) < 0)
> > + pr_perror("Unable to lseek");
> > + }
> > +
> > + }
>
> The page-read engine is already modularized. Don't introduce if()-s in the
> existing code, just add new set of options. The open_page_read() selects
> one of them.
Okay, I will do that.
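Something along these lines, as a rough sketch: the reader carries a skip callback chosen once at open time, so the skipping code itself has no if (opts.remote). The struct and function names are hypothetical and do not match the real page-read structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, reduced reader: one fd plus the operation that differs. */
struct sketch_reader {
	int fd;
	int (*skip)(struct sketch_reader *r, size_t len);
};

/* Disk-backed image: the fd is seekable, so just move the offset. */
static int sketch_skip_local(struct sketch_reader *r, size_t len)
{
	return lseek(r->fd, (off_t)len, SEEK_CUR) < 0 ? -1 : 0;
}

/* Socket-backed image: not seekable, so read and discard the bytes. */
static int sketch_skip_remote(struct sketch_reader *r, size_t len)
{
	char buf[4096];

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t n = read(r->fd, buf, want);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1; /* stream ended before len bytes were skipped */
		len -= (size_t)n;
	}
	return 0;
}

/* The single place where the backend is selected. */
static void sketch_reader_init(struct sketch_reader *r, int fd, bool remote)
{
	assert(r != NULL);
	r->fd = fd;
	r->skip = remote ? sketch_skip_remote : sketch_skip_local;
}
```

The caller then does r.skip(&r, len) regardless of where the image lives.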
>
> > pr->cvaddr += len;
> > }
> >
>
> > diff -uprN criu-source/page-xfer.c criu-patch/page-xfer.c
> > --- criu-source/page-xfer.c 2015-09-01 20:34:37.082774260 +0100
> > +++ criu-patch/page-xfer.c 2015-09-02 02:21:44.968518366 +0100
> > @@ -728,13 +730,21 @@ static int open_page_local_xfer(struct p
> > int ret;
> > int pfd;
> >
> > - pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> > - if (pfd < 0 && errno == ENOENT)
> > - goto out;
> > + if(opts.remote) {
> > + pfd = get_current_namespace_fd() - 1;
> > + if(get_namespace(pfd) == NULL)
> > + goto out;
> > + }
> > + else {
> > + pfd = openat(get_service_fd(IMG_FD_OFF), CR_PARENT_LINK, O_RDONLY);
> > + if (pfd < 0 && errno == ENOENT)
> > + goto out;
> > + }
>
> We already have network transfer for pages data. How does this correlate with
> the new mode you introduce?
Currently it doesn't correlate at all. I didn't use the page server because it
only works with pages (as far as I know). Extending it to support all types of
images seemed more difficult than extending the disk-backed images.
Most of the code that bridges the existing code with the new code (basically the
functions in image-remote.h) is inserted inside if conditions. In order to improve
this, perhaps it would be good to abstract the image backend: files, sockets.
>
> >
> > xfer->parent = xmalloc(sizeof(*xfer->parent));
> > if (!xfer->parent) {
> > - close(pfd);
> > + if(!opts.remote)
> > + close(pfd);
> > return -1;
> > }
> >
>
> -- Pavel
>
--
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>