[CRIU] Process Migration Using Sockets - PATCH
Rodrigo Bruno
rbruno at gsd.inesc-id.pt
Fri Sep 11 08:38:33 PDT 2015
On Thu, 10 Sep 2015 14:27:56 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:
> On 09/09/2015 03:30 AM, Rodrigo Bruno wrote:
>
> >> OK, so image-proxy is a multiplexer that merges several image flows into one
> >> network flow. This is clear.
> >>
> >> The image-cache as I see it is a page-server++ that accepts not only page images,
> >> but arbitrary ones. Then it feeds the images via socket into criu restore. So
> >> my question is -- why not make image-cache put images into tmpfs mount and then
> >> use regular criu restore?
> >
> > It could be done. However,
> > 1) the modifications introduced in "open_image_at" (image.c) to open remote images
> > instead of disk-backed also work for the restore side, so no more modifications needed;
> > 2) the image-cache and image-proxy share most of the code, only a couple of lines differ.
> > The two components (cache and proxy) are very similar; the only difference is that
> > the proxy forwards the images. Both components have one port for receiving and
> > caching images and one port for sending the cached images (the proxy needs to cache
> > images so that dumps/pre-dumps that need to open previous pagemaps and pages are able
> > to do so).
>
> OK, let's try to go this way. But I'd like to add one more requirement to the code
> (it can be addressed a bit later, but still before the 1.8).
>
> If you look at the page-server code you'll notice that it can inherit a connection
> socket, instead of creating its own one:
>
> int connect_to_page_server(void)
> {
> 	struct sockaddr_in saddr;
>
> 	if (!opts.use_page_server)
> 		return 0;
>
> 	if (opts.ps_socket != -1) {
> 		page_server_sk = opts.ps_socket;
> 		pr_info("Re-using ps socket %d\n", page_server_sk);
> 		return 0;
> 	}
>
> 	pr_info("Connecting to server %s:%u\n",
> 			opts.addr, (int)ntohs(opts.ps_port));
>
> 	page_server_sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
> 	...
>
> The if (opts.ps_socket != -1) check is the one. This is required to facilitate live
> migration cases for OpenVZ, LXC and, probably, Docker, which create all transport
> sockets on their own and would like criu to re-use those.
>
> So for your proxy and cache engines socket inheritance should also work, otherwise
> nobody will be able to use them IRL.
Okay, no problem.
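
For illustration, a minimal sketch of how the proxy and cache could follow the same
pattern as the connect_to_page_server() excerpt above. The helper name
img_transport_sk and its parameters are hypothetical; this is not existing CRIU code,
only an assumption about how an inherited socket could be preferred over a new one:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CRIU): return the socket inherited from
 * the caller if there is one, otherwise open a new TCP connection to
 * addr:port. Returns a socket descriptor, or -1 on error.
 */
static int img_transport_sk(int inherited_sk, const char *addr, unsigned short port)
{
	struct sockaddr_in saddr;
	int sk;

	/* The caller (e.g. a container runtime) already set up the transport. */
	if (inherited_sk != -1)
		return inherited_sk;

	memset(&saddr, 0, sizeof(saddr));
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	/* Reject anything that is not a dotted-quad IPv4 address. */
	if (inet_pton(AF_INET, addr, &saddr.sin_addr) != 1)
		return -1;

	sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sk < 0)
		return -1;

	if (connect(sk, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) {
		close(sk);
		return -1;
	}

	return sk;
}
```

With an inherited descriptor the function returns immediately and never touches the
network, which is the behaviour the page-server code has for opts.ps_socket.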
>
>
> >> Can you describe your protocol in more detail? What I don't quite understand
> >> is how you mix text info with binary data (images and pages) and how you define
> >> borders between objects.
> >
> > Yes. Two communications can exist: writing and reading remote images. They both use the
> > same protocol:
> >
> > 1. open socket on read or write port (both cache and proxy have one of each)
> > 2. write image pathname (32 bytes)
> > 3. write image namespace (32 bytes)
> > 4. read image pathname (32 bytes)
> > 5. read image namespace (32 bytes)
>
> I see. I would still suggest switching this header to protobuf format so that we
> can extend it later (and easily drop the 32-byte limitation).
>
> > The 32 bytes is a constant. It is a limitation on the size of the names and namespaces. It
> > could be solved by adding a first field (size).
> >
> > Steps 4 and 5 are used to check if the image exists (if it doesn't, an error is read from the
> > socket).
> >
> > Then, I simply return the socket file descriptor and let criu use it for writing (dump) or
> > reading (restore).
> >
> > Note that I do not unpack the objects being sent (pagemap entries for example). Nor do
> > I check whether the file was sent correctly. When the socket FD is closed I assume that
> > the operation (reading or writing) is complete and I let the other side (restore for
> > example) check whether the image is correct.
>
> I see one problem with it. The other side cannot always verify the image correctness. E.g.
> if a single object is lost from an image, this can only be found out at the actual restore
> time, and not always even then. E.g. fdtable.img contains file descriptors. If one is lost
> from there, a task will be restored w/o that descriptor and criu has no clues to check this.
>
> We've seen this with the page-server, so to finalize the transfer we use a special command
> (PS_IOV_FLUSH) that requires a response from the server side indicating that everything is OK.
Okay, so two things to do:
a- initial "handshake" with protobuf objects, in which
   1- the side making the request sends a protobuf object with the name and namespace and
   2- the side answering replies with another protobuf object carrying an ack/nack
b- final "handshake", in which
   1- the sender reports that it is done writing and
   2- the receiver answers that everything is okay.
I will look at how PS_IOV_FLUSH is implemented to do the second one.
You want me to make these changes before sending the patches per component, right?
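
To make step (a) concrete, here is a rough sketch of the request/reply exchange. It
deliberately uses a plain C struct with the current 32-byte fields as a stand-in for
the protobuf message, so it shows only the flow, not the final wire format. All names
(struct img_req, img_req_init, img_handshake, IMG_ACK, IMG_NACK) are hypothetical:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define IMG_NAME_LEN 32	/* fixed field size of the current header */

/* Hypothetical stand-in for the request message (name + namespace). */
struct img_req {
	char name[IMG_NAME_LEN];
	char ns[IMG_NAME_LEN];
};

enum { IMG_ACK = 0, IMG_NACK = 1 };	/* hypothetical one-byte reply codes */

/* Fill a request; returns -1 if a string does not fit together with its NUL. */
static int img_req_init(struct img_req *r, const char *name, const char *ns)
{
	if (strlen(name) >= IMG_NAME_LEN || strlen(ns) >= IMG_NAME_LEN)
		return -1;
	memset(r, 0, sizeof(*r));
	strcpy(r->name, name);
	strcpy(r->ns, ns);
	return 0;
}

/* Write (do_write != 0) or read exactly len bytes; -1 on error or EOF. */
static int xfer_all(int sk, void *buf, size_t len, int do_write)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = do_write ? write(sk, p, len) : read(sk, p, len);

		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Step a: send the request, then wait for the reply. Returns 0 on ack. */
static int img_handshake(int sk, const char *name, const char *ns)
{
	struct img_req req;
	uint8_t reply;

	if (img_req_init(&req, name, ns) < 0)
		return -1;
	if (xfer_all(sk, &req, sizeof(req), 1) < 0)
		return -1;
	if (xfer_all(sk, &reply, sizeof(reply), 0) < 0)
		return -1;
	return reply == IMG_ACK ? 0 : -1;
}
```

The final handshake (b) could follow the same shape: the writer sends a "done" message
and blocks until the receiver answers, so a lost image is reported as an error instead
of being discovered (or not) at restore time.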
>
> >>
> >> Another thing about this protocol -- in the page-server protocol we use hand-made binary
> >> headers (packed structs) that precede individual extents (sets of pages). Now I think
> >> it was a mistake to use a fully hand-made protocol. Why not stick with a more standard
> >> one, or at least use protobuf-compiled headers? This would help us to at least solve
> >> the extensibility problem.
> >
> > I agree that hand-made protocols are usually a bad idea compared to using standard
> > approaches. Including the name and namespace of the images inside the actual
> > content of the images would be enough to avoid this hand-made protocol. Then I would
> > only need to read the header.
>
> OK, the packet sequence can be done by hand, but let's encode the headers you use into
> protobuf rather than plain strings.
>
>
> >>>> We already have network transfer for pages data. How does this correlate with
> >>>> the new mode you introduce?
> >>>
> >>> Currently it doesn't correlate at all. I didn't use the page server because it
> >>> only works with pages (as far as I know). Extending it to support all types of
> >>> images seemed more difficult than extending the disk-backed images.
> >>>
> >>> Most of the code that bridges the existing code with the new code (basically the
> >>> functions in image-remote.h) is inserted inside if conditions. In order to improve
> >>> this, perhaps it would be good to abstract the image backend: files, sockets.
> >>
> >> OK, so as I said above -- I'm OK to deprecate the page-server in favor of the newer
> >> image proxy and cache.
> >
> > Since this is more generic, in future, if you need/want, it will be easier to include
> > application files migration (to avoid NFS shares for example).
>
> Yup.
>
> -- Pavel
--
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>