[CRIU] Process Migration Using Sockets - PATCH

Fri Sep 11 08:38:33 PDT 2015

On Thu, 10 Sep 2015 14:27:56 +0300
Pavel Emelyanov <xemul at parallels.com> wrote:

> On 09/09/2015 03:30 AM, Rodrigo Bruno wrote:
> 
> >> OK, so image-proxy is a multiplexer that merges several image flows into one
> >> network flow. This is clear.
> >>
> >> The image-cache, as I see it, is a page-server++ that accepts not only page images,
> >> but arbitrary ones. Then it feeds the images via socket into criu restore. So
> >> my question is -- why not make image-cache put images into tmpfs mount and then
> >> use regular criu restore?
> > 
> > It could be done. However, 
> > 1) the modifications introduced in "open_image_at" (image.c) to open remote images 
> > instead of disk-backed also work for the restore side, so no more modifications needed; 
> > 2) the image-cache and image-proxy share most of the code, only a couple of lines differ. 
> > The two components (cache and proxy) are very similar; the only difference is that
> > the proxy forwards the images. Both components have one port for receiving and 
> > caching images and one port for sending the cached images (the proxy needs to cache 
> > images so that dumps/pre-dumps that need to open previous pagemaps and pages are able 
> > to do so).
> 
> OK, let's try to go this way. But I'd like to add one more requirement to the code
> (it can be addressed a bit later, but still before the 1.8).
> 
> If you look at the page-server code you'll notice that it can inherit a connection
> socket, instead of creating its own one:
> 
> int connect_to_page_server(void)
> {
>         struct sockaddr_in saddr;
> 
>         if (!opts.use_page_server)
>                 return 0;
> 
>         if (opts.ps_socket != -1) {
>                 page_server_sk = opts.ps_socket;
>                 pr_info("Re-using ps socket %d\n", page_server_sk);
>                 return 0;
>         }
> 
>         pr_info("Connecting to server %s:%u\n",
>                         opts.addr, (int)ntohs(opts.ps_port));
> 
>         page_server_sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
>         ...
> 
> the if (opts.ps_socket != -1) check is the one. This is required to facilitate live
> migration cases for OpenVZ, LXC and, probably, Docker, which create all transport
> sockets on their own and would like criu to re-use those.
> 
> So for your proxy and cache engines socket inheritance should also work, otherwise
> nobody will be able to use them IRL.

Okay, no problem.
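
A minimal sketch of how the proxy and cache could follow the same pattern,
assuming a hypothetical helper proxy_get_socket() whose inherited_fd argument
plays the role of opts.ps_socket (neither the name nor the signature exists in
CRIU today, and -1 is assumed to mean "no socket given"):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not CRIU code: return a connected transport socket.
 * inherited_fd mirrors opts.ps_socket; -1 means the caller gave us none. */
int proxy_get_socket(int inherited_fd, const char *addr, unsigned short port)
{
	struct sockaddr_in saddr;
	int sk;

	if (inherited_fd != -1)
		return inherited_fd;	/* transport was set up by the caller */

	assert(addr != NULL);		/* an address is required in this branch */

	sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sk < 0)
		return -1;

	memset(&saddr, 0, sizeof(saddr));
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);

	/* Only dotted-quad IPv4 addresses are handled in this sketch. */
	if (inet_pton(AF_INET, addr, &saddr.sin_addr) != 1 ||
	    connect(sk, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) {
		close(sk);
		return -1;
	}

	return sk;
}
```

With this shape, a container manager that already owns the connection would
pass its descriptor in and the connect path would never run.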

> 
> 
> >> Can you describe your protocol then in more detail. What I don't quite understand
> >> is how you mix text info with binary data (images and pages) and how you define
> >> borders between objects.
> > 
> > Yes. Two communications can exist: writing and reading remote images. They both use the
> > same protocol:
> > 
> > 1. open socket on read or write port (both cache and proxy have one of each)
> > 2. write image pathname (32 bytes)
> > 3. write image namespace (32 bytes)
> > 4. read image pathname (32 bytes)
> > 5. read image namespace (32 bytes)
> 
> I see. I would still suggest switching this header to protobuf format so that we
> could extend it later (and easily drop the 32-byte limitation).
> 
> > The 32 bytes is a constant. It is a limitation on the size of the names and namespaces. It 
> > could be solved by adding a first field (size).
> > 
> > Steps 4 and 5 are used to check if the image exists (if it doesn't, an error is read from the 
> > socket).
> > 
> > Then, I simply return the socket file descriptor and let criu use it for writing (dump) or 
> > reading (restore).
> > 
> > Note that I do not unpack the objects being sent (pagemap entries for example). Nor do I 
> > check whether the file was sent correctly. When the socket FD is closed I assume that the 
> > operation (reading or writing) is complete and I let the other side (restore for example)
> > check whether the image is correct.
> 
> I see one problem with it. The other side cannot always verify the image correctness. E.g.
> if a single object is lost from image, this can only be found out at the actual restore
> time, but not always. E.g. fdtable.img contains file descriptors. If one is lost from there
> a task will be restored w/o one descriptor and criu has no clues to check this.
> 
> We've seen this with the page-server, so to finalize the transfer we use a special command
> (PS_IOV_FLUSH) that requires a response from the server side indicating that everything is OK.

Okay, so two things to do:

a- initial "handshake" with protobuf object in which
	1- the side making the request sends a protobuf object with the name and namespace and 
	2- the side answering replies with another protobuf object with an ack/nack

b- final "handshake" in which
	1- the sender reports that it is done writing and
	2- the receiver answers that everything is okay.

I will see how PS_IOV_FLUSH is implemented to make the second one.
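
Roughly, the final handshake could look like the sketch below. All names here
(IMG_FINISH, IMG_ACK, IMG_NACK and the finish_*() helpers) are made up for
illustration, and a bare 32-bit command word stands in for the protobuf object
we agreed on, so this only shows the request/acknowledge flow, not the real
wire format:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical command codes; a protobuf message would replace these. */
enum { IMG_FINISH = 1, IMG_ACK = 2, IMG_NACK = 3 };

/* Read or write exactly sizeof(uint32_t) bytes, retrying on short I/O. */
static int xfer_u32(int sk, uint32_t *val, int do_write)
{
	char *p = (char *)val;
	size_t left = sizeof(*val);

	while (left) {
		ssize_t n = do_write ? write(sk, p, left) : read(sk, p, left);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;	/* error or peer closed early */
		p += n;
		left -= n;
	}
	return 0;
}

static int send_u32(int sk, uint32_t v)
{
	uint32_t wire = htonl(v);	/* network byte order on the wire */
	return xfer_u32(sk, &wire, 1);
}

static int recv_u32(int sk, uint32_t *v)
{
	uint32_t wire;

	if (xfer_u32(sk, &wire, 0) < 0)
		return -1;
	*v = ntohl(wire);
	return 0;
}

/* b.1: the sender reports that it is done writing. */
int finish_request(int sk)
{
	return send_u32(sk, IMG_FINISH);
}

/* b.2: the receiver answers; ok != 0 means the image arrived intact. */
int finish_answer(int sk, int ok)
{
	uint32_t cmd;

	if (recv_u32(sk, &cmd) < 0 || cmd != IMG_FINISH)
		return -1;
	return send_u32(sk, ok ? IMG_ACK : IMG_NACK);
}

/* Sender side: returns 0 only if the receiver acknowledged. */
int finish_wait_reply(int sk)
{
	uint32_t reply;

	if (recv_u32(sk, &reply) < 0)
		return -1;
	return reply == IMG_ACK ? 0 : -1;
}
```

The sender would call finish_request() and then finish_wait_reply() before
closing the socket, so a closed FD alone is no longer taken as success.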

You want me to make these changes before sending the patches per component, right?

> 
> >>
> >> Another thing about this protocol -- in page-server protocol we use hand-made binary
> >> headers (packed structs) that precede individual extents (sets of pages). Now I think
> >> it was a mistake to use a fully hand-made protocol. Why not stick with some more standard
> >> one, at least use protobuf-compiled headers. This would help us to at least solve
> >> the extensibility problem.
> > 
> > I agree that hand-made protocols are usually a bad idea compared to using standard 
> > approaches. Including the name and namespace of the images inside the actual
> > content of the images would be enough to avoid this hand-made protocol. Then I would
> > only need to read the header.
> 
> OK, the packet sequence can be done by hand, but let's encode headers you use into
> protobuf rather than plain strings.
> 
> 
> >>>> We already have network transfer for pages data. How does this correlate with
> >>>> the new mode you introduce?
> >>>
> >>> Currently it doesn't correlate at all. I didn't use the page server because it 
> >>> only works with pages (as far as I know). Extending it to support all types of
> >>> images seemed more difficult than extending the disk-backed images.
> >>>
> >>> Most of the code that bridges the existing code with the new code (basically the 
> >>> functions in image-remote.h) is inserted inside if conditions. In order to improve
> >>> this, perhaps it would be good to abstract the image backend: files, sockets.
> >>
> >> OK, so as I said above -- I'm OK to deprecate the page-server in favor of the newer
> >> image proxy and cache.
> > 
> > Since this is more generic, in future, if you need/want, it will be easier to include 
> > application files migration (to avoid NFS shares for example).
> 
> Yup.
> 
> -- Pavel

-- 
Rodrigo Bruno <rbruno at gsd.inesc-id.pt>