[CRIU] Process Migration Using Sockets - PATCH

Thu Sep 10 04:27:56 PDT 2015

On 09/09/2015 03:30 AM, Rodrigo Bruno wrote:

>> OK, so image-proxy is a multiplexer that merges several image flows into one
>> network flow. This is clear.
>>
>> The image-cache, as I see it, is a page-server++ that accepts not only page images,
>> but arbitrary ones. Then it feeds the images via socket into criu restore. So
>> my question is -- why not make image-cache put images into tmpfs mount and then
>> use regular criu restore?
> 
> It could be done. However, 
> 1) the modifications introduced in "open_image_at" (image.c) to open remote images 
> instead of disk-backed also work for the restore side, so no more modifications needed; 
> 2) the image-cache and image-proxy share most of the code, only a couple of lines differ. 
> The two components (cache and proxy) are very similar; the only difference is that
> the proxy forwards the images. Both components have one port for receiving and 
> caching images and one port for sending the cached images (the proxy needs to cache 
> images so that dumps/pre-dumps that need to open previous pagemaps and pages are able 
> to do so).

OK, let's try to go this way. But I'd like to add one more requirement to the code
(it can be addressed a bit later, but still before the 1.8).

If you look at the page-server code you'll notice that it can inherit a connection
socket, instead of creating its own one:

int connect_to_page_server(void)
{
        struct sockaddr_in saddr;

        if (!opts.use_page_server)
                return 0;

        if (opts.ps_socket != -1) {
                page_server_sk = opts.ps_socket;
                pr_info("Re-using ps socket %d\n", page_server_sk);
                return 0;
        }

        pr_info("Connecting to server %s:%u\n",
                        opts.addr, (int)ntohs(opts.ps_port));

        page_server_sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        ...

The if (opts.ps_socket != -1) branch is the one. This is required to facilitate live
migration cases for OpenVZ, LXC and, probably, Docker, which create all transport sockets
on their own and would like criu to re-use those.

So socket inheritance should also work for your proxy and cache engines, otherwise
nobody will be able to use them IRL.
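
To illustrate what I mean, here is a rough sketch of the same pattern for the
proxy/cache side. The struct img_opts, its fields and img_proxy_get_socket() are
made-up names for this example, not existing criu code; only the "re-use the fd if
one was passed, otherwise connect" logic is taken from connect_to_page_server().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical options, modelled on opts.ps_socket; not actual criu fields. */
struct img_opts {
        int inherited_sk;       /* -1 when the caller passed no socket */
        const char *addr;       /* IPv4 address in dotted-quad form */
        unsigned short port;    /* port in host byte order */
};

/* Return a connected socket: the inherited one if present, else a new one. */
int img_proxy_get_socket(const struct img_opts *o)
{
        struct sockaddr_in saddr;
        int sk;

        assert(o != NULL);

        if (o->inherited_sk != -1) {
                /* The caller (e.g. a container manager) owns the transport. */
                fprintf(stderr, "Re-using socket %d\n", o->inherited_sk);
                return o->inherited_sk;
        }

        sk = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (sk < 0)
                return -1;

        memset(&saddr, 0, sizeof(saddr));
        saddr.sin_family = AF_INET;
        saddr.sin_port = htons(o->port);

        /* Fail cleanly on a malformed address or a refused connection. */
        if (inet_pton(AF_INET, o->addr, &saddr.sin_addr) != 1 ||
            connect(sk, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) {
                close(sk);
                return -1;
        }

        return sk;
}
```

With this shape the caller decides who owns the transport: a tool that already has
a connected socket passes its fd, everybody else passes -1 plus an address.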

>> Can you then describe your protocol in more detail? What I don't quite understand
>> is how you mix text info with binary data (images and pages) and how you define
>> borders between objects.
> 
> Yes. Two communications can exist: writing and reading remote images. They both use the
> same protocol:
> 
> 1. open socket on read or write port (both cache and proxy have one of each)
> 2. write image pathname (32 bytes)
> 3. write image namespace (32 bytes)
> 4. read image pathname (32 bytes)
> 5. read image namespace (32 bytes)

I see. I would still suggest switching this header to the protobuf format so that we
could extend it later (and easily drop the 32-byte limitation).
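
For reference, this is roughly how I read the fixed-size header exchange from the
steps above. It is only a sketch under my own assumptions: the function names, the
zero padding and the "reject instead of truncate" behaviour are made up here and
are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>

#define IMG_NAME_LEN 32         /* the 32-byte constant from the description */

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;

        while (len > 0) {
                ssize_t n = write(fd, p, len);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;
                }
                p += n;
                len -= (size_t)n;
        }
        return 0;
}

/* Read exactly len bytes; an early EOF counts as an error. */
static int read_all(int fd, void *buf, size_t len)
{
        char *p = buf;

        while (len > 0) {
                ssize_t n = read(fd, p, len);
                if (n < 0 && errno == EINTR)
                        continue;
                if (n <= 0)
                        return -1;
                p += n;
                len -= (size_t)n;
        }
        return 0;
}

/* Fill a zero-padded 32-byte field; fail rather than truncate silently. */
static int pack_field(char *out, const char *s)
{
        size_t n = strlen(s);

        if (n >= IMG_NAME_LEN)
                return -1;
        memset(out, 0, IMG_NAME_LEN);
        memcpy(out, s, n);
        return 0;
}

/* Steps 2 and 3: send the image pathname and namespace. */
int img_send_header(int sk, const char *name, const char *ns)
{
        char buf[2 * IMG_NAME_LEN];

        if (pack_field(buf, name) || pack_field(buf + IMG_NAME_LEN, ns))
                return -1;
        return write_all(sk, buf, sizeof(buf));
}

/* Steps 4 and 5 (or the peer's side of 2 and 3): receive both fields. */
int img_recv_header(int sk, char *name, char *ns)
{
        if (read_all(sk, name, IMG_NAME_LEN) || read_all(sk, ns, IMG_NAME_LEN))
                return -1;
        /* Both fields must be NUL-terminated before they are used as strings. */
        if (name[IMG_NAME_LEN - 1] != '\0' || ns[IMG_NAME_LEN - 1] != '\0')
                return -1;
        return 0;
}
```

Note how any name of 32 characters or more has to be refused; this is the
limitation a size field or a protobuf header would remove.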

> The 32 bytes is a constant. It is a limitation on the size of the names and namespaces. It 
> could be solved by adding a first field (size).
> 
> Steps 4 and 5 are used to check if the image exists (if it doesn't, an error is read from the 
> socket).
> 
> Then, I simply return the socket file descriptor and let criu use it for writing (dump) or 
> reading (restore).
> 
> Note that I do not unpack the objects being sent (pagemap entries for example). Nor do I
> check whether the file was sent correctly. When the socket FD is closed I assume that the
> operation (reading or writing) is complete and I let the other side (restore for example)
> check whether the image is correct.

I see one problem with it. The other side cannot always verify the image correctness. E.g.
if a single object is lost from an image, this can only be found out at the actual restore
time, and not always even then. E.g. fdtable.img contains file descriptors. If one is lost
from there, a task will be restored w/o one descriptor and criu has no clues to check this.

We've seen this with the page-server, so to finalize the transfer we use a special command
(PS_IOV_FLUSH) that requires a response from the server side indicating that everything is OK.
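
Something along these lines could work for images too. This is a sketch of the idea
only, not the page-server code: IMG_CMD_FLUSH, the message layout and the object
counter are all assumptions made up for this example. The sender announces how many
objects it wrote, and the receiver answers OK only if it saw the same number.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/types.h>

#define IMG_CMD_FLUSH 0x1u      /* hypothetical code, not PS_IOV_FLUSH's value */

/* Hypothetical flush message; both fields travel in network byte order. */
struct img_flush_msg {
        uint32_t cmd;
        uint32_t nr_objects;
};

/* Sender, part 1: announce how many objects were written. */
int img_send_flush(int sk, uint32_t nr_sent)
{
        struct img_flush_msg m = { htonl(IMG_CMD_FLUSH), htonl(nr_sent) };

        return send(sk, &m, sizeof(m), 0) == (ssize_t)sizeof(m) ? 0 : -1;
}

/* Sender, part 2: block until the receiver reports its verdict. */
int img_wait_ack(int sk)
{
        uint32_t status;

        if (recv(sk, &status, sizeof(status), MSG_WAITALL) != (ssize_t)sizeof(status))
                return -1;
        return ntohl(status) == 0 ? 0 : -1;
}

/* Receiver: compare the announced count with what actually arrived. */
int img_handle_flush(int sk, uint32_t nr_received)
{
        struct img_flush_msg m;
        uint32_t ok, status;

        if (recv(sk, &m, sizeof(m), MSG_WAITALL) != (ssize_t)sizeof(m))
                return -1;
        if (ntohl(m.cmd) != IMG_CMD_FLUSH)
                return -1;

        ok = (ntohl(m.nr_objects) == nr_received);
        status = htonl(ok ? 0 : 1);     /* 0 means everything arrived */
        if (send(sk, &status, sizeof(status), 0) != (ssize_t)sizeof(status))
                return -1;
        return ok ? 0 : -1;
}
```

The point is that closing the socket no longer means "done"; the sender treats the
transfer as complete only after img_wait_ack() returns 0.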

>>
>> Another thing about this protocol -- in the page-server protocol we use hand-made binary
>> headers (packed structs) that precede individual extents (sets of pages). Now I think
>> it was a mistake to use a fully hand-made protocol. Why not stick with a more standard
>> one, or at least use protobuf-compiled headers? This would help us to at least solve
>> the extensibility problem.
> 
> I agree that hand-made protocols are usually a bad idea compared to using standard 
> approaches. Including the name and namespace of the images inside the actual
> content of the images would be enough to avoid this hand-made protocol. Then I would
> only need to read the header.

OK, the packet sequence can be done by hand, but let's encode the headers you use into
protobuf rather than plain strings.
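
One practical detail: a serialized protobuf message does not carry its own length,
so the header needs a size prefix on the wire. Below is a minimal sketch of such
framing. It treats the header as opaque bytes (the protobuf packing itself is left
out), and IMG_HDR_MAX plus the function names are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/types.h>

#define IMG_HDR_MAX 4096u       /* hypothetical sanity cap on the header size */

/* Send a 4-byte big-endian length followed by the encoded header bytes. */
int img_send_framed(int sk, const void *payload, uint32_t len)
{
        uint32_t be_len = htonl(len);

        if (len > IMG_HDR_MAX)
                return -1;
        if (send(sk, &be_len, sizeof(be_len), 0) != (ssize_t)sizeof(be_len))
                return -1;
        if (len && send(sk, payload, len, 0) != (ssize_t)len)
                return -1;
        return 0;
}

/* Receive one framed header into buf (capacity cap); store its size in *len. */
int img_recv_framed(int sk, void *buf, uint32_t cap, uint32_t *len)
{
        uint32_t be_len, n;

        if (recv(sk, &be_len, sizeof(be_len), MSG_WAITALL) != (ssize_t)sizeof(be_len))
                return -1;

        n = ntohl(be_len);
        /* Never trust the peer's length beyond our own limits. */
        if (n > cap || n > IMG_HDR_MAX)
                return -1;
        if (n && recv(sk, buf, n, MSG_WAITALL) != (ssize_t)n)
                return -1;

        *len = n;
        return 0;
}
```

The payload passed to img_send_framed() would be the packed protobuf header, and
the raw image data would follow it on the same socket as before.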

>>>> We already have network transfer for pages data. How does this correlate with
>>>> the new mode you introduce?
>>>
>>> Currently it doesn't correlate at all. I didn't use the page server because it 
>>> only works with pages (as far as I know). Extending it to support all types of
>>> images seemed more difficult than extending the disk-backed images.
>>>
>>> Most of the code that bridges the existing code with the new code (basically the 
>>> functions in image-remote.h) is inserted inside if conditions. In order to improve
>>> this, perhaps it would be good to abstract the image backend: files, sockets.
>>
>> OK, so as I said above -- I'm OK to deprecate the page-server in favor of the newer
>> image proxy and cache.
> 
> Since this is more generic, in future, if you need/want, it will be easier to include 
> application files migration (to avoid NFS shares for example).

Yup.

-- Pavel