[CRIU] Remote lazy-restore design discussion

Pavel Emelyanov xemul at virtuozzo.com
Mon Apr 11 03:15:52 PDT 2016


On 04/11/2016 09:12 AM, Mike Rapoport wrote:
> On Tue, Apr 05, 2016 at 07:04:45PM +0300, Pavel Emelyanov wrote:
>> On 04/04/2016 07:20 PM, Mike Rapoport wrote:
>>> On Mon, Apr 04, 2016 at 04:06:50PM +0300, Pavel Emelyanov wrote:
>>>> On 03/31/2016 05:25 PM, Adrian Reber wrote:
>>>>
>>>> I would also add that process 3 should not only listen for page requests, but
>>>> also send other pages in the background. Probably the ideal process 3 should
>>>>
>>>> 1. Have a queue of pages to be sent (struct page_server_iov-s)
>>>> 2. Fill it with pages that were not transferred (ANON|PRIVATE)
>>>> 3. Start sending them one by one
>>>> 4. Receive page requests that can move some items to the top of the
>>>>    queue (i.e. -- the pages that are needed right now; see the sketch below)
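
To make point 4 above concrete, here is a minimal sketch of such a queue. It
assumes CRIU's list.h helpers and a wrapper around struct page_server_iov; the
send_item/next_to_send/promote names are illustrative, not existing code:

	/* One queued pagemap chunk waiting to be pushed to the restore side */
	struct send_item {
		struct list_head	l;
		struct page_server_iov	iov;	/* what to send: vaddr, nr_pages, ... */
	};

	static LIST_HEAD(send_queue);

	/* Background path: always transmit from the head of the queue */
	static struct send_item *next_to_send(void)
	{
		if (list_empty(&send_queue))
			return NULL;
		return list_first_entry(&send_queue, struct send_item, l);
	}

	/* Request path: an urgently needed page jumps to the head */
	static void promote(struct send_item *si)
	{
		list_move(&si->l, &send_queue);
	}
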
>>>
>>> Well, I actually thought more of a "pull" than a "push" approach. The pages
>>> are anyway collected into pagemap.img and it may be shared between the source
>>> and destination. Then page-read on the restore (destination) side works
>>> almost as it does now, just instead of read(fd, ...) it does recv(sock, ...).
>>> I have some ugly POC (below) that kinda demonstrates the idea.
>>>
>>> If I understand your idea correctly, the dump side requires the addition of
>>> a background process that will handle random page requests.
>>
>> Well, yes. Think of it this way -- the restore side daemon will have to
>> poll many uffds and handle incoming data from the socket. So it will be
>> a state machine that keeps track of pids, associated uffds and the reasons
>> for which they are hung. So adding background traffic there shouldn't be
>> a big deal.
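
A sketch of what that state machine might track per task; the lazy_data name
and its fields here are illustrative, matching the description rather than
existing criu-dev code:

	/* Everything the restore-side daemon needs to know about one task */
	struct lazy_data {
		int			pid;		/* the restored task */
		int			uffd;		/* its userfaultfd */
		struct list_head	pagemaps;	/* lazy pagemaps not yet filled */
		unsigned long		fault_addr;	/* where the task is hung, if it is */
	};

The daemon then poll()-s all the uffd fields plus the dump-side socket and
keeps these entries in a hash by pid.
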
> 
> What do you mean by "background" traffic on the restore side? 

The pages that arrive at the restore daemon just because the dump side has sent them.

> The pages that are actually faulted or the rest of the memory?

The rest of the memory.

> I think that the transfer of the faulted pages should be blocking.

Blocking from the faulting process's perspective, yes. But the uffd daemon itself
should not block anywhere. Consider the case when it was woken up by a #PF from one
task and then blocked waiting for memory from the dump side. All this time the
other processes would be stuck in #PF-s even if their memory had already arrived.

> [ snip ] 
>  
>> Well, how about this:
>>
>> I. Dump side.
>>
>> The criu dump process dumps everything but the lazy pagemaps; the lazy
>> pagemaps are skipped and queued.
> 
> By pagemaps you mean the content of pagemap images + actual pages? 

Yes.

> Isn't pagemap image required on the restore side anyway?

Yes, the pagemap images should always be at hand.

>> Then criu dump spawns a daemon that opens a connection to the remote host,
>> creates a page_server_xfer, takes the queue of pagemaps that are to be sent
>> to it and starts polling the xfer socket.
>>
>> When the socket is available for read, the daemon gets a request for a
>> particular pagemap and pops that one to the top of the queue.
> 
> How many pages are you planning to send upon request? The entire sequence
> that contains the faulted page, just the page that was requested, or, say,
> the page that was requested plus two pages before and two pages after?

That's a good question and it deserves a separate long discussion :) I would start
with requesting a single page.
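
For a single page, the out-of-order request is just one page_server_iov-sized
message. A sketch, where PS_IOV_GET is a hypothetical new command and
pid_to_dst_id() a hypothetical helper mapping the faulting task to its pagemap id:

	struct page_server_iov req = {
		.cmd		= PS_IOV_GET,			/* hypothetical: "send me this page now" */
		.nr_pages	= 1,				/* start small, as discussed above */
		.vaddr		= fault_addr & ~(PAGE_SIZE - 1),/* page-align the faulting address */
		.dst_id		= pid_to_dst_id(pid),
	};

	if (write(sk, &req, sizeof(req)) != sizeof(req))
		pr_perror("Can't send page request");
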

>> When the socket is available for write, the daemon takes the next pagemap
>> from the queue and ->writepage-s it to the page_xfer.
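
Put together, the dump-side daemon body could be as simple as the skeleton
below, reusing the send_queue/promote/next_to_send sketch from earlier;
read_request() and send_one() are hypothetical wrappers around the recv and
->writepage parts:

	static void dump_side_loop(int sk)
	{
		struct pollfd pfd = { .fd = sk, .events = POLLIN | POLLOUT };

		while (!list_empty(&send_queue)) {
			if (poll(&pfd, 1, -1) < 0)
				break;
			if (pfd.revents & POLLIN)
				promote(read_request(sk));	/* out-of-order request */
			if (pfd.revents & POLLOUT)
				send_one(sk, next_to_send());	/* background page; dequeues it */
		}
	}
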
>>
>> II. Restore side
>>
>> The uffd daemon is spawned; it opens a port (to which the dump side will
>> connect, or uses the connection provided by opts.ps_socket -- the
>> connect_to_page_server() knows about this), creates a hash of (pid, uffd,
>> pagemaps) structures (called lazy_data below) and listens.
>>
>> Restore prepares processes and mappings (Adrian's code already does this), sending
>> uffd-s to the uffd daemon (already there).
>>
>> The uffd daemon starts polling all uffds it has and the connection from the
>> dump side.
>>
>> When a uffd is available for read, the daemon gets the #PF info, then goes to the
>> new page_read that sends the page_server_iov request for the out-of-order page
>> (note that in the case of lazy restore from images the regular page_read is used).
>>
>> Most of this code is already in criu-dev from Adrian and you, but we need to
>> add multi-uffd polling, the lazy_data thing and the ability to handle a "page
>> will be available later" response from the page_read.
>>
>> When the dump-side connection is available for reading, the daemon calls the
>> core part of the page_server_serve() routine that reads from the socket and
>> handles PS_IOV_FOO commands. The page_xfer used in _this_ case is the one that
>> finds the appropriate lazy_data and calls the map + wakeup ioctls.
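
The "map + wakeup" part maps onto the kernel userfaultfd API directly. A sketch,
assuming "ld" is the lazy_data entry found for this request and page_buf holds
the page just received from the socket:

	#include <linux/userfaultfd.h>

	struct uffdio_copy uc = {
		.dst	= ld->fault_addr,		/* where the task faulted */
		.src	= (unsigned long)page_buf,	/* the received page data */
		.len	= page_size,
		.mode	= 0,				/* 0 == wake the task right away */
	};

	if (ioctl(ld->uffd, UFFDIO_COPY, &uc) < 0)
		pr_perror("Can't copy page into task");

With UFFDIO_COPY_MODE_DONTWAKE in .mode the copy and the wakeup can be split,
and the wakeup is then done separately with the UFFDIO_WAKE ioctl on a struct
uffdio_range, which matches the "map + wakeup" description above.
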
> 
> If I understand correctly, you suggest something like this:
> 
> for (;;) {
> 	poll(uffds_sockets, 5_secs_timeout);

poll(uffd sockets and the connection to the dump side);

> 	if (timeout)
> 		break;
> 	if (uffd_read_available(uffds)) {
> 		page_data = get_pf_data(uffd);
> 		send_page_request(socket, page_data, O_NONBLOCK);

This O_NONBLOCK can play a bad joke on us :( We'll have to poll the connection
to the dump side for _writing_ and only then send the request.

> 	}
> 	if (socket_read_available(sockets)) {
> 	data = read_from_socket(socket);
> 		uffd_copy(uffd, data);
> 		uffd_wake(uffd);

Yes, something like this.

> 	}
> }
>   
> I thought rather of having blocking page faults and "in order" request-response...

Well, I'm strongly against blocking anywhere. We have multiple processes faulting
into uffds and a single process serving the faults. Blocking the latter guy anywhere
means blocking all the others.
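
Putting the corrections above together, the restore-side loop might look like
the sketch below. All the helpers here are hypothetical; the one important
detail is that POLLOUT on the dump-side socket is requested only while there
are pending page requests, so the daemon never blocks in a send():

	static void uffd_daemon_loop(struct pollfd *pfds, int nr, int dump_sk)
	{
		int i;

		for (;;) {
			if (poll(pfds, nr, -1) < 0)
				break;

			for (i = 0; i < nr; i++)
				if (i != dump_sk && (pfds[i].revents & POLLIN))
					queue_page_request(pfds[i].fd);	/* read #PF, queue a request */

			if (pfds[dump_sk].revents & POLLIN)
				handle_incoming_page();		/* uffd_copy + wake whoever waits */
			if (pfds[dump_sk].revents & POLLOUT)
				send_next_request();		/* drain our request queue */

			/* Ask for POLLOUT only while we have something to say */
			pfds[dump_sk].events = POLLIN |
				(have_pending_requests() ? POLLOUT : 0);
		}
	}
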

>> This part is not ready and this is what I meant when I was talking about
>> re-using the page-server code with a new page_xfer and page_read.
>>
>> Does this make sense?
>>
>> -- Pavel
>>
> 
> --
> Sincerely yours,
> Mike.


