[CRIU] Remote lazy-restore design discussion

Sun Apr 10 23:12:06 PDT 2016

On Tue, Apr 05, 2016 at 07:04:45PM +0300, Pavel Emelyanov wrote:
> On 04/04/2016 07:20 PM, Mike Rapoport wrote:
> > On Mon, Apr 04, 2016 at 04:06:50PM +0300, Pavel Emelyanov wrote:
> >> On 03/31/2016 05:25 PM, Adrian Reber wrote:
> >>
> >> I would also add that process 3 should not only listen for page requests, but
> >> also send other pages in the background. Probably the ideal process 3 should
> >>
> >> 1. Have a queue of pages to be sent (struct page_server_iov-s)
> >> 2. Fill it with pages that were not transferred (ANON|PRIVATE)
> >> 3. Start sending them one by one
> >> 4. Receive messages from the process #3 that can move some items from
> >>    the queue on top (i.e. -- the pages that are needed right now)
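Steps 1-4 above amount to a FIFO of page runs with a "move to front" operation. What follows is a minimal sketch. It assumes a hypothetical `struct lazy_iov` (a start address plus a page count) standing in for a queued page_server_iov. The `iovq_*` names are illustrative and are not CRIU functions:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096UL

/* Hypothetical stand-in for a queued page_server_iov: a run of pages
 * starting at vaddr that still has to be sent to the restore side. */
struct lazy_iov {
	uint64_t vaddr;
	unsigned long nr_pages;
	struct lazy_iov *prev, *next;
};

struct iov_queue {
	struct lazy_iov *head, *tail;
};

static void iovq_unlink(struct iov_queue *q, struct lazy_iov *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		q->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		q->tail = e->prev;
	e->prev = e->next = NULL;
}

/* Step 2: fill the queue, e.g. in address order. */
static void iovq_push_tail(struct iov_queue *q, struct lazy_iov *e)
{
	e->next = NULL;
	e->prev = q->tail;
	if (q->tail)
		q->tail->next = e;
	else
		q->head = e;
	q->tail = e;
}

/* Step 3: the background sender takes entries from the head. */
static struct lazy_iov *iovq_pop_head(struct iov_queue *q)
{
	struct lazy_iov *e = q->head;

	if (e)
		iovq_unlink(q, e);
	return e;
}

/* Step 4: a request for addr moves the covering entry to the head,
 * so the page needed right now is sent next. Returns 0 if found. */
static int iovq_move_to_front(struct iov_queue *q, uint64_t addr)
{
	struct lazy_iov *e;

	for (e = q->head; e; e = e->next) {
		if (addr >= e->vaddr && addr < e->vaddr + e->nr_pages * PAGE_SZ) {
			iovq_unlink(q, e);
			e->next = q->head;
			if (q->head)
				q->head->prev = e;
			else
				q->tail = e;
			q->head = e;
			return 0;
		}
	}
	return -1;
}
```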
> > 
> > Well, I actually thought more about "pull" than "push" approach. The pages
> > are anyway collected into pagemap.img and it may be shared between source
> > and destination. Then page-read on the restore (destination) side works
> > almost as now, just instead of read(fd, ...) it does recv(sock, ...).
> > I have some ugly POC (below) that kinda demonstrates the idea.
> > 
> > If I understand your idea correctly, the dump side requires the addition of
> > a background process that will handle random page requests.
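The pull model described above can be sketched as a blocking helper that sends a request and then waits for the page data. The request layout (`struct page_req`) and the helper names are hypothetical and are not the page server's real protocol. The main point is the loop around recv(), because recv() may return less than a full page:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define PAGE_SZ 4096UL

/* Hypothetical wire format for a page request; not CRIU's
 * actual page_server_iov layout. */
struct page_req {
	uint64_t vaddr;
	uint32_t nr_pages;
	uint32_t pad;
};

/* recv() may return fewer bytes than asked for, so loop. */
static int recv_all(int sk, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t n = recv(sk, p, len, 0);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;	/* error or peer closed */
		p += n;
		len -= n;
	}
	return 0;
}

static int send_all(int sk, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t n = send(sk, p, len, 0);

		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			return -1;
		p += n;
		len -= n;
	}
	return 0;
}

/* Blocking "pull": ask for nr_pages at vaddr and wait for the data.
 * Where the local page-read does read(fd, ...), this does a send
 * followed by a recv on the socket. */
static int pull_pages(int sk, uint64_t vaddr, uint32_t nr_pages, void *buf)
{
	struct page_req req = { .vaddr = vaddr, .nr_pages = nr_pages };

	if (send_all(sk, &req, sizeof(req)))
		return -1;
	return recv_all(sk, buf, nr_pages * PAGE_SZ);
}
```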
> 
> Well, yes. Think of it this way -- the restore side daemon will have to
> poll for many uffds and handle incoming data from socket. So it will be
> a state machine that keeps track of pids, associated uffds and the reasons
> for which they are hung. So adding background traffic there shouldn't be
> a big deal.

What do you mean by "background" traffic on the restore side? The pages
that are actually faulted or the rest of the memory?
I think that the transfer of the faulted pages should be blocking.

[ snip ] 

> Well, how about this:
> 
> I. Dump side.
> 
> The criu dump process dumps everything but the lazy pagemaps; the lazy
> pagemaps are skipped and queued.

By pagemaps you mean the content of pagemap images + actual pages? Isn't
pagemap image required on the restore side anyway?

> Then criu dump spawns a daemon that opens a connection to the remote host,
> creates page_server_xfer, takes the queue of pagemap-s that are to be sent
> to it and starts polling the xfer socket.
> 
> When available for read, it gets a request for a particular pagemap and pops
> it up in the queue.

How many pages are you planning to send upon request? The entire sequence
that contains the faulted page, just the page that was requested, or, say,
the page that was requested plus two pages before and two pages after?
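Whichever policy is chosen, the window is easy to compute. This sketch covers the "faulting page plus N pages on each side" option, clamped to the bounds of the enclosing pagemap entry. `fault_window()` and the fixed 4 KiB page size are assumptions:

```c
#include <stdint.h>

#define PAGE_SZ 4096ULL

/* Pages to request for a fault at addr inside [lo, hi): the faulting
 * page plus up to `around` pages on each side, clamped to the range.
 * lo and hi are assumed page-aligned. Returns the first page address
 * and stores the page count in *nr. */
static uint64_t fault_window(uint64_t addr, uint64_t lo, uint64_t hi,
			     unsigned around, unsigned *nr)
{
	uint64_t page = addr & ~(PAGE_SZ - 1);
	uint64_t span = (uint64_t)around * PAGE_SZ;
	uint64_t start = page - lo >= span ? page - span : lo;
	uint64_t end = hi - page > span ? page + span + PAGE_SZ : hi;

	*nr = (unsigned)((end - start) / PAGE_SZ);
	return start;
}
```

With `around = 0` this degenerates to "just the page that was requested".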

> When available for write, it gets the next pagemap from the queue and
> ->writepage-s it to the page_xfer.
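One practical detail for such a daemon is that a connected socket is nearly always writable. Polling for POLLOUT while the queue is empty therefore turns poll() into a busy loop. Here is a minimal sketch with hypothetical helper names that asks for POLLOUT only while pagemaps remain queued:

```c
#include <poll.h>
#include <stdbool.h>

/* The xfer socket is almost always writable, so asking for POLLOUT with
 * an empty queue would make poll() return immediately every time.
 * Only ask for it while there are pagemaps left to push. */
static short xfer_events(bool queue_empty)
{
	return POLLIN | (queue_empty ? 0 : POLLOUT);
}

/* One wait on the xfer socket. Returns the revents mask, 0 on timeout,
 * or -1 on error. */
static int xfer_wait(int sk, bool queue_empty, int timeout_ms)
{
	struct pollfd pfd = { .fd = sk, .events = xfer_events(queue_empty) };
	int ret = poll(&pfd, 1, timeout_ms);

	return ret <= 0 ? ret : pfd.revents;
}
```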
> 
> II. Restore side
> 
> The uffd daemon is spawned. It opens a port to which dump will connect (or
> uses the connection provided via opts.ps_socket; connect_to_page_server()
> knows this), creates a hash of (pid, uffd, pagemaps) structures (called
> lazy_data below) and listens.
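Below is a minimal sketch of such a lazy_data hash. It is keyed by the uffd, because that is the fd poll() reports as ready. The struct fields and the `lazy_add()`/`lazy_find()` helpers are hypothetical, and "pagemaps" is left opaque:

```c
#include <stddef.h>
#include <sys/types.h>

#define LAZY_HASH_SIZE 64	/* must be a power of two */

/* Hypothetical per-task record; "pagemaps" is left opaque here. */
struct lazy_data {
	pid_t pid;
	int uffd;
	void *pagemaps;
	struct lazy_data *next;	/* hash chain */
};

static struct lazy_data *lazy_hash[LAZY_HASH_SIZE];

/* Keyed by uffd: poll() reports the ready fd, so that is what
 * the daemon has in hand when a fault arrives. */
static unsigned lazy_bucket(int uffd)
{
	return (unsigned)uffd & (LAZY_HASH_SIZE - 1);
}

static void lazy_add(struct lazy_data *ld)
{
	unsigned b = lazy_bucket(ld->uffd);

	ld->next = lazy_hash[b];
	lazy_hash[b] = ld;
}

static struct lazy_data *lazy_find(int uffd)
{
	struct lazy_data *ld;

	for (ld = lazy_hash[lazy_bucket(uffd)]; ld; ld = ld->next)
		if (ld->uffd == uffd)
			return ld;
	return NULL;
}
```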
> 
> Restore prepares processes and mappings (Adrian's code already does this), sending
> uffd-s to the uffd daemon (already there).
> 
> The uffd daemon starts polling all uffds it has and the connection from the
> dump side.
> 
> When a uffd is available for read, it gets the #PF info, then goes to the new
> page_read that sends the page_server_iov request for the out-of-order page (note
> that in the case of lazy restore from images the regular page_read is used).
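A uffd event can be turned into a page request roughly as sketched below. `struct uffd_msg`, `UFFD_EVENT_PAGEFAULT` and `arg.pagefault.address` come from `<linux/userfaultfd.h>`. `struct page_req` and both helpers are hypothetical:

```c
#include <stdint.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical request layout; CRIU's own page_server_iov differs. */
struct page_req {
	uint64_t vaddr;
	uint32_t nr_pages;
	uint32_t pad;
};

/* Turn one uffd message into a request for the faulting page.
 * Returns 0 for a page fault, -1 for any other event. */
static int fault_to_req(const struct uffd_msg *msg, uint64_t page_size,
			struct page_req *req)
{
	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return -1;
	/* Round down to the page start in case an exact address is reported. */
	req->vaddr = msg->arg.pagefault.address & ~(page_size - 1);
	req->nr_pages = 1;
	req->pad = 0;
	return 0;
}

/* Read one message from a uffd that poll() reported as readable. */
static int read_fault(int uffd, uint64_t page_size, struct page_req *req)
{
	struct uffd_msg msg;

	if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;
	return fault_to_req(&msg, page_size, req);
}
```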
> 
> Most of this code is already in criu-dev from Adrian and you, but we need to
> add multi-uffd polling and lazy_data thing and the ability to handle "page
> will be available later" response from the page_read.
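The "page will be available later" case needs somewhere to park the fault until the data arrives. This sketch assumes a hypothetical three-way page_read result and a small fixed-size pending table keyed by (pid, address). None of these names exist in criu-dev:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical result of the remote page_read. */
enum pr_result {
	PR_ERR = -1,
	PR_PAGE_READY,	/* data is in the buffer, copy it now */
	PR_PAGE_LATER,	/* request sent, data will arrive on the socket */
};

#define MAX_PENDING 128

struct pending_fault {
	pid_t pid;
	int uffd;
	uint64_t vaddr;	/* page-aligned */
};

static struct pending_fault pending[MAX_PENDING];
static size_t nr_pending;

/* Remember a fault whose page has not arrived yet. */
static int park_fault(int uffd, pid_t pid, uint64_t vaddr)
{
	if (nr_pending == MAX_PENDING)
		return -1;
	pending[nr_pending].pid = pid;
	pending[nr_pending].uffd = uffd;
	pending[nr_pending].vaddr = vaddr;
	nr_pending++;
	return 0;
}

/* A page for (pid, vaddr) came in from the dump side: return the uffd
 * waiting for it and forget the entry, or -1 if nobody faulted on it
 * (e.g. background traffic). */
static int claim_fault(pid_t pid, uint64_t vaddr)
{
	for (size_t i = 0; i < nr_pending; i++) {
		if (pending[i].pid == pid && pending[i].vaddr == vaddr) {
			int uffd = pending[i].uffd;

			pending[i] = pending[--nr_pending];
			return uffd;
		}
	}
	return -1;
}

/* What the fault handler does with each page_read outcome. */
static int handle_pr_result(enum pr_result r, int uffd, pid_t pid,
			    uint64_t vaddr)
{
	switch (r) {
	case PR_PAGE_READY:
		return 0;	/* caller copies and wakes immediately */
	case PR_PAGE_LATER:
		return park_fault(uffd, pid, vaddr);
	default:
		return -1;
	}
}
```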
> 
> When the dump side connection is available for reading, it calls the core part of
> the page_server_serve() routine that reads from socket and handles PS_IOV_FOO
> commands. The page_xfer used in _this_ case is the one that finds the appropriate
> lazy_data and calls map + wakeup ioctls.
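The map + wakeup step corresponds to the UFFDIO_COPY and UFFDIO_WAKE ioctls. This is a minimal sketch, and `make_copy()` and `map_and_wake()` are hypothetical helpers. It uses UFFDIO_COPY_MODE_DONTWAKE followed by an explicit UFFDIO_WAKE, which is only one option. Passing mode 0 instead would wake the task as part of the copy:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Fill the UFFDIO_COPY argument. DONTWAKE defers the wakeup so that
 * several ranges can be copied before the task is woken once. */
static struct uffdio_copy make_copy(uint64_t dst, const void *src,
				    uint64_t len)
{
	struct uffdio_copy uc = {
		.dst = dst,
		.src = (uint64_t)(uintptr_t)src,
		.len = len,
		.mode = UFFDIO_COPY_MODE_DONTWAKE,
	};
	return uc;
}

/* Map the received page(s) into the task, then wake it up. */
static int map_and_wake(int uffd, uint64_t dst, const void *src, uint64_t len)
{
	struct uffdio_copy uc = make_copy(dst, src, len);
	struct uffdio_range r = { .start = dst, .len = len };

	if (ioctl(uffd, UFFDIO_COPY, &uc))
		return -1;	/* uc.copy holds bytes copied or -errno */
	return ioctl(uffd, UFFDIO_WAKE, &r);
}
```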

If I understand correctly, you suggest something like this:

for (;;) {
	/* wait on all uffds and the dump-side socket, 5 sec timeout */
	if (poll(uffds_sockets, nr_fds, 5000) == 0)	/* timed out */
		break;
	if (uffd_read_available(uffds)) {
		page_data = get_pf_data(uffd);
		send_page_request(socket, page_data, O_NONBLOCK);
	}
	if (socket_read_available(sockets)) {
		data = read_from_socket(socket);
		uffd_copy(uffd, data);
		uffd_wake(uffd);
	}
}

I was rather thinking of blocking page faults and an "in order" request-response...

> This part is not ready, and this is what I meant when I was talking about re-using
> page-server code with new page_xfer and page_read.
> 
> Does this make sense?
> 
> -- Pavel
> 

--
Sincerely yours,
Mike.