[CRIU] Remote lazy-restore design discussion

Pavel Emelyanov xemul at virtuozzo.com
Fri Apr 8 09:41:18 PDT 2016


On 04/08/2016 05:51 PM, Adrian Reber wrote:
> On Thu, Apr 07, 2016 at 03:37:30PM +0300, Pavel Emelyanov wrote:
>> On 04/06/2016 10:38 AM, Adrian Reber wrote:
>>> On Tue, Apr 05, 2016 at 07:04:45PM +0300, Pavel Emelyanov wrote:
>>>> Well, how about this:
>>>>
>>>> I. Dump side.
>>>>
>>>> The criu dump process dumps everything but the lazy pagemaps; the lazy
>>>> pagemaps are skipped and queued.
>>>
>>> Agreed.
>>>
>>>> Then criu dump spawns a daemon that opens a connection to the remote host,
>>>> creates a page_server_xfer, takes the queue of pagemaps that are to be sent
>>>> to it, and starts polling the xfer socket.
>>>
>>> Not sure I understand this yet. There is now a daemon running which has
>>> access to the memory of the dumped process, which is still in its
>>> original place in the dumped process's address space.
>>
>> Yes.
>>
>>> I see this as important: it ensures the pages of the dumped process are
>>> copied as seldom as possible.
>>
>> Absolutely agree.
>>
>>> Why is the daemon connecting to the restore process and not the other
>>> way around?
>>
>> Well, this is what dump --page-server already does -- the dump side connects
>> to the restore side. So I thought that doing the symmetrical thing for lazy
>> pages would make sense.
>>
>>> From what I have done so far it seems more logical the other way around
>>> than you described.
>>>
>>>  * First the process is dumped (without the lazy pages).
>>
>> And what about lazy pages? Where are they? In the dump-side images
>> or in memory?
> 
> I would hope (I have not checked whether it is really possible) that the
> lazy pages can stay in the process that is being dumped.

OK.
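
If the lazy pages stay in the (seized, frozen) task, the dump-side daemon
can serve them on demand straight out of the task's address space, e.g.
via /proc/<pid>/mem, so nothing is copied until a page is actually asked
for. A minimal sketch of such a read path -- the helper name and error
handling are illustrative, not CRIU code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative helper: read one page of the frozen task's memory
 * at vaddr. Requires ptrace-level access to the task, which the
 * dump side already has after seizing it. */
static int read_remote_page(int pid, unsigned long vaddr,
			    void *buf, size_t page_size)
{
	char path[64];
	int fd;
	ssize_t ret;

	snprintf(path, sizeof(path), "/proc/%d/mem", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = pread(fd, buf, page_size, (off_t)vaddr);
	close(fd);

	return ret == (ssize_t)page_size ? 0 : -1;
}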

>>>  * Second the dumped information is transferred (scp/rsync) to the
>>>    destination.
>>
>> Note that _some_ memory contents will be sent to the page server using
>> the dump-connects-to-restore method.
> 
> Not sure I understand this.

I meant that while doing the final dump some pages' contents will be sent
to the destination node via the page server, and thus, at restore time, they
will already be present there as on-disk (or on-tmpfs) images.
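
In other words, the restore-side page_read would look in the locally
present images first and only ask the dump side for the truly lazy ones.
Roughly like this -- all helper names below are made up for illustration,
this is not the actual page_read API:

/* Illustrative decision only -- the helpers are hypothetical: */
struct lazy_data;	/* (pid, uffd, pagemaps) state, see below */

extern int page_in_local_images(struct lazy_data *ld, unsigned long vaddr);
extern int read_page_from_images(struct lazy_data *ld, unsigned long vaddr,
				 void *buf);
extern int request_remote_page(struct lazy_data *ld, unsigned long vaddr);

static int get_page(struct lazy_data *ld, unsigned long vaddr, void *buf)
{
	/* pages pushed during dump are already on disk/tmpfs */
	if (page_in_local_images(ld, vaddr))
		return read_page_from_images(ld, vaddr, buf);

	/* truly lazy page: ask the dump side, the contents will
	 * arrive later ("page will be available later") */
	return request_remote_page(ld, vaddr);
}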

>>>  * Third, on the destination host, the process is restored in lazy pages
>>>    mode and now the uffd page handler connects to the dump daemon on the
>>>    source host.
>>
>> Hm... OK.
>>
>>>  * The process is now restored.
>>>
>>> From my current understanding it makes no sense that the dumping process
>>> connects to the uffd process on the destination system, as it is unknown
>>> when that process will be available.
>>
>> Well, OK, from this perspective it may be useful to do restore-connects-to-dump.
>> But this shouldn't affect the described model, since it mostly describes what
>> happens once the sides are interconnected.
> 
> That's true.
> 
>>>> When the socket is available for read, it gets a request for a particular
>>>> pagemap and bumps that one up in the queue.
>>>>
>>>> When it is available for write, it takes the next pagemap from the queue
>>>> and ->writepage's it to the page_xfer.
>>>>
>>>> II. Restore side
>>>>
>>>> The uffd daemon is spawned; it opens a port (to which the dump side will
>>>> connect, or it uses the connection provided via opts.ps_socket --
>>>> connect_to_page_server() knows about this), creates a hash of
>>>> (pid, uffd, pagemaps) structures (called lazy_data below), and listens.
>>>>
>>>> Restore prepares the processes and mappings (Adrian's code already does
>>>> this), sending the uffd-s to the uffd daemon (this is already there).
>>>>
>>>> The uffd daemon starts polling all uffds it has and the connection from the
>>>> dump side.
>>>>
>>>> When a uffd is available for read, it gets the #PF info, then goes to the
>>>> new page_read that sends the page_server_iov request for the out-of-order
>>>> page (note that in the case of lazy restore from images the regular
>>>> page_read is used).
>>>>
>>>> Most of this code is already in criu-dev from Adrian and you, but we need to
>>>> add multi-uffd polling, the lazy_data thing, and the ability to handle a "page
>>>> will be available later" response from the page_read.
>>>>
>>>> When the dump-side connection is available for reading, it calls the core
>>>> part of the page_server_serve() routine, which reads from the socket and
>>>> handles the PS_IOV_FOO commands. The page_xfer used in _this_ case is the
>>>> one that finds the appropriate lazy_data and calls the map + wakeup ioctls.
>>>>
>>>> This part is not ready, and this is what I meant when I was talking about
>>>> re-using the page-server code with a new page_xfer and page_read.
>>>>
>>>> Does this make sense?
>>>
>>> I am confused about which side connects to the other.
>>
>> OK :) Let's then try to resolve this issue.
>>
>> I don't have strong arguments for the dump->restore connection: if you
>> look at how p.haul works, it doesn't use this criu connect feature, it
>> passes in a pre-established descriptor, on both sides. So the question of
>> which side connects to which can be solved either way.
> 
> Yes, probably right. Which side connects to which is not really
> important. That was just the first point that seemed wrong given the
> order of steps I expected.
> 
> I thought some more about the designs currently available. I can either
> use my first implementation, which provides its own protocol for page
> exchange (uffd-struct-based), or the one based upon Mike's page server
> client (page-server-based).

I would unify the page-server and uffd protocols, at least in terms of the
messages they exchange.
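
For reference, the page-server messages are already a small fixed header
(roughly the struct below), so unifying would mostly mean extending the
command set rather than inventing a second wire format. The lazy-request
command at the end is an assumption of mine, not existing code:

#include <stdint.h>

/* Roughly the existing page-server wire header; PS_IOV_ADD is
 * followed by nr_pages worth of page data. */
struct page_server_iov {
	uint32_t cmd;		/* PS_IOV_* command */
	uint32_t nr_pages;	/* region length in pages */
	uint64_t vaddr;		/* virtual address of the region */
	uint64_t dst_id;	/* destination pagemap (task) id */
};

/* Hypothetical addition: the uffd side requests one region out of
 * order using the same header; the command value is an assumption. */
#define PS_IOV_GET_LAZY	16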

> In my first implementation (uffd-struct-based) the logic deciding which
> pages should be copied was running on the source system
> (uffd-remote-server). It reacted to requests and transferred the
> unrequested pages at the end. To know which pages needed to be
> transferred it had to parse the checkpoint directory.

Or keep the information obtained while doing the "dump" action. No?
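
I.e. the daemon forked off by dump could simply inherit the queue of lazy
regions it recorded while dumping, instead of re-parsing the pagemap
images. Something like this -- illustrative types, not CRIU's:

#include <stdint.h>
#include <sys/queue.h>	/* BSD list macros, shipped with glibc */

/* One lazy region recorded at dump time; the daemon keeps this
 * queue in memory instead of re-reading the image directory. */
struct lazy_iov {
	int pid;			/* owning task */
	uint64_t vaddr;			/* start of the region */
	uint32_t nr_pages;		/* region length in pages */
	TAILQ_ENTRY(lazy_iov) link;	/* queue linkage */
};

TAILQ_HEAD(lazy_queue, lazy_iov);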

> The uffd daemon on the destination side was just forwarding pages between
> the uffd and the network socket. It needed to know how to handle uffd
> requests, but it did not require any knowledge of the actual checkpoint
> or of which pages are available.
> 
> In the page-server-based remote restore, the page server on the destination
> host has to know how to get each page, and the uffd daemon on the
> destination side has to parse the checkpoint directory to know which pages
> are part of the restored process and which pages have not yet been
> transferred.
> 
> I am trying to say that in my original, not-page-server-related
> implementation the uffd daemon has no need to know the details about the
> pages, whereas in the page-server-based implementation all of the involved
> parts/daemons/page-servers need to know which pages need to be
> transferred.

Ah, so your question is who should decide which page to push into the socket
next -- uffd-side (restore-side) or the dump-side?
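
Whichever side decides the ordering, the destination daemon's final step
per page is the same: copy the received contents into the faulting task
and wake it. That is the standard userfaultfd ioctl sequence, condensed
below (error handling trimmed; this is not CRIU's actual code):

#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Map one received page into the faulting task and wake the
 * faulting thread. UFFDIO_COPY wakes the waiter itself unless
 * UFFDIO_COPY_MODE_DONTWAKE is set, so a separate UFFDIO_WAKE is
 * only needed for pages pushed ahead of any fault. */
static int uffd_map_page(int uffd, unsigned long dst_vaddr,
			 void *page, unsigned long page_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst  = dst_vaddr;
	copy.src  = (unsigned long)page;
	copy.len  = page_size;
	copy.mode = 0;	/* wake as part of the copy */

	return ioctl(uffd, UFFDIO_COPY, &copy);
}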

> So, I am not sure how important this argument is, but I just wanted to
> mention it for completeness: this aspect (that the uffd daemon does not
> need to read/access/parse the checkpoint directory) reduces the number of
> restore processes with access to the checkpoint directory from three to
> two.
> 
> 		Adrian
> 

-- Pavel

