[CRIU] Remote lazy-restore design discussion

Adrian Reber adrian at lisas.de
Thu Mar 31 07:25:42 PDT 2016


Hello Pavel,

after Mike asked if there have been any design discussions, and since I
am not 100% sure how the page-server fits into the remote restore, it
seems like a good idea to reach a common understanding of what the right
implementation for remote lazy-restore should look like.

I am using my implementation as a starting point for the discussion.

I think we need three different processes for remote lazy restore,
independent of how they are started. The 'destination system' is the
system the process should be migrated to and the 'source system' is the
system the original process was running on before the migration.

 1. The actual restore process (destination system):
     This is a 'normal' restore with the difference that memory pages
     (MAP_ANONYMOUS and MAP_PRIVATE) are not copied into place but are
     instead marked as being handled by userfaultfd. For that, a
     userfaultfd FD (UFFD) is opened and passed to a second process
     (see the first sketch after this list).

 2. The local lazy restore UFFD handler (destination system):
     This process listens on the UFFD for userfault requests and tries
     to handle them, either by reading the required pages from a local
     checkpoint (a rather unlikely use case) or by requesting them from
     the remote (source) system over the network (see the second sketch
     after this list).

 3. The remote lazy restore page request handler (source system):
     This process listens on a network port for page requests and reads
     the requested pages from a local checkpoint (or, even better,
     directly from a stopped process).
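
Roughly, what process 1 has to do per lazy VMA looks like this. This is
only a stripped-down sketch of the kernel API involved, not the actual
code from my patches; the socket path handling and the error handling
are simplified:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

static int open_and_register_uffd(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	int uffd;

	/* No glibc wrapper for userfaultfd(2), so raw syscall. */
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	if (uffd < 0)
		return -1;

	if (ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		close(uffd);
		return -1;
	}

	return uffd;
}

/* Hand the registered UFFD over to the handler (process 2). */
static int send_uffd_to_handler(const char *sk_path, int uffd)
{
	struct sockaddr_un sun = { .sun_family = AF_UNIX };
	char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
	char dummy = '*';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
	int sk;

	strncpy(sun.sun_path, sk_path, sizeof(sun.sun_path) - 1);

	sk = socket(AF_UNIX, SOCK_STREAM, 0);
	if (sk < 0 || connect(sk, (struct sockaddr *)&sun, sizeof(sun)))
		return -1;

	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &uffd, sizeof(int));

	return sendmsg(sk, &msg, 0) < 0 ? -1 : 0;
}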
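
The core of process 2 is then just a loop on the UFFD. The
request_remote_page() helper below stands in for whatever wire protocol
we end up agreeing on towards the source system; it is not existing
code:

#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Placeholder: fetch the page at 'addr' from the source into 'buf'. */
extern int request_remote_page(uint64_t addr, void *buf);

static int handle_userfaults(int uffd)
{
	char page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		struct uffdio_copy copy;
		uint64_t addr;

		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);

		if (request_remote_page(addr, page))
			return -1;

		copy.dst  = addr;
		copy.src  = (unsigned long)page;
		copy.len  = PAGE_SIZE;
		copy.mode = 0;
		/* Places the page and wakes up the faulting thread. */
		if (ioctl(uffd, UFFDIO_COPY, &copy))
			return -1;
	}

	return 0;
}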


As this describes the solution I have implemented, it all sounds correct
to me. In addition to handling requests for pages (processes 2 and 3),
both page handlers need to know how to push unrequested pages at some
point in time to make sure the migration can finish (see the sketch
below).
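
What I have in mind for that is essentially the same UFFDIO_COPY path,
only driven by the handler instead of by a page fault. The iterator
below is just a placeholder for however we decide to track the pages
that have not been transferred yet:

#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define PAGE_SIZE 4096

/* Placeholders: page iterator and remote fetch, as in the sketch above. */
extern bool next_remaining_page(uint64_t *addr);
extern int request_remote_page(uint64_t addr, void *buf);

static int push_remaining_pages(int uffd)
{
	char page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
	uint64_t addr;

	while (next_remaining_page(&addr)) {
		struct uffdio_copy copy = {
			.dst = addr,
			.src = (unsigned long)page,
			.len = PAGE_SIZE,
		};

		if (request_remote_page(addr, page))
			return -1;
		/* Same UFFDIO_COPY as in the fault path, just unsolicited. */
		if (ioctl(uffd, UFFDIO_COPY, &copy))
			return -1;
	}

	return 0;
}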

Looking at the page-server, it is currently not clear to me how it fits
into this scenario. Right now it listens on a network port (like process
3 above) and writes the received pages to the local disk.

To serve as process 3 it would need to gain all the functionality that
my implementation currently provides.

Instead of receiving pages and writing them to disk, it needs to
receive page requests and read the pages from disk to the network.
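
Just to illustrate the direction of the data flow I mean, assuming a
trivial (made-up) request/reply format of address plus length, the
source side would look roughly like this:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct page_req {		/* made-up wire format */
	uint64_t addr;
	uint32_t len;
};

/* Placeholder: map a virtual address to an offset in the pages image. */
extern off_t addr_to_img_off(uint64_t addr);

static int serve_page_requests(int sk, int img_fd)
{
	struct page_req req;
	char buf[4096];

	while (read(sk, &req, sizeof(req)) == sizeof(req)) {
		if (req.len > sizeof(buf))
			return -1;
		if (pread(img_fd, buf, req.len,
			  addr_to_img_off(req.addr)) != (ssize_t)req.len)
			return -1;
		if (write(sk, buf, req.len) != (ssize_t)req.len)
			return -1;
	}

	return 0;
}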

This sounds like the opposite of what it is currently doing and, from my
point of view, it should either be a completely separate process, like
my implementation, or all of this functionality needs to be added to the
page-server. Also, the logic to push unrequested pages does not seem
like something the page-server can currently do or was designed to do.

So, from my point of view, the page-server and the remote page request
handler seem rather different in their functionality (besides both being
TCP servers). I suppose there are some points I am not seeing, so I hope
to understand the situation better from the answers to this email.
Thanks.

		Adrian

