[CRIU] [Qemu-devel] [PATCH 00/10] RFC: userfault

Thu Jul 3 07:08:53 PDT 2014

Hi Christopher,

On Thu, Jul 03, 2014 at 09:45:07AM -0400, Christopher Covington wrote:
> CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
> implement its pre-copy memory migration.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt
> 
> Would it make sense to use a similar interaction model of peeking and poking
> at /proc/pid/ files for post-copy memory migration facilities?

We plan to use the pagemap information to optimize precopy live
migration, but that's orthogonal to postcopy live migration.

We already combine precopy and postcopy live migration.

In addition to dirty bit tracking with the softdirty clear_refs
feature, the pagemap bits can also tell us, for example, which pages
are missing in the source node, replacing the current memcmp(0) that
avoids transferring zero pages. With pagemap we can skip a superfluous
zero page fault (David suggested this).
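
As a rough sketch of how the pagemap bits could drive that decision,
assuming the bit layout from the kernel's pagemap documentation (bit
63 present, bit 62 swapped, bit 55 soft-dirty), anonymous guest memory
and 4096 byte pages. The function names are hypothetical and this is
not what qemu does today:

```python
import struct

# Bit positions as described in the kernel's pagemap documentation.
PM_PRESENT = 1 << 63
PM_SWAPPED = 1 << 62
PM_SOFT_DIRTY = 1 << 55
ENTRY_SIZE = 8  # one 64-bit entry per virtual page


def classify(entry):
    """Hypothetical precopy decision for one pagemap entry."""
    present = bool(entry & PM_PRESENT)
    swapped = bool(entry & PM_SWAPPED)
    if not present and not swapped:
        # Never populated (assuming anonymous memory): nothing to read,
        # so neither the memcmp(0) nor the zero page fault is needed.
        return "skip"
    if entry & PM_SOFT_DIRTY:
        return "send"   # written since the last clear_refs
    return "clean"      # populated but not dirtied since clear_refs


def read_entries(pid, vaddr, npages, page_size=4096):
    """Read raw pagemap entries for npages starting at vaddr."""
    with open("/proc/%d/pagemap" % pid, "rb") as f:
        f.seek((vaddr // page_size) * ENTRY_SIZE)
        data = f.read(npages * ENTRY_SIZE)
    count = len(data) // ENTRY_SIZE
    # Entries are native-endian unsigned 64-bit integers.
    return list(struct.unpack("=%dQ" % count, data[:count * ENTRY_SIZE]))
```

Depending on the kernel and on privileges, some fields of the entries
may be hidden from unprivileged readers, so treat this only as an
illustration of the bit tests.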

Postcopy live migration poses a different problem. Without postcopy
there's no way to migrate 100GByte guests with heavy load inside
them. In fact, even the first "optimistic" precopy pass should only
migrate those pages that already got the dirty bit set by the time we
attempt to send them.

With postcopy we can also guarantee that the maximum amount of data
transferred during precopy+postcopy is twice the size of the guest. So
you know exactly the maximum time live migration will take depending
on your network bandwidth and it cannot fail no matter the load or the
size of the guest. Slowing down the guest with autoconverge isn't
needed anymore.
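
To make the bound concrete, here is a back-of-the-envelope
calculation. The 10 Gbit/s link is an assumed figure for illustration,
the function name is hypothetical, and protocol overhead is ignored:

```python
def max_migration_seconds(guest_bytes, bandwidth_bytes_per_sec):
    """Upper bound on transfer time: at most twice the guest size is sent."""
    return 2 * guest_bytes / bandwidth_bytes_per_sec


guest = 100 * 10**9        # 100 GByte guest
link = 10 * 10**9 / 8      # assumed 10 Gbit/s link, in bytes per second
print(max_migration_seconds(guest, link))  # 160.0 seconds
```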

The userfault only happens in the destination node. The problem we
face is that we must start the guest in the destination node even
though a significant amount of its memory is still in the source node.

With postcopy migration the pages are neither dirty nor present in
the destination node, they're just holes, and in fact we already know
exactly which ones are missing without having to check pagemap.

It's up to the guest OS which pages it decides to touch; we cannot
know in advance. We already know where the holes are, but we don't
know whether the guest will touch them during its runtime while the
memory is still externalized.

If the guest touches any hole we need to stop the guest somehow and
we must be notified immediately so we can transfer the page, fill the
hole, and let it continue ASAP.
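
A toy model of that sequence, only to illustrate the control flow. It
is plain Python with dictionaries standing in for guest memory, not
the kernel interface, and all names are hypothetical:

```python
class Destination:
    """Toy destination node: pages absent from self.pages are holes."""

    def __init__(self, source_pages):
        self.source = source_pages   # memory still on the source node
        self.pages = {}              # pages already transferred
        self.faults = 0              # number of userfaults taken

    def holes(self):
        """Page numbers not yet transferred; known without pagemap."""
        return sorted(set(self.source) - set(self.pages))

    def touch(self, pfn):
        """Guest access: on a hole, block, fetch the page, then go on."""
        if pfn not in self.pages:
            self.faults += 1                      # guest is stopped here
            self.pages[pfn] = self.source[pfn]    # transfer, fill the hole
        return self.pages[pfn]                    # guest continues


dest = Destination({0: b"a", 1: b"b"})
dest.touch(1)          # first access faults and fills the hole
dest.touch(1)          # second access finds the page already present
print(dest.faults)     # 1
print(dest.holes())    # [0]
```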

pagemap/clear_refs can't stop the guest and notify us immediately
when the guest touches a hole.

It's not just about the guest shadow mmu accesses. It could also be
O_DIRECT from qemu that triggers the fault; in that case GUP stops,
we fill the hole, and then GUP and O_DIRECT succeed without even
noticing they have been stopped by a userfault.

Thanks,
Andrea