[CRIU] [PATCH v5 3/5] Try to include userfaultfd with criu (part 1)

Pavel Emelyanov xemul at virtuozzo.com
Tue Mar 15 02:03:07 PDT 2016


On 03/15/2016 10:35 AM, Adrian Reber wrote:
> On Tue, Mar 15, 2016 at 01:33:56AM +0300, Pavel Emelyanov wrote:
>> On 03/15/2016 12:02 AM, Adrian Reber wrote:
>>> On Mon, Mar 14, 2016 at 11:42:11AM +0300, Pavel Emelyanov wrote:
>>>> On 03/12/2016 07:54 PM, Adrian Reber wrote:
>>>>> On Fri, Mar 11, 2016 at 06:25:52PM +0300, Pavel Emelyanov wrote:
>>>>>> On 03/11/2016 06:03 PM, Adrian Reber wrote:
>>>>>>> On Fri, Mar 11, 2016 at 04:08:10PM +0300, Pavel Emelyanov wrote:
>>>>>>>>
>>>>>>>>> +static void criu_init()
>>>>>>>>> +{
>>>>>>>>> +	/* TODO: return code checking */
>>>>>>>>> +	check_img_inventory();
>>>>>>>>> +	prepare_task_entries();
>>>>>>>>> +	prepare_pstree();
>>>>>>>>> +	collect_remaps_and_regfiles();
>>>>>>>>> +	prepare_shared_reg_files();
>>>>>>>>> +	prepare_remaps();
>>>>>>>>> +	prepare_mm_pid(root_item);
>>>>>>>>> +
>>>>>>>>> +	/* We found a PID */
>>>>>>>>> +	pr_debug("root_item->pid.virt %d\n", root_item->pid.virt);
>>>>>>>>> +	pr_debug("root_item->pid.real %d\n", root_item->pid.real);
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> This portion should really be resolved before merging. None of the above
>>>>>>>> has anything to do with the page_read, so please find out why the page
>>>>>>>> read engine doesn't work without it. If you need help with the code,
>>>>>>>> just drop me an e-mail and I'll help.
>>>>>>>
>>>>>>> I had a quick look, but I need to look at it in a bit more detail.
>>>>>>>
>>>>>>> If I leave out all those lines I get a segfault; I haven't checked yet,
>>>>>>> but I think it happens when accessing root_item->pid.virt.
>>>>>>
>>>>>> Ah! Indeed. You open the page read for the init task only. I believe the
>>>>>> proper fix would be to pass the pid of the process via the socket you use
>>>>>> to pass the uffd.
>>>>>>
>>>>>> Since we'll have to do it anyway in the future, I think this is worth doing
>>>>>> from the very beginning. And the lazy pages daemon should accept only one
>>>>>> such message (for your initial case).
>>>>>
>>>>> Getting the information about which directory contains the checkpoint
>>>>> from the main restore process, via the same mechanism as the userfaultfd
>>>>> FD, was also my initial plan. But, unfortunately, the lazy-pages server
>>>>> needs to open the checkpoint directory on its own, especially as I am
>>>>> currently working on the code for remote lazy-restore.
>>>>
>>>> No no no, I'm not talking about passing the directory with images via the
>>>> socket, but about finding out the PID of the task to work on. You get this
>>>> value (the pid) from root_item, but it should go via the socket as a raw
>>>> integer, together with the uffd descriptor.
>>>>
>>>> This line from patch #3, uffd.c file, uffd_listen() function:
>>>>
>>>>> +	rc = open_page_read(root_item->pid.virt, &pr, PR_TASK);
>>>>
>>>> there should not be any root_item-> dereferences; instead, the value of
>>>> pid.virt should be sent here by the criu restore side.
>>>
>>> Ah, okay. I understand now. If I do not use root_item->pid.virt to get the
>>> PID, but get it from somewhere else (hardcoded for a quick test), my code
>>> still behaves the same as before (segfaults, doesn't work, strange error
>>> messages) if I do not call the functions we are discussing. So I do not
>>> need them to get the PID, but looping over the VMAs of the checkpoint
>>> specified with -D just doesn't work without them.
>>>
>>> In my test case I get, for example, this output:
>>>
>>> (02.306731) Opened page read 1 (parent 0)
>>> (02.306733) lazy-pages: iov.iov_base 0x400000 (1 pages)
>>> (02.306734) lazy-pages: iov.iov_base 0x600000 (2 pages)
>>> (02.306736) lazy-pages: iov.iov_base 0x1ead000 (1 pages)
>>> (02.306737) lazy-pages: iov.iov_base 0x7fbdaac63000 (8 pages)
>>> (02.306738) lazy-pages: iov.iov_base 0x7fbdaac6c000 (1 pages)
>>> (02.306739) lazy-pages: iov.iov_base 0x7fbdaae83000 (3 pages)
>>> (02.306741) lazy-pages: iov.iov_base 0x7fbdaae8d000 (5 pages)
>>> (02.306742) lazy-pages: iov.iov_base 0x7ffc33a55000 (2 pages)
>>> (02.306743) lazy-pages: iov.iov_base 0x7ffc33a58000 (1 pages)
>>> (02.306744) lazy-pages: iov.iov_base 0x7ffc33a5d000 (2 pages)
>>> (02.306747) lazy-pages: Found 0 pages to be handled by UFFD
>>>
>>> and not a single page has been detected as MAP_ANONYMOUS and MAP_PRIVATE.
>>
>> But do you really need to scan vmas in the lazy pages daemon? Why not just
>> respond to ANY vaddr request on the uffd? For that you'd only need the
>> page-read engine.
>>
>> I mean the collect_uffd_pages code's list_for_each_entry(vma, &vmas->h, list)
>> loop.
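
(For reference, this is roughly the loop I had in mind. It is untested and
purely illustrative; get_page_data() is a made-up placeholder for the
page-read engine lookup, not an existing criu helper, and 4k pages are
assumed.)

#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#define PAGE_SZ		4096UL	/* assume 4k pages for the sketch */

/* Placeholder: fill buf with the page containing vaddr, 0 on success */
extern int get_page_data(unsigned long vaddr, void *buf);

static int serve_any_vaddr(int uffd)
{
	struct uffd_msg msg;
	struct uffdio_copy uc;
	char buf[PAGE_SZ];

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		unsigned long addr;

		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Round the faulting address down to a page boundary,
		 * fetch that page and feed it back via UFFDIO_COPY. */
		addr = msg.arg.pagefault.address & ~(PAGE_SZ - 1);
		if (get_page_data(addr, buf))
			return -1;

		memset(&uc, 0, sizeof(uc));
		uc.dst = addr;
		uc.src = (unsigned long)buf;
		uc.len = PAGE_SZ;
		if (ioctl(uffd, UFFDIO_COPY, &uc) < 0)
			return -1;
	}

	return 0;
}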
> 
> I need to know which pages exist so that I can, at some point, push the
> unrequested pages into the restored process. To be able to stop the
> lazy-pages daemon once the process is happily running, I need to transfer
> the pages that have not yet been requested into the process. A restored
> process might not touch certain pages for a long time (maybe hours or
> longer), and it would be good to be able to end the UFFD daemon at some
> point. Therefore I have to scan all existing pages to see whether they are
> UFFD eligible, and I have to track whether they have been requested. If
> they have not been requested after some time, I have to push those pages
> into the restored process so that the UFFD daemon can finish.

Hm... You're right, there has to be a well-defined termination point. To get
one I have two suggestions, though I'm not sure which would work better.

The first is to teach the lazy daemon to accept not just uffd + pid, but also
the regions in which the client is willing to get memory.
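
Roughly, the handshake could then carry something like the layout below next
to the descriptor. This is only a sketch; the names and the exact wire format
are entirely up to you:

/* Illustrative handshake message, sent over the UNIX socket together
 * with (or right after) the uffd passed via SCM_RIGHTS. */
struct lazy_region {
	unsigned long	start;	/* start of a region to be filled lazily */
	unsigned long	len;	/* its length in bytes */
};

struct lazy_hello {
	int			pid;		/* pid.virt of the task */
	unsigned int		nr_regions;	/* how many regions follow */
	struct lazy_region	regions[];	/* nr_regions entries */
};

Once every listed region has been filled, either on demand or pushed out in
the background, the daemon knows it can exit.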

The second suggestion is to just open the mm.img file and pull the vma entries
into the lazy daemon. There's no such helper in criu code yet, so introducing
one would be required.
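
For the second option, the helper could look roughly like the sketch below.
The image and protobuf calls (open_image()/pb_read_one() and the MmEntry
bits) are written from memory, so treat this as a sketch rather than a
drop-in patch:

/* Sketch: pull the vma entries for <pid> straight from mm.img into the
 * lazy-pages daemon. Needs the usual criu image/protobuf headers. */
static int collect_vmas_from_image(int pid)
{
	struct cr_img *img;
	MmEntry *mm = NULL;
	unsigned int i;
	int ret;

	img = open_image(CR_FD_MM, O_RSTR, pid);
	if (!img)
		return -1;

	ret = pb_read_one(img, &mm, PB_MM);
	close_image(img);
	if (ret < 0)
		return -1;

	for (i = 0; i < mm->n_vmas; i++) {
		VmaEntry *vma = mm->vmas[i];

		pr_debug("vma %lx-%lx\n", (unsigned long)vma->start,
			 (unsigned long)vma->end);
		/* remember the anon|private ones for uffd handling here */
	}

	mm_entry__free_unpacked(mm, NULL);
	return 0;
}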

And, not to block progress, I suggest we do this: you fix the existing
patches so that they just get rid of the root_item dereference and teach the
criu-to-lazy-daemon protocol to pass the pid. Then I merge them into the
criu-dev branch and wait for a patch that simplifies the vma management in
either of the above ways. After that we proceed to multiple processes and the
other stuff. Does this sound OK to you?
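
For the protocol part, the restore side basically only has to stuff the pid
into the same sendmsg() that already carries the uffd. A minimal sketch,
untested, with a made-up function name:

#include <string.h>
#include <sys/uio.h>
#include <sys/socket.h>

/* Send the uffd via SCM_RIGHTS plus the task pid as a raw int in the
 * data part of the same message. */
static int send_uffd_and_pid(int sk, int uffd, int pid)
{
	char cbuf[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { .iov_base = &pid, .iov_len = sizeof(pid) };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(cbuf, 0, sizeof(cbuf));

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));

	return sendmsg(sk, &msg, 0) < 0 ? -1 : 0;
}

On the daemon side the pid then arrives in the data part of the corresponding
recvmsg(), so uffd_listen() can call open_page_read(pid, &pr, PR_TASK) without
any root_item dereference.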

-- Pavel


