[CRIU] Manipulating VM areas before parasite code

Wed Jun 12 16:13:24 MSK 2019

OK. There were a couple of silly bugs, but this part now seems to be fixed.

On 12/06/2019 10:59, Maksym Planeta wrote:
> At this point I changed the function "unmap_old_vmas" so that it does not
> unmap the region I previously allocated.
>
> Now, instead of a segfault, I get:
> 
> (00.047557) `- Expecting exit
> (00.047598) 30480 was trapped
> (00.047622) 30480 (native) is going to execute the syscall 11, required 
> is 11
> (00.047723) 30480 was stopped
> (00.047759) Running pre-resume scripts
> (00.047796) Writing stats
> (00.048128) Running post-resume scripts
> *** stack smashing detected ***: <unknown> terminated
> 
> And again no core dump is generated, although the following commands do
> produce a core dump on the same terminal:
> 
> [ec2-user@ip-172-31-43-32 criu]$ sleep 1000 &
> [1] 30571
> [ec2-user@ip-172-31-43-32 criu]$ killall -SEGV  sleep
> [1]+  Segmentation fault      (core dumped) sleep 1000
> 
> 
> On 11/06/2019 22:57, Maksym Planeta wrote:
>> Hi,
>>
>> I think I found the reason for the segfault. It turns out that the
>> parasite code unmaps most of the memory (unmap_old_vmas), so whatever I
>> map before that gets lost.
>>
>> I'll write back if this does not fix the problem.
>>
>> On 11/06/2019 13:02, Maksym Planeta wrote:
>>> Hello,
>>>
>>> I'm trying to add checkpointing support for the ibverbs interface (one
>>> of the network interfaces with RDMA capabilities). To simplify the
>>> implementation, I add support only for ibverbs with SoftRoCE as a
>>> backend.
>>>
>>> I checkpoint the kernel part of the ibverbs state by adding an
>>> additional ibverbs call, so that the kernel serializes the state itself.
>>> To restore the state, however, I try to reuse existing ibverbs calls.
>>>
>>> For example, ibverbs has the concept of Memory Regions (MR), which
>>> represent pinned memory that can be used for RDMA. Memory is pinned by
>>> calling the ibv_reg_mr function on an aligned memory range that was
>>> previously allocated, for example by mmap. Registering an MR also
>>> incurs some bookkeeping on the kernel side.
>>>
>>> I reregister the original MR by calling ibv_reg_mr with the same
>>> parameters, but for the call to work I also need to make sure that
>>> the actual memory already exists and is mapped. It turned out that this
>>> is hard to guarantee in CRIU.
>>>
>>> The existing memory premapping does not work, because CRIU first maps
>>> memory into a temporary region and then remaps it to the final
>>> destination in the parasite code. Registering the memory region inside
>>> the parasite code does not work for another reason, which I can explain
>>> separately.
>>>
>>> As a result, I am trying to modify the CRIU code to create a mapping at
>>> the proper destination before the parasite code runs. I add an extra
>>> mmap call for each VM area (struct vma_area) that has at least part of
>>> it registered as an ibverbs MR, as follows:
>>>
>>>     addr = mmap((void *)vma->e->start, vma_entry_len(vma->e),
>>>             vma->e->prot | PROT_WRITE,
>>>             vma->e->flags | MAP_FIXED,
>>>             vma->e->fd, vma->e->pgoff);
>>>     if (addr == MAP_FAILED) {
>>>         pr_perror("Unable to map VMA_IBVERBS");
>>>         return -1;
>>>     }
>>>
>>> This call happens in premap_private_vma right before the original
>>> mmap. As a result, the VMA now has two mappings during the restore.
>>>
>>> Inside the parasite code I changed the vma_remap function to unmap the
>>> normally created region instead of remapping it to the final
>>> destination, as follows:
>>>
>>>     if (vma_entry_is(vma_entry, VMA_AREA_IBVERBS)) {
>>>         if (guard != 0) {
>>>             pr_err("No idea what to do with guard pages\n");
>>>             return -1;
>>>         }
>>>         sys_munmap((void *)src, len);
>>>         return 0;
>>>     }
>>>
>>> This code is added before "if (guard != 0) {"
>>>
>>> Now, if I try to restore the program, it crashes almost immediately.
>>> Here are the last lines of the restore log:
>>>
>>> (00.030087) Running post-restore scripts
>>> (00.030109) Unlock network
>>> (00.030146)     Running iptables [iptables -w -t filter -D INPUT 
>>> --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.189 --sport 
>>> 18515 --destination 172.16.2.127 --dport 38660 -j DROP]
>>> (00.037551) Unlocked 172.16.2.127:38660 - 172.16.2.189:18515 connection
>>> (00.037582)     Running iptables [iptables -w -t filter -D OUTPUT 
>>> --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.127 --sport 
>>> 38660 --destination 172.16.2.189 --dport 18515 -j DROP]
>>> (00.044836) Unlocked 172.16.2.189:18515 - 172.16.2.127:38660 connection
>>> (00.044946) pie: 18861: pie: Turning repair off for 5 (reuse 0)
>>> (00.045007) pie: 18861: seccomp: mode 0 on tid 18861
>>> (00.045119) Force no-breakpoints restore
>>> (00.045144) Restore finished successfully. Resuming tasks.
>>> (00.045184) 18861 was trapped
>>> (00.045212) 18861 (native) is going to execute the syscall 202, 
>>> required is 15
>>> (00.045263) 18861 was trapped
>>> (00.045280) `- Expecting exit
>>> (00.045320) 18861 was trapped
>>> (00.045348) 18861 (native) is going to execute the syscall 3, 
>>> required is 15
>>> (00.045401) 18861 was trapped
>>> (00.045419) `- Expecting exit
>>> (00.045459) 18861 was trapped
>>> (00.045484) 18861 (native) is going to execute the syscall 3, 
>>> required is 15
>>> (00.045529) 18861 was trapped
>>> (00.045547) `- Expecting exit
>>> (00.045587) 18861 was trapped
>>> (00.045612) 18861 (native) is going to execute the syscall 11, 
>>> required is 15
>>> (00.045693) 18861 was trapped
>>> (00.045710) `- Expecting exit
>>> (00.045760) 18861 was trapped
>>> (00.045786) 18861 (native) is going to execute the syscall 15, 
>>> required is 15
>>> (00.045839) 18861 was stopped
>>> (00.045928) 18861 was trapped
>>> (00.045954) 18861 (native) is going to execute the syscall 11, 
>>> required is 11
>>> (00.046055) 18861 was stopped
>>> (00.046089) Running pre-resume scripts
>>> (00.046121) Writing stats
>>> (00.046425) Running post-resume scripts
>>>
>>> In dmesg I see that there was a segfault:
>>>
>>> [84616.758753] ib_send_bw[18861]: segfault at 7f7333733bf0 ip 
>>> 00007f7332689579 sp 00007ffdc7a49f28 error 6 in 
>>> libc-2.26.so[7f73325c6000+1ad000]
>>> [84616.767105] Code: 05 48 3d 00 f0 ff ff 77 2a 89 d7 89 44 24 0c e8 
>>> 9d ee 03 00 8b 44 24 0c 48 83 c4 18 5b 5d c3 66 90 48 8b 15 e9 c8 2e 
>>> 00 f7 d8 <64> 89 02 b8 ff ff ff ff c3 48 8b 0d d7 c8 2e 00 f7 d8 64 
>>> 89 01 b8
>>>
>>> ib_send_bw is a test application I'm checkpointing.
>>>
>>> And the problem is that I don't know how to debug this issue, because 
>>> core dumps are not being created (I set ulimit -c unlimited) and 
>>> strace does not give me useful information.
>>>
>>> It seems that the program crashes almost immediately after the 
>>> parasite code ends, but I also don't know where exactly.
>>>
>>> Could you help me debug this problem?
>>>
>>> If required, I can share all my code (updates to CRIU, the ibverbs
>>> libraries, and the kernel).
>>>
>>
> 
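
For readers following the thread, here is a minimal standalone sketch of
the effect described in the quoted messages: a mapping created at a fixed
destination address is lost once a later step unmaps that range, and
mapping the same address again gives fresh zeroed pages rather than the old
contents. This is not CRIU code. The helper names (map_fixed_at,
range_is_mapped, premap_demo) are hypothetical, the plain munmap call only
stands in for what unmap_old_vmas does, and the msync-based check assumes
Linux behaviour (ENOMEM for an unmapped range).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map anonymous memory exactly at `want`
 * (page-aligned), replacing whatever is mapped there already. */
static void *map_fixed_at(void *want, size_t len)
{
    assert(len > 0);
    return mmap(want, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

/* Hypothetical helper: 1 if the whole page-aligned range is mapped.
 * Assumes Linux, where msync() fails with ENOMEM on unmapped memory. */
static int range_is_mapped(void *addr, size_t len)
{
    return msync(addr, len, MS_ASYNC) == 0;
}

/* Returns 0 if every step behaved as described, or a negative step
 * number if one of them did not. */
int premap_demo(void)
{
    size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);

    /* Let the kernel pick a free range and release it again, so the
     * MAP_FIXED calls below cannot clobber an unrelated mapping. */
    void *dst = mmap(NULL, len, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED)
        return -1;
    if (munmap(dst, len) != 0)
        return -2;

    /* Step 1: premap at the final destination and fill it. */
    char *p = map_fixed_at(dst, len);
    if (p == MAP_FAILED || (void *)p != dst)
        return -3;
    memset(p, 0x5a, len);
    if (!range_is_mapped(dst, len))
        return -4;

    /* Step 2: stand-in for a later blanket unmap of "old" areas. */
    if (munmap(dst, len) != 0)
        return -5;
    if (range_is_mapped(dst, len))
        return -6;

    /* Step 3: mapping the address again brings back the range,
     * but as zeroed pages; the earlier contents are gone. */
    p = map_fixed_at(dst, len);
    if (p == MAP_FAILED || (void *)p != dst)
        return -7;
    if (p[0] != 0)
        return -8;

    munmap(dst, len);
    return 0;
}
```

Calling premap_demo() from a small main and checking for a zero return is
enough to see the sequence. In the real restore path the range would have
to be excluded from the blanket unmap, as the quoted change to
unmap_old_vmas does, because remapping afterwards does not bring back the
contents or any registration tied to the old mapping.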

-- 
Regards,
Maksym Planeta