[CRIU] Manipulating VM areas before parasite code

Thu Jun 13 21:15:23 MSK 2019

Hi Maksym,

On Wed, Jun 12, 2019 at 6:14 AM Maksym Planeta
<mplaneta at os.inf.tu-dresden.de> wrote:
>
> OK. There were a couple of silly bugs, but now this part seems to be fixed.

Sorry for the late response. Good to know that you managed to solve
the problem yourself.

>
> On 12/06/2019 10:59, Maksym Planeta wrote:
> > At this point I changed the function "unmap_old_vmas" so that it no
> > longer unmaps the region I previously allocated.
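
For readers following the thread, the idea of leaving a region out of
the unmap pass can be sketched roughly as below. This is only an
illustration under my own assumptions: struct range and
ranges_to_unmap() are hypothetical names, not the actual CRIU code, and
the function only computes what would be unmapped instead of calling
munmap.

```c
#include <stddef.h>

/* Half-open address range [start, end). Hypothetical type, not CRIU's. */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical helper: given `n` ranges to keep (sorted by start, not
 * overlapping), fill `gaps` with every part of [0, task_size) that is
 * outside them. `gaps` must have room for n + 1 entries.
 * Returns the number of gaps written.
 */
size_t ranges_to_unmap(const struct range *keep, size_t n,
		       unsigned long task_size, struct range *gaps)
{
	unsigned long cursor = 0;	/* first address not yet classified */
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (keep[i].start > cursor)
			gaps[out++] = (struct range){ cursor, keep[i].start };
		if (keep[i].end > cursor)
			cursor = keep[i].end;
	}
	if (cursor < task_size)
		gaps[out++] = (struct range){ cursor, task_size };
	return out;
}
```

Adding the pre-allocated region to the keep list makes it survive the
pass; each returned gap is what a real implementation would pass to
munmap.
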
> >
> > Now, instead of segfault, I get:
> >
> > (00.047557) `- Expecting exit
> > (00.047598) 30480 was trapped
> > (00.047622) 30480 (native) is going to execute the syscall 11, required
> > is 11
> > (00.047723) 30480 was stopped
> > (00.047759) Running pre-resume scripts
> > (00.047796) Writing stats
> > (00.048128) Running post-resume scripts
> > *** stack smashing detected ***: <unknown> terminated
> >
> > And again, no core dump is generated, although the following commands
> > do produce a core dump in the same terminal:
> >
> > [ec2-user@ip-172-31-43-32 criu]$ sleep 1000 &
> > [1] 30571
> > [ec2-user@ip-172-31-43-32 criu]$ killall -SEGV sleep
> > [1]+  Segmentation fault      (core dumped) sleep 1000
> >
> >
> > On 11/06/2019 22:57, Maksym Planeta wrote:
> >> Hi,
> >>
> >> I think I found the reason for the segfault. It turns out that the
> >> parasite code unmaps most of the memory (unmap_old_vmas), so whatever
> >> I map before that gets lost.
> >>
> >> I'll write back if the problem is not fixed.
> >>
> >> On 11/06/2019 13:02, Maksym Planeta wrote:
> >>> Hello,
> >>>
> >>> I'm trying to add checkpointing support for the ibverbs interface
> >>> (one of the network interfaces with RDMA capabilities). To simplify
> >>> the implementation, I support only ibverbs with SoftRoCE as the
> >>> backend.
> >>>
> >>> I checkpoint the kernel part of the ibverbs state by adding an
> >>> additional ibverbs call, so that the kernel serializes the state
> >>> itself. To restore the state, however, I try to reuse existing
> >>> ibverbs calls.
> >>>
> >>> For example, ibverbs has the concept of Memory Regions (MRs), which
> >>> represent pinned memory that can be used for RDMA. Memory is pinned
> >>> by calling the ibv_reg_mr function on an aligned memory range that
> >>> was previously allocated, for example by mmap. Registering an MR
> >>> also incurs some bookkeeping on the kernel side.
> >>>
> >>> I re-register the original MR by calling ibv_reg_mr with the same
> >>> parameters, but for the call to work I also need to make sure that
> >>> the actual memory already exists and is mapped. It turned out to be
> >>> hard to guarantee the last part in CRIU.
> >>>
> >>> The existing memory premapping does not work, because CRIU first
> >>> maps memory into a temporary region and then remaps it to the final
> >>> destination in the parasite code. Registering the memory region
> >>> inside the parasite code does not work for another reason, which I
> >>> can explain separately.
> >>>
> >>> As a result, I am trying to modify the CRIU code to create a mapping
> >>> at the proper destination before the parasite code runs. I add an
> >>> additional mmap call for each VM area (struct vma_area) that has at
> >>> least part of it registered as an ibverbs MR, as follows:
> >>>
> >>>     addr = mmap((void *)vma->e->start, vma_entry_len(vma->e),
> >>>             vma->e->prot | PROT_WRITE,
> >>>             vma->e->flags | MAP_FIXED,
> >>>             vma->e->fd, vma->e->pgoff);
> >>>     if (addr == MAP_FAILED) {
> >>>         pr_perror("Unable to map VMA_IBVERBS");
> >>>         return -1;
> >>>     }
> >>>
> >>> This call happens in premap_private_vma right before the original
> >>> mmap. As a result, the VMA now has two mappings during restore.
> >>>
> >>> Inside the parasite code I changed the vma_remap function to unmap
> >>> the normally created region, instead of remapping it to the final
> >>> destination, as follows:
> >>>
> >>>     if (vma_entry_is(vma_entry, VMA_AREA_IBVERBS)) {
> >>>         if (guard != 0) {
> >>>             pr_err("No idea what to do with guard pages\n");
> >>>             return -1;
> >>>         }
> >>>         sys_munmap((void *)src, len);
> >>>         return 0;
> >>>     }
> >>>
> >>> This code is added before the existing "if (guard != 0) {" check.
> >>>
> >>> Now, if I try to restore the program, it crashes almost immediately.
> >>> Here are the last lines of the restore log:
> >>>
> >>> (00.030087) Running post-restore scripts
> >>> (00.030109) Unlock network
> >>> (00.030146)     Running iptables [iptables -w -t filter -D INPUT
> >>> --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.189 --sport
> >>> 18515 --destination 172.16.2.127 --dport 38660 -j DROP]
> >>> (00.037551) Unlocked 172.16.2.127:38660 - 172.16.2.189:18515 connection
> >>> (00.037582)     Running iptables [iptables -w -t filter -D OUTPUT
> >>> --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.127 --sport
> >>> 38660 --destination 172.16.2.189 --dport 18515 -j DROP]
> >>> (00.044836) Unlocked 172.16.2.189:18515 - 172.16.2.127:38660 connection
> >>> (00.044946) pie: 18861: pie: Turning repair off for 5 (reuse 0)
> >>> (00.045007) pie: 18861: seccomp: mode 0 on tid 18861
> >>> (00.045119) Force no-breakpoints restore
> >>> (00.045144) Restore finished successfully. Resuming tasks.
> >>> (00.045184) 18861 was trapped
> >>> (00.045212) 18861 (native) is going to execute the syscall 202,
> >>> required is 15
> >>> (00.045263) 18861 was trapped
> >>> (00.045280) `- Expecting exit
> >>> (00.045320) 18861 was trapped
> >>> (00.045348) 18861 (native) is going to execute the syscall 3,
> >>> required is 15
> >>> (00.045401) 18861 was trapped
> >>> (00.045419) `- Expecting exit
> >>> (00.045459) 18861 was trapped
> >>> (00.045484) 18861 (native) is going to execute the syscall 3,
> >>> required is 15
> >>> (00.045529) 18861 was trapped
> >>> (00.045547) `- Expecting exit
> >>> (00.045587) 18861 was trapped
> >>> (00.045612) 18861 (native) is going to execute the syscall 11,
> >>> required is 15
> >>> (00.045693) 18861 was trapped
> >>> (00.045710) `- Expecting exit
> >>> (00.045760) 18861 was trapped
> >>> (00.045786) 18861 (native) is going to execute the syscall 15,
> >>> required is 15
> >>> (00.045839) 18861 was stopped
> >>> (00.045928) 18861 was trapped
> >>> (00.045954) 18861 (native) is going to execute the syscall 11,
> >>> required is 11
> >>> (00.046055) 18861 was stopped
> >>> (00.046089) Running pre-resume scripts
> >>> (00.046121) Writing stats
> >>> (00.046425) Running post-resume scripts
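
To make the log above easier to read: assuming the restored task is
x86_64, the syscall numbers in the "is going to execute the syscall"
lines map to names as in the small table below. x86_64_syscall_name()
is a hypothetical helper written for this mail, and it covers only the
numbers that appear in the log.

```c
#include <stddef.h>

/*
 * Hypothetical lookup covering only the syscall numbers seen in the
 * restore log. The numbers are the x86_64 ones; other architectures
 * use different tables.
 */
const char *x86_64_syscall_name(long nr)
{
	switch (nr) {
	case 3:
		return "close";
	case 11:
		return "munmap";
	case 15:
		return "rt_sigreturn";
	case 202:
		return "futex";
	default:
		return NULL;	/* not listed here */
	}
}
```

So the task goes through futex, close, close and munmap before it
reaches the rt_sigreturn (15) that was being waited for, and after that
the munmap (11).
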
> >>>
> >>> In dmesg I see that there was a segfault:
> >>>
> >>> [84616.758753] ib_send_bw[18861]: segfault at 7f7333733bf0 ip
> >>> 00007f7332689579 sp 00007ffdc7a49f28 error 6 in
> >>> libc-2.26.so[7f73325c6000+1ad000]
> >>> [84616.767105] Code: 05 48 3d 00 f0 ff ff 77 2a 89 d7 89 44 24 0c e8
> >>> 9d ee 03 00 8b 44 24 0c 48 83 c4 18 5b 5d c3 66 90 48 8b 15 e9 c8 2e
> >>> 00 f7 d8 <64> 89 02 b8 ff ff ff ff c3 48 8b 0d d7 c8 2e 00 f7 d8 64
> >>> 89 01 b8
> >>>
> >>> ib_send_bw is a test application I'm checkpointing.
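
The dmesg line above already carries some information even without a
core dump. The sketch below, with hypothetical helper names and
assuming x86, decodes the "error" value and computes the offset of the
faulting instruction inside libc. For the values in the log it gives
offset 0xc3579 and "user-mode write to a page that is not present".

```c
#include <stdint.h>

/* x86 page-fault error code bits, as printed by the kernel after "error". */
#define PF_PROT  (1u << 0)	/* 0: page not present, 1: protection fault */
#define PF_WRITE (1u << 1)	/* 0: read access,      1: write access     */
#define PF_USER  (1u << 2)	/* 0: kernel mode,      1: user mode        */

struct pf_info {
	int not_present;	/* the faulting page was not mapped */
	int is_write;		/* the access was a write */
	int from_user;		/* the access came from user mode */
};

/* Hypothetical helper: split the error code into its three low bits. */
struct pf_info decode_pf_error(unsigned int code)
{
	struct pf_info info = {
		.not_present = !(code & PF_PROT),
		.is_write    = !!(code & PF_WRITE),
		.from_user   = !!(code & PF_USER),
	};
	return info;
}

/* Offset of `ip` inside an object mapped at `load_base`. */
uint64_t offset_in_object(uint64_t ip, uint64_t load_base)
{
	return ip - load_base;
}
```

The offset can be looked up with something like
"addr2line -e libc-2.26.so 0xc3579" or in a disassembly of the library.
The faulting address 7f7333733bf0 is outside the libc mapping
(7f73325c6000+1ad000), and the marked bytes "<64> 89 02" look like a
store through %fs, which libc typically uses for thread-local data such
as errno. That would be consistent with the thread-local area no longer
being mapped, but this is only my reading of the bytes.
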
> >>>
> >>> And the problem is that I don't know how to debug this issue, because
> >>> core dumps are not being created (I set ulimit -c unlimited) and
> >>> strace does not give me useful information.
> >>>
> >>> It seems that the program crashes almost immediately after the
> >>> parasite code ends, but I also don't know where exactly.
> >>>
> >>> Could you help me debug this problem?
> >>>
> >>> If required, I can share all my code (updates to CRIU, the ibverbs
> >>> libraries, and the kernel).
> >>>
> >>
> >
>
> --
> Regards,
> Maksym Planeta