[CRIU] Manipulating VM areas before parasite code

Tue Jun 11 14:02:43 MSK 2019

Hello,

I'm trying to add checkpointing support for the ibverbs interface (one of the network interfaces with RDMA capabilities). To simplify the implementation, I support only ibverbs with SoftRoCE as the backend.

I checkpoint the kernel part of the ibverbs state by adding an additional ibverbs call, so that the kernel serializes the state itself. To restore the state, however, I try to reuse the existing ibverbs calls.

For example, ibverbs has a concept of Memory Regions (MRs), which represent pinned memory that can be used for RDMA. Memory is pinned by calling ibv_reg_mr on an aligned memory range that was previously allocated, for example by mmap. Registering an MR also incurs some bookkeeping on the kernel side.

I re-register the original MR by calling ibv_reg_mr with the same parameters, but for the call to work I also need to make sure that the memory already exists and is mapped. It turned out that this last part is hard to guarantee in CRIU.
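The "already exists and is mapped" precondition can be probed from user space. The following is a minimal sketch, not CRIU or libibverbs code: range_is_mapped and demo_range_is_mapped are hypothetical names, and the check relies on mincore() failing with ENOMEM when the range contains unmapped pages.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CRIU or libibverbs).
 * Returns 1 if every page of [addr, addr + len) is mapped,
 * 0 if at least one page is not, and -1 on any other error.
 */
static int range_is_mapped(const void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	uintptr_t start, end;
	unsigned char *vec;
	int rc, err;

	if (len == 0)
		return -1;

	/* mincore() requires a page-aligned start address. */
	start = (uintptr_t)addr & ~(uintptr_t)(page - 1);
	end = ((uintptr_t)addr + len + page - 1) & ~(uintptr_t)(page - 1);

	vec = malloc((end - start) / page);
	if (!vec)
		return -1;

	rc = mincore((void *)start, end - start, vec);
	err = errno;
	free(vec);

	if (rc == 0)
		return 1;
	/* ENOMEM means the range contains unmapped memory. */
	return err == ENOMEM ? 0 : -1;
}

/* Self-check: map two pages, unmap the second, and probe both cases. */
static int demo_range_is_mapped(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	assert(range_is_mapped(p, 2 * page) == 1);
	munmap(p + page, page);
	assert(range_is_mapped(p, page) == 1);
	assert(range_is_mapped(p, 2 * page) == 0);
	munmap(p, page);
	return 0;
}
```

Such a check could be run on the address and length of an MR just before re-registering it, to tell a missing mapping apart from other registration failures.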

Existing memory premapping does not work, because CRIU first maps memory into a temporary region and then remaps it to its final destination in the parasite code. Registering the memory region inside the parasite code does not work for another reason, which I can explain separately.
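For readers unfamiliar with the premap-then-remap pattern, here is a minimal standalone sketch of it. It is not CRIU code: move_mapping and demo_premap_then_remap are hypothetical names, and the destination is a placeholder mapping rather than a real VMA address from an image.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Move a populated mapping to dst, replacing whatever is mapped there. */
static void *move_mapping(void *tmp, size_t len, void *dst)
{
	return mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
}

static int demo_premap_then_remap(void)
{
	size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *dst, *tmp, *res;

	/* Placeholder that stands in for the final destination address. */
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		return -1;

	/* Temporary region that is filled with the restored contents. */
	tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED) {
		munmap(dst, len);
		return -1;
	}
	memset(tmp, 0xAB, len);

	/* Until this point nothing usable exists at the destination. */
	res = move_mapping(tmp, len, dst);
	if (res == MAP_FAILED) {
		munmap(tmp, len);
		munmap(dst, len);
		return -1;
	}

	/* The contents moved together with the mapping. */
	assert(res == dst);
	assert(dst[0] == 0xAB && dst[len - 1] == 0xAB);
	munmap(dst, len);
	return 0;
}
```

The sketch shows why registration cannot happen early in this scheme: the memory holding the data only appears at its final address after the mremap step.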

As a result, I am trying to modify the CRIU code to create a mapping at the proper destination before the parasite code runs. For every VM area (struct vma_area) that has at least part of it registered as an ibverbs MR, I add an additional mmap call as follows:

	addr = mmap((void *)vma->e->start, vma_entry_len(vma->e),
		    vma->e->prot | PROT_WRITE,
		    vma->e->flags | MAP_FIXED,
		    vma->e->fd, vma->e->pgoff);
	if (addr == MAP_FAILED) {
		pr_perror("Unable to map VMA_IBVERBS");
		return -1;
	}

This call happens in premap_private_vma, right before the original mmap. As a result, the VMA now has two mappings during restore.

Inside the parasite code, I changed the vma_remap function so that it unmaps the normally created region instead of remapping it to the final destination, as follows:

	if (vma_entry_is(vma_entry, VMA_AREA_IBVERBS)) {
		if (guard != 0) {
			pr_err("No idea what to do with guard pages\n");
			return -1;
		}
		sys_munmap((void *)src, len);
		return 0;
	}

This code is added before "if (guard != 0) {".
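The two-mapping scheme can be illustrated outside CRIU with the sketch below. The name demo_two_mappings is hypothetical, and the sketch assumes that restored page contents are written into the temporary mapping. Whether that holds for the modified code is not stated here. Under that assumption, the contents have to be copied to the final mapping before the temporary one is unmapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int demo_two_mappings(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *final, *tmp;

	/* Stand-in for the extra mapping created at the destination. */
	final = mmap(NULL, len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (final == MAP_FAILED)
		return -1;

	/* Stand-in for the normally premapped temporary region. */
	tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED) {
		munmap(final, len);
		return -1;
	}

	/* Assumption: restored contents land in the temporary copy. */
	memset(tmp, 0xAB, len);

	/* The two private mappings are independent: final is still zeros. */
	assert(final[0] == 0);

	/* Copy before dropping the temporary region, or the data is lost. */
	memcpy(final, tmp, len);
	munmap(tmp, len);

	assert(final[0] == 0xAB && final[len - 1] == 0xAB);
	munmap(final, len);
	return 0;
}
```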

Now, if I try to restore the program, it crashes almost immediately. Here are the last lines of the restore log:

(00.030087) Running post-restore scripts
(00.030109) Unlock network
(00.030146)     Running iptables [iptables -w -t filter -D INPUT --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.189 --sport 18515 --destination 172.16.2.127 --dport 38660 -j DROP]
(00.037551) Unlocked 172.16.2.127:38660 - 172.16.2.189:18515 connection
(00.037582)     Running iptables [iptables -w -t filter -D OUTPUT --protocol tcp -m mark ! --mark 0xC114 --source 172.16.2.127 --sport 38660 --destination 172.16.2.189 --dport 18515 -j DROP]
(00.044836) Unlocked 172.16.2.189:18515 - 172.16.2.127:38660 connection
(00.044946) pie: 18861: pie: Turning repair off for 5 (reuse 0)
(00.045007) pie: 18861: seccomp: mode 0 on tid 18861
(00.045119) Force no-breakpoints restore
(00.045144) Restore finished successfully. Resuming tasks.
(00.045184) 18861 was trapped
(00.045212) 18861 (native) is going to execute the syscall 202, required is 15
(00.045263) 18861 was trapped
(00.045280) `- Expecting exit
(00.045320) 18861 was trapped
(00.045348) 18861 (native) is going to execute the syscall 3, required is 15
(00.045401) 18861 was trapped
(00.045419) `- Expecting exit
(00.045459) 18861 was trapped
(00.045484) 18861 (native) is going to execute the syscall 3, required is 15
(00.045529) 18861 was trapped
(00.045547) `- Expecting exit
(00.045587) 18861 was trapped
(00.045612) 18861 (native) is going to execute the syscall 11, required is 15
(00.045693) 18861 was trapped
(00.045710) `- Expecting exit
(00.045760) 18861 was trapped
(00.045786) 18861 (native) is going to execute the syscall 15, required is 15
(00.045839) 18861 was stopped
(00.045928) 18861 was trapped
(00.045954) 18861 (native) is going to execute the syscall 11, required is 11
(00.046055) 18861 was stopped
(00.046089) Running pre-resume scripts
(00.046121) Writing stats
(00.046425) Running post-resume scripts

In dmesg I see that there was a segfault:

[84616.758753] ib_send_bw[18861]: segfault at 7f7333733bf0 ip 00007f7332689579 sp 00007ffdc7a49f28 error 6 in libc-2.26.so[7f73325c6000+1ad000]
[84616.767105] Code: 05 48 3d 00 f0 ff ff 77 2a 89 d7 89 44 24 0c e8 9d ee 03 00 8b 44 24 0c 48 83 c4 18 5b 5d c3 66 90 48 8b 15 e9 c8 2e 00 f7 d8 <64> 89 02 b8 ff ff ff ff c3 48 8b 0d d7 c8 2e 00 f7 d8 64 89 01 b8

ib_send_bw is a test application I'm checkpointing.
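The "error 6" field in the dmesg line is the x86 page-fault error code. A small sketch for decoding it follows. The struct and function names are illustrative, and the bit layout is the standard x86 one (bit 0 protection, bit 1 write, bit 2 user mode, bit 3 reserved bit, bit 4 instruction fetch).

```c
#include <assert.h>
#include <stdio.h>

/* Decoded view of the "error N" field; names are illustrative. */
struct pf_error {
	int protection;	/* bit 0: 1 = protection violation, 0 = page not present */
	int write;	/* bit 1: 1 = write access, 0 = read access */
	int user;	/* bit 2: 1 = fault raised in user mode */
	int reserved;	/* bit 3: reserved bit set in a page-table entry */
	int ifetch;	/* bit 4: fault on an instruction fetch */
};

static struct pf_error decode_pf_error(unsigned long code)
{
	struct pf_error e;

	e.protection = (code >> 0) & 1;
	e.write = (code >> 1) & 1;
	e.user = (code >> 2) & 1;
	e.reserved = (code >> 3) & 1;
	e.ifetch = (code >> 4) & 1;
	return e;
}

static void print_pf_error(unsigned long code)
{
	struct pf_error e = decode_pf_error(code);

	printf("error %lu: %s-mode %s, %s\n", code,
	       e.user ? "user" : "kernel",
	       e.ifetch ? "instruction fetch" : e.write ? "write" : "read",
	       e.protection ? "protection violation" : "page not present");
}

/* Self-check against the value from the log line above. */
static void check_error_6(void)
{
	struct pf_error e = decode_pf_error(6);

	assert(!e.protection && e.write && e.user);
}
```

Decoded this way, error 6 is a user-mode write to a page that is not present, which suggests that nothing is mapped at the faulting address 7f7333733bf0 at the time of the access.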

The problem is that I don't know how to debug this issue: core dumps are not created (even though I set ulimit -c unlimited), and strace does not give me useful information.
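One thing that might be worth checking is the core-size limit of the checkpointed process itself, on the assumption (not verified here) that a restored process may carry the resource limits saved at dump time rather than those of the shell that ran the restore. A minimal sketch using prlimit() follows; get_core_limit and print_core_limit are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read RLIMIT_CORE of an arbitrary process. */
static int get_core_limit(pid_t pid, struct rlimit *out)
{
	return prlimit(pid, RLIMIT_CORE, NULL, out);
}

/* Print the soft core-size limit of pid; returns 0 on success. */
static int print_core_limit(pid_t pid)
{
	struct rlimit lim;

	if (get_core_limit(pid, &lim)) {
		perror("prlimit");
		return -1;
	}
	/* The soft limit never exceeds the hard limit. */
	assert(lim.rlim_cur <= lim.rlim_max);

	if (lim.rlim_cur == RLIM_INFINITY)
		printf("pid %d: core size unlimited\n", (int)pid);
	else
		printf("pid %d: soft core limit %llu bytes\n", (int)pid,
		       (unsigned long long)lim.rlim_cur);
	return 0;
}
```

Calling print_core_limit with the PID of the original process before the dump would show which limit the images are likely to contain.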

It seems that the program crashes almost immediately after the parasite code ends, but I don't know where exactly.

Could you help me debug this problem?

If required, I can share all my code (updates to CRIU, the ibverbs libraries and the kernel).

-- 
Regards,
Maksym Planeta