[CRIU] Restore may fail due to PID number overflow

Pavel Emelyanov xemul at virtuozzo.com
Sat Feb 20 06:33:22 PST 2016


On 02/20/2016 05:22 PM, Evgenii Shatokhin wrote:
> Hi,
> 
> When CRIU is used to checkpoint and then restore a number of processes 
> (live zdtm tests, actually) running in their own pid namespace, restore 
> fails with the following error in rare cases:
> 
>     Error (cr-restore.c:1573): Pid 300 do not match expected 32768
> 
> I am new to CRIU and cannot say right now how to fix this properly, so 
> your suggestions are appreciated.
> 
> As far as I can see in the code, the problem is in
> pstree.c, prepare_pstree_ids():
> -------------
> 	/* Try to find helpers, who should be connected to the leader */
> 	list_for_each_entry(child, &helpers, sibling) {
> 		if (child->state != TASK_HELPER)
> 			continue;
> 
> 		if (child->sid != item->sid)
> 			continue;
> 
> 		child->pgid = item->pgid;
> 		child->pid.virt = ++max_pid;
> 		child->parent = item;
> 		list_move(&child->sibling, &item->children);
> 
> 		pr_info("Attach %d to the task %d\n",
> 				child->pid.virt, item->pid.virt);
> 
> 		break;
> 	}
> -------------
> 
> max_pid may become 32768 after the increment, and this value is saved in 
> child->pid.virt.
> 
> However, when that process is spawned, the OS cannot give it a PID number 
> greater than or equal to the maximum (/proc/sys/kernel/pid_max contains 
> 32768 in that case). Thus the OS gives it the smallest unused PID number 
> not less than 300, as it should.
> 
> When that process executes restore_task_with_children() (cr-restore.c), 
> it compares its stored and real PID numbers, sees the mismatch and 
> reports failure:
> -------------
> pid = getpid();
> if (current->pid.virt != pid) {
> 	pr_err("Pid %d do not match expected %d\n", pid,
> 		current->pid.virt);
> 	set_task_cr_err(EEXIST);
> 	goto err;
> }
> -------------
> 
> If I hack prepare_pstree_ids() as follows, the problem is gone, but, 
> obviously, this is not a proper solution:
> -------------
>                          child->pgid = item->pgid;
> -                       child->pid.virt = ++max_pid;
> +
> +                       max_pid++;
> +                       if (max_pid == 32768)
> +                               max_pid = 300;
> +
> +                       child->pid.virt = max_pid;
>                          child->parent = item;
> -------------
> 
> As for the maximum value of PIDs, one can get it from 
> /proc/sys/kernel/pid_max, I suppose.

Well, the intention of this code is to find an _unused_ pid :) It was the
simplest way to do it, not the correct one.

> The tricky part is how to find the smallest unused PID number >= 300 at 
> that point. Any ideas?

There's a patch set from Andrey that should fix this issue.

https://lists.openvz.org/pipermail/criu/2016-February/025584.html

-- Pavel
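
For illustration only, the question raised in the thread (how to pick an
unused PID once max_pid reaches the limit) could be sketched roughly as
below. This is not the code from the patch set linked above. The names
read_pid_max(), pid_in_use(), next_free_pid() and the RESERVED_PIDS
constant are hypothetical, and the flat array of used PIDs stands in for
whatever CRIU actually keeps in its process tree. The sketch assumes the
limit comes from /proc/sys/kernel/pid_max and that allocation restarts at
300 after wrapping, as observed in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumption: PID allocation restarts at 300 after wrapping around,
 * matching the behaviour described in the report. */
#define RESERVED_PIDS 300

/* Hypothetical helper: read the PID limit from a file, normally
 * /proc/sys/kernel/pid_max. Returns -1 if it cannot be read or parsed. */
long read_pid_max(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Hypothetical helper: linear search in a plain array of used PIDs. */
bool pid_in_use(const pid_t *used, size_t n, pid_t pid)
{
	for (size_t i = 0; i < n; i++)
		if (used[i] == pid)
			return true;
	return false;
}

/* Scan upward from last + 1, wrapping to RESERVED_PIDS once pid_max is
 * reached, and return the first PID not in 'used'. Returns -1 if every
 * PID in [RESERVED_PIDS, pid_max) is taken or the range is empty. */
pid_t next_free_pid(const pid_t *used, size_t n, pid_t last, long pid_max)
{
	long span = pid_max - RESERVED_PIDS;
	long cand = (long)last + 1;

	assert(used != NULL || n == 0);
	if (span <= 0)
		return -1;

	/* Each iteration tests exactly one candidate, so 'span' iterations
	 * cover the whole allowed range once. */
	for (long tried = 0; tried < span; tried++, cand++) {
		if (cand >= pid_max || cand < RESERVED_PIDS)
			cand = RESERVED_PIDS;
		if (!pid_in_use(used, n, (pid_t)cand))
			return (pid_t)cand;
	}
	return -1;
}
```

With pid_max read as 32768 and the last assigned PID equal to 32767, a call
such as next_free_pid(used, n, 32767, 32768) would wrap around and return
the first free PID at or above 300, instead of the out-of-range 32768 that
++max_pid produces.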

