[CRIU] Restore may fail due to PID number overflow
Pavel Emelyanov
xemul at virtuozzo.com
Sat Feb 20 06:33:22 PST 2016
On 02/20/2016 05:22 PM, Evgenii Shatokhin wrote:
> Hi,
>
> When CRIU is used to checkpoint and then restore a number of processes
> (live zdtm tests, actually) running in their own pid namespace, restore
> fails with the following error in rare cases:
>
> Error (cr-restore.c:1573): Pid 300 do not match expected 32768
>
> I am new to CRIU and cannot say right now how to fix this properly, so
> your suggestions are appreciated.
>
> As far as I can see in the code, the problem is in
> pstree.c, prepare_pstree_ids():
> -------------
> 	/* Try to find helpers, who should be connected to the leader */
> 	list_for_each_entry(child, &helpers, sibling) {
> 		if (child->state != TASK_HELPER)
> 			continue;
>
> 		if (child->sid != item->sid)
> 			continue;
>
> 		child->pgid = item->pgid;
> 		child->pid.virt = ++max_pid;
> 		child->parent = item;
> 		list_move(&child->sibling, &item->children);
>
> 		pr_info("Attach %d to the task %d\n",
> 			child->pid.virt, item->pid.virt);
>
> 		break;
> 	}
> -------------
>
> max_pid may become 32768 after the increment, and this value is saved in
> child->pid.virt.
>
> However, when that process is spawned, the OS cannot give it a PID number
> greater than or equal to the maximum (/proc/sys/kernel/pid_max contains
> 32768 in that case). Thus the OS gives it the smallest unused PID number
> not less than 300, as it should.
>
> When that process executes restore_task_with_children() (cr-restore.c),
> it compares its stored and real PID numbers, sees the mismatch and
> reports failure:
> -------------
> 	pid = getpid();
> 	if (current->pid.virt != pid) {
> 		pr_err("Pid %d do not match expected %d\n", pid,
> 			current->pid.virt);
> 		set_task_cr_err(EEXIST);
> 		goto err;
> 	}
> -------------
>
> If I hack prepare_pstree_ids() as follows, the problem is gone, but,
> obviously, this is not a proper solution:
> -------------
>  		child->pgid = item->pgid;
> -		child->pid.virt = ++max_pid;
> +
> +		max_pid++;
> +		if (max_pid == 32768)
> +			max_pid = 300;
> +
> +		child->pid.virt = max_pid;
>  		child->parent = item;
> -------------
>
> As for the maximum value of PIDs, one can get it from
> /proc/sys/kernel/pid_max, I suppose.
Well, the intention of this code is to find an _unused_ pid :) It was the
simplest way to do it, not the correct one.
> The tricky part is how to find the smallest unused PID number >= 300 at
> that point. Any ideas?
There's a patch set from Andrey that should fix this issue.
https://lists.openvz.org/pipermail/criu/2016-February/025584.html
-- Pavel
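
For illustration only, here is a minimal sketch of the "smallest unused PID
not less than 300" search discussed above, with the limit read from
/proc/sys/kernel/pid_max instead of being hard-coded. It is not the code from
the patch set linked above and it does not use CRIU's data structures: the
helpers read_pid_max(), pid_in_use() and next_unused_pid(), the flat array of
used PIDs, and the PID_MAX_DEFAULT fallback are all hypothetical names
introduced for this example. The value 300 is taken from the report above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RESERVED_PIDS   300     /* lowest PID handed out after a wrap (from the report) */
#define PID_MAX_DEFAULT 32768   /* assumed fallback when the sysctl cannot be read */

/* Read the exclusive upper bound for PIDs; fall back to the default. */
static int read_pid_max(void)
{
	int val = PID_MAX_DEFAULT;
	FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

	if (f) {
		if (fscanf(f, "%d", &val) != 1 || val <= RESERVED_PIDS)
			val = PID_MAX_DEFAULT;
		fclose(f);
	}
	return val;
}

/* Linear lookup; a real implementation would use a faster structure. */
static bool pid_in_use(const int *used, size_t n, int pid)
{
	for (size_t i = 0; i < n; i++)
		if (used[i] == pid)
			return true;
	return false;
}

/*
 * Return the first unused PID after `last`, wrapping to RESERVED_PIDS
 * once pid_max is reached. Returns -1 if no free PID was found.
 */
static int next_unused_pid(const int *used, size_t n, int last, int pid_max)
{
	assert(pid_max > RESERVED_PIDS);
	int pid = last;

	/* pid_max iterations are enough to visit every candidate once. */
	for (int tries = 0; tries < pid_max; tries++) {
		pid++;
		if (pid >= pid_max)
			pid = RESERVED_PIDS;
		if (!pid_in_use(used, n, pid))
			return pid;
	}
	return -1;
}
```

With used PIDs {1, 300, 301, 32767}, a last PID of 32767 and a pid_max of
32768, next_unused_pid() wraps around and returns 302 rather than 32768. A
caller would pass read_pid_max() as the last argument and treat -1 as "no PID
available".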