[CRIU] Restore may fail due to PID number overflow
Evgenii Shatokhin
eshatokhin at virtuozzo.com
Sat Feb 20 06:22:06 PST 2016
Hi,
When CRIU is used to checkpoint and then restore a number of processes
(live zdtm tests, actually) running in their own pid namespace, restore
fails with the following error in rare cases:
Error (cr-restore.c:1573): Pid 300 do not match expected 32768
I am new to CRIU and cannot say right now how to fix this properly, so
your suggestions are appreciated.
As far as I can see in the code, the problem is in
pstree.c, prepare_pstree_ids():
-------------
/* Try to find helpers, who should be connected to the leader */
list_for_each_entry(child, &helpers, sibling) {
        if (child->state != TASK_HELPER)
                continue;

        if (child->sid != item->sid)
                continue;

        child->pgid = item->pgid;
        child->pid.virt = ++max_pid;
        child->parent = item;
        list_move(&child->sibling, &item->children);

        pr_info("Attach %d to the task %d\n",
                child->pid.virt, item->pid.virt);

        break;
}
-------------
max_pid may become 32768 after the increment, and this value is saved in
child->pid.virt.
However, when that process is spawned, the OS cannot give it a PID number
greater than or equal to the maximum (/proc/sys/kernel/pid_max contains
32768 in that case). Thus the OS wraps around and gives it the smallest
unused PID number not less than 300, as it should.
When that process executes restore_task_with_children() (cr-restore.c),
it compares its stored and real PID numbers, sees the mismatch and
reports failure:
-------------
pid = getpid();
if (current->pid.virt != pid) {
        pr_err("Pid %d do not match expected %d\n", pid,
               current->pid.virt);
        set_task_cr_err(EEXIST);
        goto err;
}
-------------
If I hack prepare_pstree_ids() as follows, the problem is gone, but,
obviously, this is not a proper solution:
-------------
         child->pgid = item->pgid;
-        child->pid.virt = ++max_pid;
+
+        max_pid++;
+        if (max_pid == 32768)
+                max_pid = 300;
+
+        child->pid.virt = max_pid;
         child->parent = item;
-------------
As for the maximum value of PIDs, one can get it from
/proc/sys/kernel/pid_max, I suppose.
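A minimal sketch of reading that limit instead of hard-coding 32768. Note that read_pid_max() is a hypothetical helper written for illustration, not an existing CRIU function; the path is passed in as a parameter so the helper can also be tried against an ordinary file:

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of CRIU): read a single integer,
 * such as the contents of /proc/sys/kernel/pid_max, from 'path'.
 * Returns the value on success or -1 if the file cannot be opened
 * or does not start with a number.
 */
long read_pid_max(const char *path)
{
        FILE *f = fopen(path, "r");
        long val = -1;

        if (!f)
                return -1;

        if (fscanf(f, "%ld", &val) != 1)
                val = -1;

        fclose(f);
        return val;
}
```

It would be called as read_pid_max("/proc/sys/kernel/pid_max"), and the result would replace the constant 32768 in the hack above.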
The tricky part is how to find the smallest unused PID number >= 300 at
that point. Any ideas?
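One possible direction, as a rough sketch only: if the PID numbers already taken in the restored tree are available as a plain array, the next candidate can be found by scanning upward and wrapping at the maximum. Everything here is an assumption made for illustration: the names next_free_pid(), pid_in_use() and MIN_WRAP_PID are hypothetical, the array stands in for whatever structure pstree.c really keeps, and the lower bound of 300 is simply the value observed above. Whether this matches the kernel's choice in every case is not verified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed lower bound after a wrap-around (the 300 seen in the error). */
#define MIN_WRAP_PID 300

/* Linear scan for brevity; a real tree would want a bitmap or sorted set. */
bool pid_in_use(const int *used, size_t n, int pid)
{
        for (size_t i = 0; i < n; i++)
                if (used[i] == pid)
                        return true;
        return false;
}

/*
 * Hypothetical helper: return the first unused PID after 'last'
 * (assumed >= 0), wrapping to MIN_WRAP_PID once 'pid_max' is reached.
 * Returns -1 if no free PID is found within pid_max attempts.
 */
int next_free_pid(const int *used, size_t n, int last, int pid_max)
{
        int candidate = last + 1;

        for (int tries = 0; tries < pid_max; tries++, candidate++) {
                if (candidate >= pid_max)
                        candidate = MIN_WRAP_PID;
                if (!pid_in_use(used, n, candidate))
                        return candidate;
        }
        return -1;
}
```

With used PIDs {1, 300, 301, 32767}, last = 32767 and pid_max = 32768, the scan wraps to 300, skips 300 and 301, and picks 302.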
Regards,
Evgenii