[CRIU] The case for unprivileged writes to ns_last_pid
Nicolas Viennot
Nicolas.Viennot at twosigma.com
Wed Feb 5 23:49:59 MSK 2020
Dear all,
I would like to get us closer to running CRIU without requiring privileged
permissions. One of the roadblocks is writing to /proc/sys/kernel/ns_last_pid.
The kernel requires CAP_SYS_ADMIN in the current pid namespace. This is also
true for requesting a specific pid with sys_clone3() (its set_tid field).
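For reference, the privileged path can be sketched as follows: store the
desired pid minus one in ns_last_pid, then fork. This is a minimal Python
sketch, not CRIU's code; the function name request_next_pid and its path
parameter are made up for illustration. Without CAP_SYS_ADMIN, the open or the
write is expected to fail with a permission error.

```python
NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"

def request_next_pid(target_pid, path=NS_LAST_PID):
    """Ask the kernel to hand out target_pid on the next fork.

    Hypothetical helper: writing to the real file needs CAP_SYS_ADMIN,
    otherwise a PermissionError is expected.
    """
    if target_pid < 1:
        raise ValueError("target_pid must be a positive pid")
    # The kernel allocates last_pid + 1 next, so store target_pid - 1.
    with open(path, "w") as f:
        f.write(str(target_pid - 1))
```

An os.fork() issued right after the write should then receive target_pid,
unless another process in the same pid namespace forks in between or the pid
is already in use.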
I don't think it's actually useful to enforce such a security rule. We can
achieve the same without privileges. Essentially, by calling fork() many times, we can
cycle through pids at a rate of 100,000 pid/s. Here's a tool I wrote using
this technique: https://github.com/twosigma/set_ns_last_pid
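The idea can be sketched in a few lines. The code below is an illustrative
Python sketch, not the code of the tool linked above; read_last_pid, cycle_to
and max_forks are names chosen here. It relies only on ns_last_pid being
readable without privileges. It is racy by nature (other processes fork too,
and pids still in use are skipped), and a Python loop will likely be much
slower than the rate quoted above.

```python
import os

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"  # readable without privileges

def read_last_pid(path=NS_LAST_PID):
    """Return the most recently allocated pid in this pid namespace."""
    with open(path) as f:
        return int(f.read())

def cycle_to(target_pid, max_forks=100_000):
    """Burn pids until the next fork should receive target_pid.

    Returns True on success, False if max_forks was exhausted.
    """
    for _ in range(max_forks):
        if read_last_pid() == target_pid - 1:
            return True
        pid = os.fork()
        if pid == 0:
            os._exit(0)     # child: its only job was to consume one pid
        os.waitpid(pid, 0)  # parent: reap the child so no zombies pile up
    return False
```

If cycle_to(target_pid) returns True, the caller would fork immediately and
check that the child's pid matches, retrying when another process won the race.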
The default value for pid_max is 32768, so it takes ~300ms to cycle through all
pids. This suggests that a program can easily bypass CAP_SYS_ADMIN to control
the next pid. This approach has the limitation that the pids wrap to 300
(defined as RESERVED_PIDS in the kernel), instead of pid 1, so this technique
cannot control pids smaller than 300.
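The arithmetic above can be checked with a simplified model. The sketch below
assumes pids are handed out strictly in sequence, wrap from pid_max - 1 to
RESERVED_PIDS, and that no pid in the range is still in use; the function
names are illustrative, and the rate is the figure quoted above rather than a
measurement.

```python
RESERVED_PIDS = 300   # wrap-around point named in this message
PID_MAX = 32768       # default pid_max
FORK_RATE = 100_000   # pids per second, the rate quoted above

def forks_needed(last_pid, target_pid, pid_max=PID_MAX):
    """Forks to burn so that the following fork gets target_pid."""
    if not RESERVED_PIDS <= target_pid < pid_max:
        raise ValueError("target_pid cannot be reached by cycling")
    want_last = target_pid - 1
    if last_pid <= want_last:
        return want_last - last_pid
    # Run up to pid_max - 1, wrap to RESERVED_PIDS, continue to want_last.
    return (pid_max - 1 - last_pid) + (target_pid - RESERVED_PIDS)

def seconds_needed(last_pid, target_pid, pid_max=PID_MAX, rate=FORK_RATE):
    """Estimated time to reach target_pid at the given fork rate."""
    return forks_needed(last_pid, target_pid, pid_max) / rate
```

For example, going from last pid 1000 to target pid 500 takes 31967 forks in
this model, roughly 0.32 s at the quoted rate, while any target below 300 is
rejected as unreachable.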
Per my research, this pid wrap-around at 300 was implemented to prevent a user
from DoS-ing the machine. A user could use up all the pids on the machine,
preventing an administrator from logging in. Only the root user would be
able to allocate pids < 300. In current kernels, the root user also gets
pid=300 on wrap-around, so this RESERVED_PIDS logic serves no purpose.
1) I was wondering if we can make the case on the kernel mailing list that the
CAP_SYS_ADMIN check could be removed. I suggest the following proposals:
a) Any process can control the next pids of its current and nested namespaces,
without CAP_SYS_ADMIN.
If this is too strong, then I suggest the following:
b) Let only the init process (pid=1) control the next pids without
privileges. Other processes still need CAP_SYS_ADMIN.
2) Orthogonally, I suggest removing RESERVED_PIDS and letting pids wrap at 1.
What do you guys think?
I'm happy to write the kernel patches.
Nico