[CRIU] The case for unprivileged writes to ns_last_pid

Fri Feb 7 11:26:17 MSK 2020

Wed, 5 Feb 2020 at 23:51, Nicolas Viennot <Nicolas.Viennot at twosigma.com>:
>
> Dear all,
>
> I would like to get us closer to running CRIU without requiring privileged
> permissions. One of the roadblocks is writing to /proc/sys/kernel/ns_last_pid.
> The kernel requires CAP_SYS_ADMIN in the current pid namespace. This is also
> true for choosing a pid with sys_clone3().
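
For readers who have not seen the ns_last_pid path, here is a minimal sketch
of it, assuming a Linux /proc mount. The helper names read_ns_last_pid() and
fork_with_pid() are made up for this example and are not CRIU functions. The
write is the step that is expected to fail without the capability.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NS_LAST_PID "/proc/sys/kernel/ns_last_pid"

/* Hypothetical helper: return the last pid allocated in the caller's
 * pid namespace, or -errno on failure. */
long read_ns_last_pid(void)
{
    FILE *f = fopen(NS_LAST_PID, "r");
    long v;

    if (!f)
        return -errno;
    if (fscanf(f, "%ld", &v) != 1) {
        fclose(f);
        return -EIO;
    }
    fclose(f);
    return v;
}

/* Hypothetical helper: store want - 1 in ns_last_pid, then fork so that
 * the child is likely to get pid `want`. The child exits at once.
 * Returns the child's pid, or -errno (e.g. when the write is refused). */
long fork_with_pid(pid_t want)
{
    char buf[32];
    int fd, len, err;
    ssize_t n;
    pid_t pid;

    if (want < 1)
        return -EINVAL;
    fd = open(NS_LAST_PID, O_RDWR);
    if (fd < 0)
        return -errno;
    /* Serialize against other cooperating writers of the same file. */
    if (flock(fd, LOCK_EX) < 0) {
        err = errno;
        close(fd);
        return -err;
    }
    len = snprintf(buf, sizeof(buf), "%ld", (long)want - 1);
    assert(len > 0 && (size_t)len < sizeof(buf));
    n = write(fd, buf, (size_t)len);
    if (n != len) {
        err = n < 0 ? errno : EIO;
        close(fd);              /* closing the fd also drops the lock */
        return -err;
    }
    pid = fork();
    if (pid == 0)
        _exit(0);               /* child: nothing to do in this sketch */
    err = errno;
    close(fd);
    if (pid < 0)
        return -err;
    waitpid(pid, NULL, 0);
    return pid;
}
```

Anything else that forks in the same pid namespace between the write and the
fork() can still take the pid, which is why this path is racy.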

Looking at the code, we require CAP_SYS_ADMIN in the ->user_ns of each pid
namespace we want to set_tid in.
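
To make the set_tid rule concrete, here is a rough sketch of a clone3 call
that asks for specific pids. It assumes a kernel with set_tid support. The
struct is a local copy of the first 80 bytes of the kernel's struct clone_args
so that old headers still compile, the fallback syscall number 435 is an
assumption for common architectures, and clone3_with_pids() is a name invented
for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435          /* assumption: number on common arches */
#endif

/* Local copy of struct clone_args up to and including set_tid_size. */
struct clone3_args {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
    uint64_t set_tid;           /* pointer to an array of pid_t */
    uint64_t set_tid_size;      /* number of entries in that array */
};

/* Fork a child with requested pids. pids[0] is the pid wanted in the
 * innermost pid namespace, pids[1] the pid in its parent, and so on.
 * The kernel checks the capability for every namespace listed.
 * Returns the child's pid as seen by the caller, or -errno. */
long clone3_with_pids(const pid_t *pids, size_t levels)
{
    struct clone3_args args = {0};
    long ret;

    assert(sizeof(args) == 80);
    args.exit_signal = SIGCHLD;
    args.set_tid = (uint64_t)(uintptr_t)pids;
    args.set_tid_size = levels;

    ret = syscall(SYS_clone3, &args, sizeof(args));
    if (ret < 0)
        return -errno;          /* e.g. EPERM, EEXIST, ENOSYS */
    if (ret == 0)
        _exit(0);               /* child: nothing to do in this sketch */
    waitpid((pid_t)ret, NULL, 0);
    return ret;
}
```

With levels == 2 the caller needs the capability in the user namespaces owning
both pid namespaces, which is exactly where a task inside a nested userns gets
stuck.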

If we at some point revive the work on migrating nested pid and user
namespaces (https://lists.openvz.org/pipermail/criu/2017-April/037028.html) and
want to use the less racy clone3 set_tid path for it, we would surely face a
problem with the current permissions. Imagine we need to fork the init of a
nested pid namespace that is owned by a nested userns. We would call clone3
from the nested userns (there is no other way to set up proper ownership), so
we would not have the capability to set the pid in, e.g., the root pidns of the
CT, which is owned by the root userns of the CT.

That is sad.

>
> I don't think it's actually useful to enforce such a security rule. We can
> achieve the same without privileges. Essentially, by doing many fork()s, we can
> cycle through pids at a rate of 100,000 pids/s. Here's a tool I wrote using
> this technique: https://github.com/twosigma/set_ns_last_pid
> The default value for pid_max is 32768, so it takes ~300ms to cycle through all
> pids. This suggests that a program can easily bypass CAP_SYS_ADMIN to control
> the next pid. This approach has the limitation that the pids wrap to 300
> (defined as RESERVED_PIDS in the kernel), instead of pid 1, so this technique is
> ineffective for controlling pids smaller than 300.
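
The numbers in the quoted paragraph can be checked with a small sketch. It is
an idealized model: it assumes no pid in the range is in use and nobody else
forks meanwhile. The function names are invented here and are not taken from
the set_ns_last_pid tool.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RESERVED_PIDS 300       /* wrap-around floor from the quoted text */

/* Number of forks until a child gets pid `target`, given the last pid
 * handed out. Returns -1 if the target cannot be reached by cycling. */
long forks_needed(long last_pid, long target, long pid_max)
{
    assert(pid_max > RESERVED_PIDS);
    if (target < 1 || target >= pid_max)
        return -1;
    if (last_pid < 0 || last_pid >= pid_max)
        return -1;
    if (target > last_pid)
        return target - last_pid;       /* no wrap needed */
    if (target < RESERVED_PIDS)
        return -1;                      /* wrap lands on 300, not below */
    /* Run up to pid_max - 1, wrap to 300, then climb to the target. */
    return (pid_max - last_pid) + (target - RESERVED_PIDS);
}

/* Wall-clock estimate in milliseconds at a given fork rate. */
long cycle_ms(long forks, long forks_per_sec)
{
    return forks * 1000 / forks_per_sec;
}

/* Fork short-lived children until one of them gets target - 1, so the
 * next fork is likely to get `target`. Returns 1 on success, 0 if
 * max_forks was used up, -1 if fork() failed. */
int cycle_to_pid(pid_t target, long max_forks)
{
    long i;

    for (i = 0; i < max_forks; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);                   /* child exits immediately */
        waitpid(pid, NULL, 0);
        if (pid == target - 1)
            return 1;
    }
    return 0;
}
```

For example, forks_needed(1000, 1000, 32768) is 32468, a full cycle, and
cycle_ms(32468, 100000) is 324, in line with the ~300ms figure above.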

For nested pid namespaces you would likely never get the proper pids
on each level with this strategy.

>
> Per my research, this pid wrap-around at 300 was implemented to prevent a user
> from DoS-ing the machine. A user could use all the pids on the machine,
> preventing an administrator from logging in. Only the root user would be
> able to allocate pids < 300. In current kernels, the root user gets pid=300 on
> wrap-around too, so this RESERVED_PIDS logic serves no purpose.
>
> 1) I was wondering if we can make the case on the kernel mailing list that the
>    CAP_SYS_ADMIN check could be removed. I suggest the following proposals:
>
>    a) Any process can control the next pids of its current and nested namespaces,
>       without CAP_SYS_ADMIN.
>
>    If this is too strong, then I suggest the following:
>
>    b) Only let the init process (pid=1) control the next pids without
>       privileges. Other processes would still need CAP_SYS_ADMIN.
>
> 2) Orthogonally, I suggest taking out RESERVED_PIDS and letting pids wrap at 1.
>
> What do you guys think?
>
> I'm happy to write the kernel patches.
>
> Nico