[CRIU] The case for unprivileged writes to ns_last_pid

Adrian Reber adrian at lisas.de
Thu Feb 6 10:15:34 MSK 2020


On Wed, Feb 05, 2020 at 08:49:59PM +0000, Nicolas Viennot wrote:
> I would like to get us closer to running CRIU without requiring privileged
> permissions. One of the roadblocks is writing to /proc/sys/kernel/ns_last_pid.
> The kernel requires CAP_SYS_ADMIN in the current pid namespace. This is also
> true for sys_clone3() with set_tid.
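
For readers unfamiliar with the interface under discussion, here is a minimal
sketch of how the next pid is requested through ns_last_pid: the kernel hands
out last_pid + 1, so one writes the wanted pid minus one. The helper name
write_last_pid() and its path parameter are hypothetical, chosen only for
illustration. This is not CRIU's actual code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask for `next_pid` by writing next_pid - 1 to `path`.
 * In real use `path` would be "/proc/sys/kernel/ns_last_pid"; it is a
 * parameter here so the sketch can be tried on an ordinary file.
 * Returns 0 on success, -1 on failure (errno is set by open/write).
 */
static int write_last_pid(const char *path, pid_t next_pid)
{
    char buf[32];
    int len, fd, ret = 0;

    assert(next_pid >= 1);
    len = snprintf(buf, sizeof(buf), "%d", (int)(next_pid - 1));
    assert(len > 0 && len < (int)sizeof(buf));

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;              /* typically EACCES when unprivileged */
    if (write(fd, buf, (size_t)len) != len)
        ret = -1;               /* typically EPERM without CAP_SYS_ADMIN */
    close(fd);
    return ret;
}
```

Without CAP_SYS_ADMIN, a call such as
write_last_pid("/proc/sys/kernel/ns_last_pid", 1234) is expected to fail. Even
when it succeeds, another process may fork between the write and the caller's
own fork() and take the requested pid.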

Thanks for starting this discussion. This comes up regularly and it was
again a topic at last year's Linux Plumbers Conference.

My first implementation of clone3() with set_tid did not require
CAP_SYS_ADMIN, precisely to get rid of that requirement. Unfortunately,
it was not possible to get the clone3() with set_tid patch into the
kernel without CAP_SYS_ADMIN. My goal at that time was definitely to
pick this up again at some later point.
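
As an illustration of the interface mentioned above, here is a minimal sketch
of calling clone3() with set_tid to request a specific pid for the child. The
struct name clone3_args_v1 and the helper fork_with_pid() are hypothetical.
The struct is declared locally (80 bytes, the layout that introduced set_tid)
so the sketch does not depend on recent kernel headers, and the fallback
syscall number 435 is an assumption that holds on x86_64 and most other
architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435   /* assumption: generic syscall number */
#endif

/* Local copy of the clone3() argument layout that includes set_tid. */
struct clone3_args_v1 {
    uint64_t flags, pidfd, child_tid, parent_tid, exit_signal;
    uint64_t stack, stack_size, tls;
    uint64_t set_tid;       /* pointer to an array of pid_t */
    uint64_t set_tid_size;  /* number of entries in that array */
};

/*
 * Hypothetical helper: fork a child that should get pid `wanted` in the
 * current pid namespace. Returns the child's pid in the parent, 0 in the
 * child, and -1 on error (EPERM without CAP_SYS_ADMIN, EEXIST if the pid
 * is already in use, E2BIG or ENOSYS on kernels without set_tid/clone3).
 */
static pid_t fork_with_pid(pid_t wanted)
{
    pid_t tids[1] = { wanted };
    struct clone3_args_v1 args;

    assert(sizeof(args) == 80);
    memset(&args, 0, sizeof(args));
    args.exit_signal = SIGCHLD;               /* behave like fork() */
    args.set_tid = (uint64_t)(uintptr_t)tids;
    args.set_tid_size = 1;                    /* one namespace level */

    return (pid_t)syscall(__NR_clone3, &args, sizeof(args));
}
```

Unlike writing to ns_last_pid followed by fork(), this requests the pid and
creates the process in a single call, so there is no window in which another
process can take the pid.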

At Linux Plumbers Conference, Google, who are using CRIU internally, also
said that they would be interested in something like CAP_RESTORE instead
of CAP_SYS_ADMIN. This way you could have it behind a capability, but not
one as broad as CAP_SYS_ADMIN.

> I don't think it's actually useful to enforce such a security rule. We can
> achieve the same without privileges. Essentially, by doing many fork()s, we can
> cycle through pids at a rate of 100,000 pid/s. Here's a tool I wrote using
> this technique: https://github.com/twosigma/set_ns_last_pid
> The default value for pid_max is 32768, so it takes ~300ms to cycle through all
> pids. This suggests that a program can easily bypass CAP_SYS_ADMIN to control
> the next pid. This approach has the limitation that the pids wrap to 300
> (defined as RESERVED_PIDS in the kernel), instead of pid 1, so this technique
> cannot control pids smaller than 300.

Concerning pid_max: on the Fedora systems I am using, this has already
been bumped to 4194304.
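
To put the two pid_max values side by side, here is a small sketch of the
arithmetic, assuming the rate of 100,000 pid/s quoted above stays constant
(an assumption, since the real rate depends on the machine). The helper names
are hypothetical. With pid_max = 32768 a full cycle takes about 0.33 s; with
pid_max = 4194304 it takes about 42 s.

```c
#include <assert.h>
#include <stdio.h>

/* Milliseconds needed to cycle through `pid_max` pids at `forks_per_sec`. */
static long long cycle_time_ms(long long pid_max, long long forks_per_sec)
{
    assert(pid_max > 0 && forks_per_sec > 0);
    return pid_max * 1000 / forks_per_sec;
}

/* Read the running system's pid_max; returns -1 if it cannot be read. */
static long long read_pid_max(void)
{
    long long value = -1;
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Passing the result of read_pid_max() to cycle_time_ms() gives the estimate
for the local machine. The fork-based approach still works with the larger
pid_max, but a full cycle takes roughly 128 times longer.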

> Per my research, this pid wrap-around at 300 was implemented to prevent a user
> from DoS-ing the machine. A user could use all the pids on the machine,
> preventing an administrator from logging in to the machine. Only the root user
> would be able to allocate pids < 300. In current kernels, the root user gets
> pid=300 on wrap-around, so this RESERVED_PIDS logic serves no purpose.
> 
> 1) I was wondering if we can make the case on the kernel mailing list that the
>    CAP_SYS_ADMIN check could be removed. I suggest the following proposals:
> 
>    a) Any process can control the next pids of its current and nested namespaces,
>       without CAP_SYS_ADMIN.
> 
>    If this is too strong, then I suggest the following:
> 
>    b) Let only the init process (pid=1) control the next pids without
>       privileges. Other processes still need CAP_SYS_ADMIN.
> 
> 2) Orthogonally, I suggest taking out RESERVED_PIDS and letting pids wrap at 1.
> 
> What do you guys think?

Go for it!

> I'm happy to write the kernel patches.

I think completely removing any capability will not be accepted. You
can still try it. I like the idea of having a dedicated capability for
checkpoint/restore, which could then be added to the CRIU binary.

		Adrian

