[Devel] [PATCH rh7] cgroups: Drop virtualization code, v4

Thu May 7 03:29:59 PDT 2015

On Thu, May 07, 2015 at 01:17:27PM +0300, Vladimir Davydov wrote:
> > We create the cgroups for a container on ve0 but bind mount them
> > from inside the container, so at the userspace level (via the config
> > file) we can set up which cgroups are allowed for use. Still, we do
> > not limit the creation of new sub-cgroups (via mkdir) inside the
> > container in any way, and this should mainly be a performance penalty
> > (a new cgroup is allocated via a direct kzalloc without
> >  any memory limits, as far as I understand).
> 
> Actually, it is accounted to memcg, just like any kmalloc, but the
> problem isn't that we miss accounting. The problem is that the more

I see, it's deep inside the slab/slub code, thanks.

> features we allow to use from inside a container, the more different
> types of kernel objects a container can create, the more potential
> security issues we have. E.g. on reclaim the kernel walks over all
> memory cgroups, as a result a container user can try to DOS the node by
> creating thousands of cgroups.

So maybe we should limit the number of nested cgroups in a container?
There is root->number_of_cgroups; maybe we should set up some limit
in the ve config.
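
To make the idea concrete, here is a rough userspace model of such a
check, not actual kernel code. Only number_of_cgroups comes from this
thread; the struct names, the max_cgroups setting and the -ENOSPC
return value are made up for illustration.

```c
#include <errno.h>

/* Userspace model; only number_of_cgroups is named in this thread. */
struct cgroup_root_model {
	int number_of_cgroups;	/* cgroups currently alive in this hierarchy */
};

/* Hypothetical per-container settings, e.g. taken from the ve config. */
struct ve_model {
	int is_ve0;		/* non-zero for the host (ve0) */
	int max_cgroups;	/* hypothetical limit; 0 means "no limit" */
};

/* Returns 0 if a new cgroup may be created, -ENOSPC otherwise. */
int ve_cgroup_may_create(const struct ve_model *ve,
			 const struct cgroup_root_model *root)
{
	if (ve->is_ve0 || ve->max_cgroups == 0)
		return 0;
	if (root->number_of_cgroups >= ve->max_cgroups)
		return -ENOSPC;
	return 0;
}

/* Model of the mkdir path: check the limit, then account the new cgroup. */
int ve_cgroup_mkdir(const struct ve_model *ve, struct cgroup_root_model *root)
{
	int err = ve_cgroup_may_create(ve, root);

	if (err)
		return err;
	root->number_of_cgroups++;
	return 0;
}
```

As far as I can tell the counter is per hierarchy, not per container,
so a real per-container limit would likely need its own counter.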

> > Thus, while we can limit the cgroup set itself, I don't see an easy way
> > to limit nested cgroups/dirs without additional kernel modification. Ideas?
> 
> Let me clarify. Currently, we agreed on the following scheme:
> 
>  - There is a parameter in the config of a CT about which controllers to
>    bind mount inside the CT. By default, if there is no such parameter,
>    the userspace mounts all cgroups except our home-brewed ones (ve,
>    beancounter). Note, it is about the userspace only, the kernel knows
>    nothing about it.

yes

>  - If a cgroup is bind mounted, the user of the container can play with
>    cgroups without any limitations. It is all about trust, in fact. If
>    you cannot trust a container, just disable bind mounting altogether
>    in the config.
> 
>  - There is only one exception to the previous rule, though. Even if we
>    trust the container, we obviously don't want it to tweak its own
>    parameters that are set via cgroups (e.g. its memory and swap
>    limits), i.e. we should disallow it from writing to files in its
>    bind-mounted root. This should be done unconditionally by the kernel.
>    Just disallow processes inside ve != ve0 from writing to files of any
>    top-level cgroup.

Yes, I'm testing it.
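
For reference, the rule can be modelled in userspace roughly as below.
This is only a sketch of the policy, not the patch under test. It
assumes that "top-level" means a cgroup whose parent is the hierarchy
root; the struct, the function names and the -EPERM return value are
hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace model of a cgroup; parent is NULL for the hierarchy root. */
struct cgroup_model {
	const struct cgroup_model *parent;
};

/* Assumption: "top-level" means a direct child of the hierarchy root. */
int cgroup_is_top_level(const struct cgroup_model *cg)
{
	return cg->parent != NULL && cg->parent->parent == NULL;
}

/*
 * Returns 0 if a process in container veid may write to a file of cg,
 * -EPERM otherwise. veid == 0 stands for ve0, which is never restricted.
 */
int ve_cgroup_may_write(int veid, const struct cgroup_model *cg)
{
	if (veid == 0)
		return 0;
	if (cgroup_is_top_level(cg))
		return -EPERM;
	return 0;
}
```

With this model ve0 can still write the limits of the container's
cgroup, while the container itself can only write to files of the
sub-cgroups it created below its bind-mounted root.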

> 
> Hope this clears things up.
> 
> A question still remains what to do with the /proc/cgroups file - we
> should hide cgroups that are not bind mounted inside the CT there. This
> may be done by bind mounting this file itself. Again, up to the
> userspace.

OK, once I finish with the previous item I will get back to this one.
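
A possible userspace approach, as a sketch only: write a filtered copy
of /proc/cgroups that keeps the header and the allowed controllers, then
bind mount that copy over the file as seen by the container. The
function names and the idea of a separate filtered file are assumptions
for illustration; mount(2) with MS_BIND is the only real interface used.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Return 1 if a /proc/cgroups line should stay visible in the container. */
int keep_line(const char *line, const char *const *allowed, size_t n)
{
	size_t len = strcspn(line, " \t\n");	/* length of the controller name */
	size_t i;

	if (line[0] == '#')			/* header line */
		return 1;
	for (i = 0; i < n; i++)
		if (strlen(allowed[i]) == len && !strncmp(line, allowed[i], len))
			return 1;
	return 0;
}

/* Copy src to dst, dropping controllers that are not in the allowed list. */
int filter_proc_cgroups(const char *src, const char *dst,
			const char *const *allowed, size_t n)
{
	char line[256];
	FILE *in = fopen(src, "r");
	FILE *out;

	if (!in)
		return -1;
	out = fopen(dst, "w");
	if (!out) {
		fclose(in);
		return -1;
	}
	while (fgets(line, sizeof(line), in))
		if (keep_line(line, allowed, n))
			fputs(line, out);
	fclose(in);
	return fclose(out) ? -1 : 0;
}

/* Bind mount the filtered copy over the /proc/cgroups seen by the CT. */
int hide_proc_cgroups(const char *filtered, const char *target)
{
	return mount(filtered, target, NULL, MS_BIND, NULL);
}
```

The copy is a snapshot, so the num_cgroups column would go stale inside
the container; whether that matters is up to the userspace.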