[Devel] [PATCH rh7] cgroups: Drop virtualization code, v4

Thu May 7 03:17:27 PDT 2015

On Thu, May 07, 2015 at 12:40:41PM +0300, Cyrill Gorcunov wrote:
> On Thu, May 07, 2015 at 12:12:37PM +0300, Cyrill Gorcunov wrote:
> > > > 
> > > > At the moment we don't, but it looks like we need to add a check
> > > > that the cgroup being modified is not a top one when the write
> > > > happens from inside a container, maybe?
> > > 
> > > I guess so.
> > > 
> > > Besides, I think we should not bind mount all cgroups inside any
> > > container, because allowing a container to create an arbitrary number of
> > > cgroups can affect the overall performance badly. IMO this should be
> > > configured in the config file of a container.
> > 
> > I see, thanks. Let me think about it.
> 
> We're creating cgroups for the container on ve0 but bind mounting them
> from inside the container, thus at the userspace level (via the config
> file) we can set up which cgroups are allowed for use. Still, we're not
> limiting in any way the creation of new sub-cgroups (via mkdir) inside
> the container, and this should mainly be a performance penalty
> (new cgroup allocation is done via direct kzalloc without
> any memory limits, as far as I understand).

Actually, it is accounted to memcg, just like any kmalloc, but the
problem isn't that we miss accounting. The problem is that the more
features we allow to use from inside a container, the more different
types of kernel objects a container can create, the more potential
security issues we have. E.g. on reclaim the kernel walks over all
memory cgroups, so a container user can try to DoS the node by
creating thousands of cgroups.

> Thus while we can limit the cgroup set itself, I don't see an easy way
> to limit nested cgroups/dirs without additional kernel modification. Ideas?

Let me clarify. Currently, we agreed on the following scheme:

 - There is a parameter in the config of a CT that says which controllers
   to bind mount inside the CT. By default, if there is no such parameter,
   the userspace mounts all cgroups except our home-brewed ones (ve,
   beancounter). Note that this concerns userspace only; the kernel knows
   nothing about it.

 - If a cgroup is bind mounted, the user of the container can play with
   cgroups without any limitations. It is all about trust, in fact. If
   you cannot trust a container, just disable bind mounting altogether
   in the config.

 - There is only one exception to the previous rule, though. Even if we
   trust the container, we obviously don't want it to tweak its own
   parameters that are set via cgroups (e.g. its memory and swap
   limits), i.e. we should prevent it from writing to files in its
   bind-mounted root. This should be done unconditionally by the kernel.
   Just disallow processes inside ve != ve0 from writing to files of any
   top-level cgroup.
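
To make the scheme concrete, here is a rough Python model of the three
rules. It is only a sketch: the function names are hypothetical, the
config parameter is modelled as an optional list, and cgroup paths are
assumed to be given as seen from inside the container, so "/" stands for
the bind-mounted root. None of this is actual kernel or vzctl code.

```python
# Illustrative model only; these names are hypothetical, not kernel or vzctl APIs.
HOME_BREWED = {"ve", "beancounter"}  # excluded by default, as described above


def controllers_to_bind_mount(available, configured=None):
    """Return the controllers userspace would bind mount inside the CT.

    `configured` models the per-CT config parameter; None means it is absent.
    An empty list models "bind mounting disabled altogether".
    """
    if configured is None:
        return [c for c in available if c not in HOME_BREWED]
    return [c for c in available if c in configured]


def may_write(ve_id, cgroup_path):
    """Model of the unconditional rule for writes to cgroup files.

    `cgroup_path` is assumed to be relative to the CT's bind-mounted root,
    so "/" is the container's own top-level cgroup.
    """
    is_top_level = cgroup_path.strip("/") == ""
    # ve0 may write anywhere; a container may write only below its root.
    return ve_id == 0 or not is_top_level
```

With this model, a container (ve != 0) can freely manage a sub-cgroup
such as "/sub", but a write to a file in "/" is refused regardless of
how much the container is trusted.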

Hope this clears things up.

The question remains of what to do with the /proc/cgroups file: we
should hide cgroups that are not bind mounted inside the CT there. This
may be done by bind mounting this file itself. Again, this is up to
userspace.
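
As a sketch of the userspace side, the content for such a replacement
file could be produced by filtering the host's /proc/cgroups down to the
controllers that are visible in the CT. The function name and the
`visible` argument are assumptions made for illustration; where the
result is stored and how it gets bind mounted is left to the tooling.

```python
def filter_proc_cgroups(text, visible):
    """Keep the header line and only rows whose controller is in `visible`.

    `text` is the content of /proc/cgroups: a '#'-prefixed header followed
    by one whitespace-separated row per controller, name in the first column.
    """
    kept = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("#") or (fields and fields[0] in visible):
            kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up sample in the /proc/cgroups layout, for demonstration only.
SAMPLE = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpu\t2\t5\t1\n"
    "memory\t3\t7\t1\n"
    "ve\t4\t2\t1\n"
)

if __name__ == "__main__":
    print(filter_proc_cgroups(SAMPLE, {"cpu", "memory"}), end="")
```

The kernel stays unaware of the filtering, which matches the idea that
the choice of visible controllers is purely a userspace decision.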