[Devel] [PATCH rh7] cgroups: Drop virtualization code, v4
Cyrill Gorcunov
gorcunov at odin.com
Thu May 7 03:29:59 PDT 2015
On Thu, May 07, 2015 at 01:17:27PM +0300, Vladimir Davydov wrote:
> > We're creating cgroups for the container on ve0 but bind-mount them
> > from inside the container, so at the userspace level (via the config
> > file) we can set up which cgroups are allowed for use. Still, we don't
> > limit in any way the creation of new sub-cgroups (via mkdir) inside
> > the container, and this should mainly be a performance penalty
> > (new cgroup allocation is done via a direct kzalloc without
> > any memory limits, as far as I understand).
>
> Actually, it is accounted to memcg, just like any kmalloc, but the
> problem isn't that we miss accounting. The problem is that the more
I see, it's deep inside of slab/slub code, thanks.
> features we allow to use from inside a container, the more different
> types of kernel objects a container can create, the more potential
> security issues we have. E.g. on reclaim the kernel walks over all
> memory cgroups, as a result a container user can try to DOS the node by
> creating thousands of cgroups.
So maybe we should limit the number of nested cgroups in a container?
There is root->number_of_cgroups; maybe we could set up some limit
on it in the ve config.
> > Thus while we can limit the cgroup set itself, I don't see an easy way
> > to limit nested cgroups/dirs without additional kernel modification.
> > Ideas?
>
> Let me clarify. Currently, we agreed on the following scheme:
>
> - There is a parameter in the config of a CT about which controllers to
>   bind mount inside the CT. By default, if there is no such parameter,
>   the userspace mounts all cgroups except our home-brewed ones (ve,
>   beancounter). Note, it is about the userspace only, the kernel knows
>   nothing about it.
yes
> - If a cgroup is bind mounted, the user of the container can play with
>   cgroups without any limitations. It is all about trust, in fact. If
>   you cannot trust a container, just disable bind mounting altogether
>   in the config.
>
> - There is only one exception to the previous rule though. Even if we
>   trust the container, we obviously don't want it to tweak its own
>   parameters that are set via cgroups (e.g. its memory and swap
>   limits), i.e. we should disallow it from writing to files in its
>   bind-mounted root. This should be done unconditionally by the kernel.
>   Just disallow processes inside ve != ve0 from writing to files of any
>   top-level cgroup.
Yes, I'm testing it.
>
> Hope this clears things up.
>
> A question still remains what to do with the /proc/cgroups file - we
> should hide cgroups that are not bind mounted inside the CT there. This
> may be done by bind mounting this file itself. Again, up to the
> userspace.
OK, once I finish with the previous item I'll get back to this one.
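
For the userspace side, one possible shape is to generate a filtered copy
of /proc/cgroups and bind mount that copy over the file inside the CT. The
sketch below covers only the filtering step. The function names and the
allowed-controller list are hypothetical, and it assumes the usual
tab-separated layout with a header line starting with '#'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if the first name_len bytes of line match one of allowed[]. */
static int name_allowed(const char *line, size_t name_len,
			const char *const *allowed, size_t n_allowed)
{
	for (size_t i = 0; i < n_allowed; i++)
		if (strlen(allowed[i]) == name_len &&
		    memcmp(allowed[i], line, name_len) == 0)
			return 1;
	return 0;
}

/*
 * Copy header lines ('#') and lines whose first field is an allowed
 * controller from in to out. Returns the number of bytes written
 * (excluding the trailing NUL), or -1 if out is too small.
 */
static long filter_proc_cgroups(const char *in, char *out, size_t out_size,
				const char *const *allowed, size_t n_allowed)
{
	size_t used = 0;

	assert(in != NULL && out != NULL && out_size > 0);
	while (*in != '\0') {
		const char *eol = strchr(in, '\n');
		size_t line_len = eol ? (size_t)(eol - in) + 1 : strlen(in);
		size_t name_len = strcspn(in, "\t\n");

		if (*in == '#' ||
		    name_allowed(in, name_len, allowed, n_allowed)) {
			if (used + line_len >= out_size)
				return -1;
			memcpy(out + used, in, line_len);
			used += line_len;
		}
		in += line_len;
	}
	out[used] = '\0';
	return (long)used;
}
```

The result would be written to a regular file which the userspace then
bind mounts over /proc/cgroups in the container; the kernel stays unaware
of the filtering.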