[Devel] [PATCH v2] cgroup/cpuset: emulate cgroup in container

Thu Dec 14 11:08:41 MSK 2017

Another idea: maybe we should not fake the cpuset cgroup at all, but 
simply allow this controller in the container?

The main idea behind faking/hiding cpuset was that cpuset is not 
virtualized (we don't have virtual processors), so a container can bind 
itself to physical CPUs and memory nodes. If several containers bind to 
the same CPU, they end up competing for that CPU's resources, which can 
badly hurt performance. https://jira.sw.ru/browse/PSBM-30541

But AFAICS performance degrades only for containers that set up cpuset 
badly; all others are still scheduled on all cores and are fine. So by 
hiding cpuset we are only protecting customers from themselves.

We could even add a feature to enable/disable cpuset per CT: e.g. vzctl 
sets ve.cpuset_enabled in the ve cgroup before the CT starts, and after 
that ctinit, running in the ve cgroup, mounts cpuset in the CT if it is 
listed in /proc/cgroups. Note that we would also need to do the same on 
CRIU restore.
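
To make the "if it is listed in /proc/cgroups" check concrete, here is a 
minimal userspace sketch. It assumes the usual four-column /proc/cgroups 
layout (subsys_name, hierarchy, num_cgroups, enabled); the function name 
cgroup_controller_enabled() is hypothetical and is not taken from ctinit 
or vzctl.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if controller `name` is listed and enabled in a stream that
 * uses the /proc/cgroups layout, 0 otherwise.
 *
 * Assumed layout: a header line starting with '#', then one line per
 * controller: "<subsys_name> <hierarchy> <num_cgroups> <enabled>".
 * The function name is hypothetical, for illustration only.
 */
int cgroup_controller_enabled(FILE *f, const char *name)
{
	char line[256];
	char subsys[64];
	int hierarchy, num_cgroups, enabled;

	assert(f != NULL && name != NULL);

	while (fgets(line, sizeof(line), f)) {
		if (line[0] == '#')
			continue;	/* skip the header line */
		if (sscanf(line, "%63s %d %d %d", subsys,
			   &hierarchy, &num_cgroups, &enabled) != 4)
			continue;	/* skip lines we cannot parse */
		if (strcmp(subsys, name) == 0)
			return enabled != 0;
	}
	return 0;	/* controller is not listed at all */
}
```

A caller could fopen("/proc/cgroups", "r"), pass the stream together with 
"cpuset", and attempt the mount only when the result is 1.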

On 12/13/2017 07:52 PM, Stanislav Kinsburskiy wrote:
> Any changes to this cgroup are skipped in container, but success code is
> returned.
> The idea is to fool Docker/Kubernetes.
> 
> https://jira.sw.ru/browse/PSBM-58423
> 
> This patch obsoletes "ve/proc/cpuset: do not show cpuset in CT"
> 
> v2:
> Do not attach tasks in cpuset_change_cpumask on cpuset set change, it
> requested from non-super VE.
> This is a second part of the logic.
> The first was to not change cpuset for newly added task. This one - to not
> set new cpuset for all the tasks in cgroup
> 
> Signed-off-by: Stanislav Kinsburskiy <skinsbursky at virtuozzo.com>
> ---
>   kernel/cpuset.c |   12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 26d88eb..43b1410 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -848,6 +848,9 @@ static int cpuset_test_cpumask(struct task_struct *tsk,
>   static void cpuset_change_cpumask(struct task_struct *tsk,
>   				  struct cgroup_scanner *scan)
>   {
> +	if (!ve_is_super(get_exec_env()))
> +		return;
> +

We likely have to do the same for the nodemask too if we choose to fake 
the cpuset cgroup, and maybe for some of the other files:

ls /sys/fs/cgroup/cpuset/cpuset.*
/sys/fs/cgroup/cpuset/cpuset.cpu_exclusive
/sys/fs/cgroup/cpuset/cpuset.cpus
/sys/fs/cgroup/cpuset/cpuset.mem_exclusive
/sys/fs/cgroup/cpuset/cpuset.mem_hardwall
/sys/fs/cgroup/cpuset/cpuset.memory_migrate
/sys/fs/cgroup/cpuset/cpuset.memory_pressure
/sys/fs/cgroup/cpuset/cpuset.memory_pressure_enabled
/sys/fs/cgroup/cpuset/cpuset.memory_spread_page
/sys/fs/cgroup/cpuset/cpuset.memory_spread_slab
/sys/fs/cgroup/cpuset/cpuset.mems
/sys/fs/cgroup/cpuset/cpuset.sched_load_balance
/sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_leve

>   	set_cpus_allowed_ptr(tsk, ((cgroup_cs(scan->cg))->cpus_allowed));
>   }
>   
> @@ -1441,6 +1444,9 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
>   	struct task_struct *task;
>   	int ret;
>   
> +	if (!ve_is_super(get_exec_env()))
> +		return 0;
> +
>   	mutex_lock(&cpuset_mutex);
>   
>   	ret = -ENOSPC;
> @@ -1470,6 +1476,9 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
>   static void cpuset_cancel_attach(struct cgroup *cgrp,
>   				 struct cgroup_taskset *tset)
>   {
> +	if (!ve_is_super(get_exec_env()))
> +		return;
> +
>   	mutex_lock(&cpuset_mutex);
>   	cgroup_cs(cgrp)->attach_in_progress--;
>   	mutex_unlock(&cpuset_mutex);
> @@ -1494,6 +1503,9 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
>   	struct cpuset *cs = cgroup_cs(cgrp);
>   	struct cpuset *oldcs = cgroup_cs(oldcgrp);
>   
> +	if (!ve_is_super(get_exec_env()))
> +		return;
> +
>   	mutex_lock(&cpuset_mutex);
>   
>   	/* prepare for attach */
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.