[Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

Tue Jun 16 08:19:44 PDT 2015

В Чт, 14/05/2015 в 19:52 +0300, Cyrill Gorcunov пишет:
> In vzctl/libvzctl bundle we restore container like
> 
>  - create ve/$ctid cgroup
>  - move self into this cgroup
>  - run criu from inside
> 
> So that kernel code passes ve_can_attach test. In turn for
> our P.Haul project (which is managing live migration) the
> situation is different -- it opens ve/$ctid but moves
> criu service pid instead (so that the service will
> start restore procedure). Which leads to situation
> where ve_can_attach fails with -EINVAL.
> 
> Reported-by: Nikita Spiridonov <nspiridonov at odin.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov at odin.com>
> CC: Vladimir Davydov <vdavydov at odin.com>
> CC: Konstantin Khorenko <khorenko at odin.com>
> CC: Pavel Emelyanov <xemul at odin.com>
> CC: Andrey Vagin <avagin at odin.com>
> ---
> 
> Guys, could you please take a look, especially from
> security POV, is it safe to remove all these checks?
> 
>  kernel/ve/ve.c |   31 +++++++++++++------------------
>  1 file changed, 13 insertions(+), 18 deletions(-)
> 
> Index: linux-pcs7.git/kernel/ve/ve.c
> ===================================================================
> --- linux-pcs7.git.orig/kernel/ve/ve.c
> +++ linux-pcs7.git/kernel/ve/ve.c
> @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg
>  static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>  {
>  	struct ve_struct *ve = cgroup_ve(cg);
> -	struct task_struct *task = current;
> -
> -	if (cgroup_taskset_size(tset) != 1 ||
> -	    cgroup_taskset_first(tset) != task ||
> -	    !thread_group_leader(task) ||
> -	    !thread_group_empty(task))
> -		return -EINVAL;

This patch brings a couple of problems. The first one is if we're setting
a ve cgroup for a forking task, it's possible the parent and the child
fall into different cgroups. For example, parent is inside CT, while
child is in ve0.

The second problem is that it is racy with all places, where we're doing
get_exec_env() and hoping on a stable result. There are a lot of places,
we use it, so every place need too be checked.

The first problem can be solved using cgroup_postfork() method, while
the second is worse. The number of get_exec_env() is big, and it tends
to increase.  We could check all of them, but an universal solutions
will be better. But it seems there is no easy universal solution.

We can implement a task lock and use get_exec_env() under it. 
We can use something like rcu_read_{lock,unlock} braces around get_exec_env(),
and use synchronize_rcu() in cgroup_postfork().

The two solutions above are examples what we can do. What do you think?

>  
>  	if (ve->is_locked)
>  		return -EBUSY;
> @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup *
>  static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>  {
>  	struct ve_struct *ve = cgroup_ve(cg);
> -	struct task_struct *tsk = current;
> -
> -	/* this probihibts ptracing of task entered to VE from host system */
> -	if (ve->is_running && tsk->mm)
> -		tsk->mm->vps_dumpable = VD_VE_ENTER_TASK;
> +	struct task_struct *tsk;
>  
> -	/* Drop OOM protection. */
> -	tsk->signal->oom_score_adj = 0;
> -	tsk->signal->oom_score_adj_min = 0;
> +	cgroup_taskset_for_each(tsk, cg, tset) {
> +		/* this probihibts ptracing of task entered to VE from host system */
> +		if (ve->is_running && tsk->mm)
> +			tsk->mm->vps_dumpable = VD_VE_ENTER_TASK;
> +
> +		/* Drop OOM protection. */
> +		tsk->signal->oom_score_adj = 0;
> +		tsk->signal->oom_score_adj_min = 0;
>  
> -	/* Leave parent exec domain */
> -	tsk->parent_exec_id--;
> +		/* Leave parent exec domain */
> +		tsk->parent_exec_id--;
>  
> -	tsk->task_ve = ve;
> +		tsk->task_ve = ve;
> +	}
>  }
>  
>  static int ve_state_read(struct cgroup *cg, struct cftype *cft,
> _______________________________________________

Kirill