[Devel] [PATCH rh7 1/3] ve: Attach a ve cgroup from task_work

Vladimir Davydov vdavydov at virtuozzo.com
Wed Oct 28 05:40:51 PDT 2015


On Wed, Oct 28, 2015 at 12:53:44PM +0300, Kirill Tkhai wrote:
> On 27.10.2015 19:13, Vladimir Davydov wrote:
> > On Fri, Oct 16, 2015 at 07:24:49PM +0300, Kirill Tkhai wrote:
> > ...
> >> @@ -733,6 +742,13 @@ static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
> >>  		}
> >>  	}
> >>  
> >> +	cgroup_taskset_for_each(task, cg, tset) {
> >> +		/* Attaching is in process */
> >> +		if (task->task_ve_work.func != NULL)
> >> +			return -EAGAIN;
> > 
> > Failing write to cgroup.procs with EAGAIN looks unexpected. For sure,
> > nobody will retry.
> 
> This is a protection from circular attaches, and real vzctl never bumps into this
> case. But the kernel should be safe under any userspace behaviour.

Wouldn't it be better to just cancel the pending work if it happens
instead of returning an error to userspace?
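A minimal sketch of that alternative, assuming the work item stays embedded in task_struct as in the patch and that task_work_cancel() is available (it is in 3.10-era kernels). Names follow the patch; locking and error paths are elided, so this is an illustration, not a drop-in fix:

```c
/* Hypothetical variant of ve_can_attach(): instead of failing with
 * -EAGAIN when a previous attach is still pending, cancel the pending
 * work item and let the new attach proceed. */
cgroup_taskset_for_each(task, cg, tset) {
	if (task->task_ve_work.func != NULL) {
		/*
		 * task_work_cancel() returns the cancelled callback_head,
		 * or NULL if the work already ran or was never queued.
		 */
		if (task_work_cancel(task, ve_attach_work))
			task->task_ve_work.func = NULL;
	}
}
```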

> 
> EAGAIN is a correct return value for writing to cgroup.procs from userspace. For example,
> the latest mainstream pids_can_fork() uses it.

I don't see how it is related. pids_can_fork() is called from fork(),
which is known to return EAGAIN on resource shortage. What does EAGAIN
mean in your case? Try again later? Nobody will.

>  
> >> +	}
> >> +	smp_mb(); /* Pairs with smp_mb() in ve_attach_work() */
> >> +
> >>  	return 0;
> >>  }
> >>  
> >> @@ -753,7 +769,8 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
> >>  		/* Leave parent exec domain */
> >>  		task->parent_exec_id--;
> >>  
> >> -		task->task_ve = ve;
> >> +		init_task_work(&task->task_ve_work, ve_attach_work);
> > 
> > Why do you need to keep the work on task_struct? It's already huge, so
> > let's please try not to make it even bigger. I don't see any reason why
> > kmalloc-ing a task_work wouldn't suffice in this case. Moreover, AFAICS
> > it would liberate us from the necessity to use memory barriers and allow
> > us to avoid EAGAIN failures.
> 
> Current task_struct size is 3232 bytes, and the patchset adds 16 bytes more:
> 
> 16/3232 = 0.00495(495).
> 
> It does not look like the set dramatically increases task_struct size or affects
> the performance in any way.

Nice calculations, but no, it doesn't work for me. You add 16 bytes now.
Somebody else adds another 16 bytes, because she can't get on without
them. Another 16 bytes a bit later, and we finally have

3232 + 16 * 3 = 3280

Only 1% overhead, you say, but it means 10 objects won't fit in one slab
any more, so it turns out to be a 10% loss.

As for those 16 bytes I added for the same purpose: well, my bad. Nobody
stopped me, unfortunately. This definitely needs to be fixed as well, and
I will fix it once I have time.

That said, my point is that since we can easily avoid this wastage, we
definitely should.

> 
> At the same time kmalloc() complicates the code, because it forces us
> to introduce a .cancel_attach method, which does not exist now. It is
> needed to handle the ENOMEM situation (imagine a big multi-threaded process).

Those barriers complicate the code even more, if you ask me. If
you're lazy and don't want to handle ENOMEM, use __GFP_NOFAIL. It won't
make any difference, because small allocations don't fail anyway.
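A sketch of what the kmalloc() route could look like, assuming a small wrapper struct around the callback_head; the struct and function names here are hypothetical, not from the patch:

```c
/* Hypothetical per-attach allocation instead of embedding the work in
 * task_struct.  __GFP_NOFAIL means the small allocation never fails,
 * so no .cancel_attach unwinding path is needed. */
struct ve_attach_work {
	struct callback_head work;
	struct ve_struct *ve;
};

cgroup_taskset_for_each(task, cg, tset) {
	struct ve_attach_work *w;

	w = kmalloc(sizeof(*w), GFP_KERNEL | __GFP_NOFAIL);
	w->ve = ve;
	init_task_work(&w->work, ve_attach_work_fn);
	task_work_add(task, &w->work, true);
	/*
	 * ve_attach_work_fn() would do the actual switch and then
	 * kfree(container_of(head, struct ve_attach_work, work)).
	 */
}
```

Since the work item is no longer shared task_struct state, there is nothing for a concurrent attach to race on, which is what removes both the memory barriers and the EAGAIN case.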

> 
> > Hell, I placed ub's cgroup_attach_work in task_struct either, what a
> > mess...
> 
> Probably, we could use a single attach work in task_struct shared by ve and ub
> and overload it.

No, that was my mistake to place task_work in task_struct. I'll get rid
of that.

> 
> > Besides, I don't see you handle a race with fork (see ub_cgroup_fork).
> > Why?
> 
> Because the race is a BUG that existed long before my patches, and it is not
> connected with them. Don't worry, I haven't forgotten about it.
> 
> The series is about delayed attaching. Let's not mix everything together here.

The fix should only take a couple of lines, to the best of my knowledge,
so why not just do it right away?

>  
> >> +		task_work_add(task, &task->task_ve_work, true);
> >>  	}
> >>  }
> >>  
> >>
> 
