[Devel] [PATCH rh7 1/3] ve: Attach a ve cgroup from task_work

Kirill Tkhai ktkhai at odin.com
Wed Oct 28 07:36:48 PDT 2015


On 28.10.2015 15:40, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 12:53:44PM +0300, Kirill Tkhai wrote:
>> On 27.10.2015 19:13, Vladimir Davydov wrote:
>>> On Fri, Oct 16, 2015 at 07:24:49PM +0300, Kirill Tkhai wrote:
>>> ...
>>>> @@ -733,6 +742,13 @@ static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>>>>  		}
>>>>  	}
>>>>  
>>>> +	cgroup_taskset_for_each(task, cg, tset) {
>>>> +		/* Attaching is in process */
>>>> +		if (task->task_ve_work.func != NULL)
>>>> +			return -EAGAIN;
>>>
>>> Failing write to cgroup.procs with EAGAIN looks unexpected. For sure,
>>> nobody will retry.
>>
>> This is a protection against circular attaches, and real vzctl has never
>> bumped into this case. But the kernel should be safe under any userspace
>> behaviour.
> 
> Wouldn't it be better to just cancel the pending work if it happens
> instead of returning an error to userspace?

No. See the next paragraph for the details.
 
>>
>> EAGAIN is a correct return value for a write to cgroup.procs from userspace.
>> For example, the latest mainstream pids_can_fork() uses it.
> 
> I don't see how it is related. pids_can_fork() is called from fork(),
> which is known to return EAGAIN on resource shortage. What does EAGAIN
> mean in your case? Try again later? Nobody will.

EAGAIN means "attaching is in progress, try again later". If there are
two vzctl instances (the first one waiting for the queued attach to finish,
and the second one trying to do the same right now), one of them must fail.
Would you prefer the first? I wouldn't. That's more complicated and has no
advantages.
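
For illustration, a careful caller could handle EAGAIN like this (a plain
userspace sketch; the cgroup path and the retry policy are my assumptions,
not part of the patch):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Attach a pid to a ve cgroup, retrying while a previous attach
 * is still queued (the kernel returns -EAGAIN in that window). */
static int attach_pid(const char *procs_path, pid_t pid)
{
	char buf[16];
	int fd, len, i;
	ssize_t ret = -1;

	fd = open(procs_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", (int)pid);
	for (i = 0; i < 100; i++) {	/* bounded retry */
		ret = write(fd, buf, len);
		if (ret >= 0 || errno != EAGAIN)
			break;
		usleep(1000);		/* previous attach still in flight */
	}

	close(fd);
	return ret < 0 ? -errno : 0;
}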

As for pids_can_fork(), sure, it's not about cgroup.procs. But the thing is
that any write may fail, which every userspace programmer knows, and nowhere
in the cgroup comments/docs is this prohibited. In general, the .can_attach
method was introduced for situations like this. Why would cgroup need it if
attach could never fail? No reason.

Also, it's not possible to wait in ve_can_attach() for the previous attach
to finish, because we're holding cgroup_mutex and threadgroup_lock() and do
not know which code the previous task is currently executing.
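
To make the constraint concrete, here is the call context in the 3.10-era
cgroup core (a simplified sketch, not part of the patch):

/*
 * attach_task_by_pid()                   kernel/cgroup.c
 *   mutex_lock(&cgroup_mutex);
 *   threadgroup_lock(tsk);               blocks fork/exit of the group
 *   cgroup_attach_task()
 *     ss->can_attach(cgrp, tset);        ve_can_attach() runs here
 *
 * A queued task_ve_work only runs via task_work_run() when its task
 * returns to userspace, and that task may be anywhere in the kernel
 * at the moment, so sleeping here would be unbounded at best and a
 * deadlock at worst.
 */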

>>  
>>>> +	}
>>>> +	smp_mb(); /* Pairs with smp_mb() in ve_attach_work() */
>>>> +
>>>>  	return 0;
>>>>  }
>>>>  
>>>> @@ -753,7 +769,8 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>>>>  		/* Leave parent exec domain */
>>>>  		task->parent_exec_id--;
>>>>  
>>>> -		task->task_ve = ve;
>>>> +		init_task_work(&task->task_ve_work, ve_attach_work);
>>>
>>> Why do you need to keep the work on task_struct? It's already huge, so
>>> let's please try not to make it even bigger. I don't see any reason why
>>> kmalloc-ing a task_work wouldn't suffice in this case. Moreover, AFAICS
>>> it would liberate us from the necessity to use memory barriers and allow
>>> us to avoid EAGAIN failures.
>>
>> The current task_struct size is 3232 bytes, and the patchset adds 16 bytes
>> more:
>>
>> 16/3232 ≈ 0.495%
>>
>> It does not look like the set dramatically increases task_struct size or
>> affects performance in any way.
> 
> Nice calculations, but no, it doesn't work for me. You add 16 bytes now.
> Somebody else adds another 16 bytes, because she can't get on without

That's the usual "someone adds, someone deletes" way of thinking.
We do not add task_struct members too often, and this is a place
where kmalloc() complicates things much more than it has to.

> them. Another 16 bytes a bit later, and we finally have
> 
> 3232 + 16 * 3 = 32800

The last zero is wrong: 3232 + 16 * 3 = 3280.

> 
> Only 1% overhead you say, but it means 10 objects won't fit in one slab
> any more so it turns out to be a 10% loss.
> 
> About those 16 bytes I added for the same purpose. Well, I'm bad. Nobody
> stopped me unfortunately. This definitely needs to be fixed as well, and
> I will once I have time.
>
> That said, my point is, as we can easily avoid this wastage, we
> definitely should.

It's not that easy.

Suspending a CT?! The time under the lock would grow linearly with the
amount of memory to allocate. cgroup_mutex is held, so nobody can fork or
exit, and allocations can stall on IO.

No. The global mutex *must* be held as briefly as possible. No memory
operations under it.
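
For reference, the scheme being defended here, as a minimal sketch (the
field and function names follow the quoted diff; the body of the work is
my assumption, only the queueing pattern matters):

/* The work item is embedded in task_struct, so queueing it under
 * cgroup_mutex needs no allocation and thus no ENOMEM/IO path. */
struct task_struct {
	/* ... */
	struct callback_head task_ve_work;	/* two pointers: 16 bytes on x86_64 */
	/* ... */
};

/* Runs from task_work_run() when the task returns to userspace,
 * i.e. after cgroup_mutex and threadgroup_lock() are released. */
static void ve_attach_work(struct callback_head *head)
{
	struct task_struct *tsk = current;

	/* ... switch tsk->task_ve to the new ve here (assumed body) ... */

	smp_mb();	/* Pairs with smp_mb() in ve_can_attach() */
	tsk->task_ve_work.func = NULL;	/* attach done; new attaches allowed */
}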

>>
>> At the same time kmalloc() complicates the code, because it forces us
>> to introduce a .cancel_attach method, which does not exist now. It's
>> needed to handle the ENOMEM situation (imagine a big multi-threaded
>> process).
> 
> Those barriers do complicate the code even worse, if you ask me. If
> you're lazy and don't want to handle ENOMEM, use __GFP_NOFAIL. It won't
> make any difference, because small allocations don't fail anyway.

No. We can't do IO operations under a global mutex. It stalls the whole
system and worsens performance.
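
For comparison, the kmalloc() variant being rejected would look roughly
like this (a sketch of the alternative under discussion, not real code):

static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
{
	struct task_struct *task;

	cgroup_taskset_for_each(task, cg, tset) {
		struct callback_head *cb;

		/* GFP_KERNEL may enter reclaim and block on IO --
		 * with cgroup_mutex held, that stalls everyone */
		cb = kmalloc(sizeof(*cb), GFP_KERNEL);
		if (!cb)
			return -ENOMEM;	/* and now a .cancel_attach is
					 * needed to free the works already
					 * allocated for earlier tasks */
		init_task_work(cb, ve_attach_work);
		/* ... plus somewhere per-task to stash cb until ->attach()
		 * runs, which is exactly the extra complexity above ... */
	}
	return 0;
}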

>>
>>> Hell, I placed ub's cgroup_attach_work in task_struct either, what a
>>> mess...
>>
>> Probably we could keep a single attach work in task_struct and overload
>> it for both ve and ub.
> 
> No, that was my mistake to place task_work in task_struct. I'll get rid
> of that.
> 
>>
>>> Besides, I don't see you handle a race with fork (see ub_cgroup_fork).
>>> Why?
>>
>> Because the race is a BUG that existed long before my patches, and it's
>> not connected with them. Don't worry, I haven't forgotten about it.
>>
>> The series is about delayed attaching. Let's not mix everything together here.
> 
> The fix should only take a couple of lines, to the best of my knowledge,
> so why not just do it right away.

It's about a different thing and should go as a separate patch. I don't
think it's any more pleasant for reviewers to receive 4 patches in the
(possible) next version instead of 3.

>>  
>>>> +		task_work_add(task, &task->task_ve_work, true);
>>>>  	}
>>>>  }
>>>>  
>>>>
>>


