[Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks

Konstantin Khorenko khorenko at virtuozzo.com
Fri Dec 15 20:21:13 MSK 2023


On 15.12.2023 12:19, Pavel Tikhomirov wrote:
> 
> 
> On 14/12/2023 06:01, Konstantin Khorenko wrote:
>> After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at
>> least one movable task") detach_tasks() no longer stops on the
>> (env->loop > env->loop_max) condition if no movable task has been found.
>>
>> Instead, if there are no movable tasks in the rq, the exit always
>> happens on the loop_break check - thus with the LBF_NEED_BREAK flag set.
>>
>> This is not a problem for mainstream, because load_balance() proceeds
>> with balancing on LBF_NEED_BREAK only if (env.loop <
>> busiest->nr_running). But in the Virtuozzo kernel with the CFS_CPULIMIT
>> feature, right before that check we try to move a whole task group (the
>> object of the CFS_CPULIMIT feature) and reset env.loop to zero,
>> so we get a livelock here.
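>>
>> Roughly, the mainline bail-out that this reset defeats (a sketch only;
>> the exact placement of the vz-specific call may differ):
>>
>> 	ld_moved += move_task_groups(&env);	/* vz: CFS_CPULIMIT */
>> 	env.loop = 0;		/* vz: reset, even if nothing was moved */
>>
>> 	if (env.flags & LBF_NEED_BREAK) {
>> 		env.flags &= ~LBF_NEED_BREAK;
>> 		/* Stop if we tried all running tasks */
>> 		if (env.loop < busiest->nr_running)	/* 0: always true */
>> 			goto more_balance;	/* taken forever -> livelock */
>> 	}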
>>
>> Resetting env.loop makes sense when we have successfully moved some
>> tasks (within the task group), but if we failed to move any task, no
>> progress can be expected from further balancing attempts.
>>
>> Ways to fix it:
>>    1. In load_balance(), restore the old env.loop value in case no tasks
>>       were moved by move_task_groups().
>>    2. Add a check in detach_tasks() to exit without the LBF_NEED_BREAK
>>       flag set in case we have scanned all the tasks and found no
>>       movable ones.
>>
>> This patch implements the second way.
>>
>> Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to
>> detach at least one movable task")
>>
>> Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT feature")
> 
> This is not an "ms" commit.
> 
>>
>> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
>>
>> Feature: sched: ability to limit number of CPUs available to a CT
>> ---
>>    kernel/sched/fair.c | 6 ++++++
>>    1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9367d16a8d85..e068bb90f197 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env)
>>    		if (env->loop > env->loop_max &&
>>    		    !(env->flags & LBF_ALL_PINNED))
>>    			break;
>> +		/*
>> +		 * Quit if we have scanned all the tasks, even if we have not
>> +		 * found any movable task.
>> +		 */
>> +		if (env->loop > env->src_rq->nr_running)
>> +			break;
> 
> The env->loop is clearly designed in the load_balance() function to be
> reused on the next iteration of "goto more_balance". The fact that we
> clear it to zero in our code is really strange... Maybe we could have a
> separate env.vzloop variable for our part of the code (or properly
> integrate with env.loop without zeroing it)?

Well, in fact load_balance() contains two other places where env.loop is
reset to 0 as well and execution is sent back for another balancing pass.
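
For reference, heavily abbreviated from my reading of mainline fair.c
(note the second path restarts via "redo" rather than "more_balance"):

	/* dst cpu was pinned: retry with the new dst cpu */
	if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
		...
		env.loop = 0;
		goto more_balance;
	}

	/* everything on busiest is pinned: drop it from cpus and redo */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		__cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_subset(cpus, env.cpus)) {
			env.loop = 0;
			goto redo;
		}
		...
	}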

In general I understand the idea of resetting env.loop in our case as well:
we have moved a task group (our feature) and after that we would like to
balance the load again, so resetting the loop counter looks logical.
The real question is that we reset env.loop even when we have not actually
moved any task group, and in that case restarting the balancing process is
incorrect.

I thought about checking the result of move_task_groups() and, if it is
zero, just restoring the env.loop counter to the value it had before the
task group move attempt (roughly the sketch below).
Maybe that is worth doing, but I decided to simply resurrect the previous
behavior - that is, do not set the LBF_NEED_BREAK flag if we reach the end
of the run queue.
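
An untested sketch of that alternative (assuming move_task_groups()
returns the number of moved tasks, as the zero check above implies):

	unsigned int saved_loop = env.loop;

	if (move_task_groups(&env)) {
		/* Real progress was made: rescan from scratch. */
		env.loop = 0;
	} else {
		/* Nothing moved: resetting to 0 would livelock. */
		env.loop = saved_loop;
	}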


> Do we also need the same crazy LBF_ALL_PINNED change in
> move_task_groups's "if (env->loop > env->loop_max) break;" check?
> Similar to what b0defa7ae03ec ("sched/fair: Make sure to try to detach
> at least one movable task") does in detach_tasks() to allow it to handle
> the case when the lru tail is pinned?

And again, I thought about it.
move_task_groups() increases env.loop only when it finds a task to move,
so if some tasks are pinned, env.loop is not increased and we simply go
further.

So move_task_groups() exits in one of the following cases (see the sketch
below):
1. we scanned all the groups (and did not find enough tasks to move to fix
   the imbalance);
2. we scanned some groups and the number of found/moved tasks exceeded
   env.loop_max;
3. we scanned some groups, and the number of found/moved tasks was enough
   to fix the imbalance.
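
A hypothetical sketch of that control flow (the helpers are made-up names
for illustration; this is not the actual vz code):

	struct task_group *tg;
	unsigned int nr;

	for_each_task_group(tg) {
		nr = movable_tasks_in(tg);	/* hypothetical: 0 if pinned */
		if (!nr)
			continue;		/* env->loop stays untouched */

		env->loop += nr;
		move_one_group(env, tg);	/* hypothetical */

		if (env->loop > env->loop_max)
			break;			/* case 2 */
		if (!env->imbalance)
			break;			/* case 3: imbalance fixed */
	}
	/* loop ran to completion: case 1, all groups scanned */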

>>    
>>    		/* take a breather every nr_migrate tasks */
>>    		if (env->loop > env->loop_break) {
> 

