[Devel] [PATCH vz9] sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Mon Dec 18 09:39:59 MSK 2023
On 16/12/2023 01:21, Konstantin Khorenko wrote:
> On 15.12.2023 12:19, Pavel Tikhomirov wrote:
>>
>>
>> On 14/12/2023 06:01, Konstantin Khorenko wrote:
>>> After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at
>>> least one movable task") detach_tasks() does not stop on the condition
>>> (env->loop > env->loop_max) in case no movable task found.
>>>
>>> Instead, if there are no movable tasks in the rq, the exit always
>>> happens on the loop_break check - thus with the LBF_NEED_BREAK flag set.
>>>
>>> It's not a problem for mainstream, because when LBF_NEED_BREAK is set,
>>> load_balance() proceeds with balancing only if (env.loop <
>>> busiest->nr_running). But in the Virtuozzo kernel with the CFS_CPULIMIT
>>> feature, right before that check we try to move a whole task group (the
>>> object of the CFS_CPULIMIT feature) and reset env.loop to zero,
>>> so we get a livelock here.
>>>
>>> Resetting env.loop makes sense in case we have successfully moved some
>>> tasks (in the scope of the task group), but if we failed to move any
>>> task, no progress is expected during further balancing attempts.
>>>
>>> Ways to fix it:
>>> 1. In load_balance() restore old env.loop in case no tasks were moved
>>> by move_task_groups()
>>> 2. Add a check in detach_tasks() to exit without the LBF_NEED_BREAK
>>> flag in
>>> case we have scanned all the tasks and have not found any movable one.
>>>
>>> Current patch implements the second way.
>>>
>>> Caused by ms commit: b0defa7ae03e ("sched/fair: Make sure to try to
>>> detach at least one movable task")
>>>
>>> Fixes ms commit: bca010328248 ("sched: Port CONFIG_CFS_CPULIMIT
>>> feature")
>>
>> this is not an "ms" commit
>>
>>>
Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
>>> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
>>>
>>> Feature: sched: ability to limit number of CPUs available to a CT
>>> ---
>>> kernel/sched/fair.c | 6 ++++++
>>> 1 file changed, 6 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 9367d16a8d85..e068bb90f197 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8771,6 +8771,12 @@ static int detach_tasks(struct lb_env *env)
>>> if (env->loop > env->loop_max &&
>>> !(env->flags & LBF_ALL_PINNED))
>>> break;
>>> + /*
>>> + * Quit if we have scanned all the tasks, even if we haven't
>>> + * found any movable one.
>>> + */
>>> + if (env->loop > env->src_rq->nr_running)
>>> + break;
>>
>> The env->loop is clearly designed in the load_balance() function to be
>> reused on the next iteration of "goto more_balance". The fact that we
>> clear it to zero in our code is really strange... Maybe we can have a
>> separate env.vzloop variable for our part of the code (or we can
>> properly integrate with env.loop without zeroing it)?
>
> Well, in fact load_balance() contains 2 other places where env.loop is
> reset to 0 as well and the execution is sent to "more_balance" after that.
Agree.
>
> In general I understand the idea of resetting env.loop in our case as well:
> we have moved the task group (our feature) and after that we would like to
> balance the load again - so resetting the loop counter looks logical.
> The problem is that we reset env.loop even in case we have not really
> moved any task group, and thus restarting the balancing process is
> incorrect.
>
> I thought about checking the result of move_task_groups() and, if it's
> zero, just restoring the env.loop counter to its value before the
> attempt to move the task group.
> Maybe it is worth doing, but I just decided to resurrect the previous
> behavior -
> thus do not set the LBF_NEED_BREAK flag if we reach the end of the run queue.
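FWIW, option 1 would look roughly like this (a pseudocode fragment with
hypothetical names, I haven't checked the exact context in
load_balance()):

```
	/* somewhere in load_balance(), around the CFS_CPULIMIT path */
	saved_loop = env.loop;
	nr_moved = move_task_groups(&env);	/* this path resets env.loop */
	if (nr_moved == 0)
		env.loop = saved_loop;	/* no progress: don't restart the scan */
```

Not that it matters now - the patch's way works too, just noting it for
the record.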
>
>
>> Do we also need the same crazy LBF_ALL_PINNED change in
>> move_task_groups()'s "if (env->loop > env->loop_max) break;" check?
>> Similar to what b0defa7ae03ec ("sched/fair: Make sure to try to detach
>> at least one movable task") does in detach_tasks() to allow it to
>> handle the case when the lru tail is pinned?
>
> And again, I thought about it.
> move_task_groups() increases env.loop only in case it finds a task to
> move, so in case some tasks are pinned, env.loop is not increased and
> we just go further.
Agree.
>
> So we exit move_task_groups() when either:
> 1. we scanned all groups (and did not find enough tasks to move to fix
> the imbalance)
> 2. we scanned some groups and found/moved more tasks than env.loop_max
> 3. we scanned some groups, found/moved some tasks and their number was
> enough to fix the imbalance
Seems reasonable. Thanks for explanation!
>
>>> /* take a breather every nr_migrate tasks */
>>> if (env->loop > env->loop_break) {
>>
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.