[Devel] [PATCH rh7] Port diff-sched-initialize-runtime-to-non-zero-on-cfs-bw-set

Vladimir Davydov vdavydov at parallels.com
Fri May 29 06:18:42 PDT 2015


Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: initialize runtime to non-zero on cfs bw set
Date: Mon, 21 Jan 2013 11:44:58 +0400

* [sched] running tasks could be throttled and never unthrottled
	thus causing random node hangs. (PSBM-17658)

If cfs_rq->runtime_remaining is <= 0 then either
- cfs_rq is throttled and waiting for quota redistribution, or
- cfs_rq is currently executing and will be throttled on
  put_prev_entity, or
- cfs_rq is not throttled and has not executed since its quota was set
  (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).

The last case is an exception to the rule "runtime_remaining <= 0 iff
cfs_rq is throttled or will be throttled as soon as it finishes
executing". Moreover, it can lead to a task hang as follows. If
put_prev_task is called immediately after the first pick_next_task
following a quota change, "immediately" meaning that rq->clock is the
same in both functions, then the corresponding cfs_rq will be throttled.
Besides being unfair (the cfs_rq has not actually executed), this is
dangerous: the quota refill timer may be idle at that time, and it will
not be activated on put_prev_task, because update_curr calls
account_cfs_rq_runtime, which activates the timer, only if delta_exec is
strictly positive. As a result, we can end up with a task "running"
inside a throttled cfs_rq that will probably never be unthrottled.

To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
will be throttled only if it has executed for some positive number of
nanoseconds.

Several times our customers have encountered such hangs inside a VM
(it seems that time accounting is wrong, or at least different, there).
Analysis of crash dumps revealed that the hung tasks were running inside
cfs_rq's that had the following setup:

cfs_rq->throttled=1
cfs_rq->runtime_enabled=1
cfs_rq->runtime_remaining=0
cfs_rq->tg->cfs_bandwidth.idle=1
cfs_rq->tg->cfs_bandwidth.timer_active=0

which matches the explanation given above quite well.

https://jira.sw.ru/browse/PSBM-17658

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
=============================================================================

Related to https://jira.sw.ru/browse/PSBM-33642

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e6254b641a6..d8831c91bebb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8342,7 +8342,7 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
-- 
2.1.4



