[Devel] [PATCH RHEL7 COMMIT] sched: Port diff-sched-initialize-runtime-to-non-zero-on-cfs-bw-set
Konstantin Khorenko
khorenko at virtuozzo.com
Thu Jun 4 04:53:10 PDT 2015
The commit is pushed to "branch-rh7-3.10.0-123.1.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.9
------>
commit 034153bb1d78cbfd0c44c0e91c3c77d4178606b6
Author: Vladimir Davydov <vdavydov at parallels.com>
Date: Thu Jun 4 15:53:10 2015 +0400
sched: Port diff-sched-initialize-runtime-to-non-zero-on-cfs-bw-set
Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: initialize runtime to non-zero on cfs bw set
Date: Mon, 21 Jan 2013 11:44:58 +0400
* [sched] running tasks could be throttled and never unthrottled
thus causing random node hangs. (PSBM-17658)
If cfs_rq->runtime_remaining is <= 0 then either
- cfs_rq is throttled and waiting for quota redistribution, or
- cfs_rq is currently executing and will be throttled on
put_prev_entity, or
- cfs_rq is not throttled and has not executed since its quota was set
(runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
It is obvious that the last case is rather an exception to the rule
"runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
soon as it finishes its execution". Moreover, it can lead to a task hang
as follows. If put_prev_task is called immediately after the first
pick_next_task following a quota update, "immediately" meaning rq->clock
reads the same in both functions, then the corresponding cfs_rq will be
throttled. Besides being unfair (the cfs_rq has not executed in fact),
the quota refilling timer can be idle at that time and it won't be
activated on put_prev_task because update_curr calls
account_cfs_rq_runtime, which activates the timer, only if delta_exec is
strictly positive. As a result we can get a task "running" inside a
throttled cfs_rq which will probably never be unthrottled.
To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
will be throttled only if it has executed for some positive number of
nanoseconds.
Several times our customers encountered such hangs inside a VM
(it seems time accounting works somewhat differently there).
Analyzing crash dumps revealed that hung tasks were running inside
cfs_rq's, which had the following setup
cfs_rq->throttled=1
cfs_rq->runtime_enabled=1
cfs_rq->runtime_remaining=0
cfs_rq->tg->cfs_bandwidth.idle=1
cfs_rq->tg->cfs_bandwidth.timer_active=0
which conforms pretty nicely to the explanation given above.
https://jira.sw.ru/browse/PSBM-17658
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
=============================================================================
Related to https://jira.sw.ru/browse/PSBM-33642
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e6254b..d8831c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8342,7 +8342,7 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);