[Devel] [PATCH RHEL7 COMMIT] sched: Port diff-sched-initialize-runtime-to-non-zero-on-cfs-bw-set

Thu Jun 4 04:53:10 PDT 2015

The commit is pushed to "branch-rh7-3.10.0-123.1.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.9
------>
commit 034153bb1d78cbfd0c44c0e91c3c77d4178606b6
Author: Vladimir Davydov <vdavydov at parallels.com>
Date:   Thu Jun 4 15:53:10 2015 +0400

    sched: Port diff-sched-initialize-runtime-to-non-zero-on-cfs-bw-set
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: initialize runtime to non-zero on cfs bw set
    Date: Mon, 21 Jan 2013 11:44:58 +0400
    
    * [sched] running tasks could be throttled and never unthrottled,
    	thus causing random node hangs. (PSBM-17658)
    
    If cfs_rq->runtime_remaining is <= 0 then either
    - cfs_rq is throttled and waiting for quota redistribution, or
    - cfs_rq is currently executing and will be throttled on
      put_prev_entity, or
    - cfs_rq is not throttled and has not executed since its quota was set
      (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
    
    The last case is an exception to the rule "runtime_remaining<=0 iff
    cfs_rq is throttled or will be throttled as soon as it finishes its
    execution". Moreover, it can lead to a task hang as follows. If
    put_prev_task is called immediately after the first pick_next_task
    following a quota change, "immediately" meaning that rq->clock is the
    same in both functions, then the corresponding cfs_rq will be
    throttled. Besides being unfair (the cfs_rq has not in fact executed),
    this is dangerous: the quota refill timer can be idle at that time, and
    it won't be activated on put_prev_task because update_curr calls
    account_cfs_rq_runtime, which activates the timer, only if delta_exec
    is strictly positive. As a result we can get a task "running" inside a
    throttled cfs_rq which will probably never be unthrottled.
    
    To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
    runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
    will be throttled only if it has executed for some positive number of
    nanoseconds.
    
    Several times our customers have encountered such hangs inside a VM
    (it seems that time accounting is wrong, or at least different, there).
    Analysis of crash dumps revealed that the hung tasks were running
    inside cfs_rq's that had the following state:
    
    cfs_rq->throttled=1
    cfs_rq->runtime_enabled=1
    cfs_rq->runtime_remaining=0
    cfs_rq->tg->cfs_bandwidth.idle=1
    cfs_rq->tg->cfs_bandwidth.timer_active=0
    
    which matches the explanation given above well.
    
    https://jira.sw.ru/browse/PSBM-17658
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    =============================================================================
    
    Related to https://jira.sw.ru/browse/PSBM-33642
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e6254b..d8831c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8342,7 +8342,7 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
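
The hang described in the commit message can be illustrated with a small
user-space model. This is only a sketch, not kernel code: struct model_cfs_rq
and every model_* function are hypothetical stand-ins that keep just the
fields and checks named in the description (runtime_remaining, throttled,
timer_active, and the "delta_exec must be strictly positive" condition).
Period refill and unthrottling are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the kernel's cfs_rq. */
struct model_cfs_rq {
	int64_t  runtime_remaining;
	bool     runtime_enabled;
	bool     throttled;
	bool     timer_active;   /* models cfs_bandwidth.timer_active */
	uint64_t exec_start;     /* clock value when the entity was picked */
};

/* Models the bandwidth reconfiguration; "initial" is 0 before the patch, 1 after. */
static void model_set_bandwidth(struct model_cfs_rq *cfs_rq, int64_t initial)
{
	cfs_rq->runtime_enabled = true;
	cfs_rq->runtime_remaining = initial;
	cfs_rq->throttled = false;
	cfs_rq->timer_active = false;   /* the refill timer is assumed idle */
}

static void model_pick_next(struct model_cfs_rq *cfs_rq, uint64_t clock)
{
	cfs_rq->exec_start = clock;
}

/* Models update_curr: runtime is charged only if delta_exec > 0. */
static void model_update_curr(struct model_cfs_rq *cfs_rq, uint64_t clock)
{
	uint64_t delta_exec = clock - cfs_rq->exec_start;

	if (delta_exec == 0)
		return;                 /* nothing charged, timer left untouched */

	cfs_rq->exec_start = clock;
	cfs_rq->runtime_remaining -= (int64_t)delta_exec;
	if (cfs_rq->runtime_remaining <= 0)
		cfs_rq->timer_active = true;   /* asking for more quota arms the timer */
}

/* Models put_prev_entity: account the run, then throttle if out of runtime. */
static void model_put_prev(struct model_cfs_rq *cfs_rq, uint64_t clock)
{
	model_update_curr(cfs_rq, clock);
	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
		cfs_rq->throttled = true;
}

/*
 * Returns true if the cfs_rq ends up throttled while the refill timer is
 * idle, i.e. in the state seen in the crash dumps.
 */
static bool model_stuck(int64_t initial, uint64_t run_ns)
{
	struct model_cfs_rq cfs_rq = {0};
	uint64_t clock = 1000;

	model_set_bandwidth(&cfs_rq, initial);
	model_pick_next(&cfs_rq, clock);
	model_put_prev(&cfs_rq, clock + run_ns);
	return cfs_rq.throttled && !cfs_rq.timer_active;
}

/* Checks the three cases discussed in the note below. */
static void model_self_check(void)
{
	assert(model_stuck(0, 0));    /* old initial value, zero-length run */
	assert(!model_stuck(1, 0));   /* patched initial value */
	assert(!model_stuck(0, 5));   /* a real run arms the timer */
}
```

In this model, model_stuck(0, 0) is true: with the old initial value a
zero-length run is throttled and nothing arms the timer. model_stuck(1, 0) is
false because runtime_remaining is still positive at put_prev time, and
model_stuck(0, 5) is false because a run of positive length goes through the
accounting path that arms the timer.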