[Devel] [PATCH RHEL7 COMMIT] ms/sched: Guarantee new group-entities always have weight

Tue Oct 11 04:55:13 PDT 2016

The commit is pushed to "branch-rh7-3.10.0-327.36.1.vz7.18.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.36.1.vz7.18.6
------>
commit 83311efdbbdfcba741ec5e9dd148062b8144ab46
Author: Kirill Tkhai <ktkhai at virtuozzo.com>
Date:   Tue Oct 11 15:55:13 2016 +0400

    ms/sched: Guarantee new group-entities always have weight
    
    There is a node with never scheduled cpu cgroup. Its cfs_rq looks like:
    
    struct cfs_rq {
      load = {
        weight = 0,
        inv_weight = 0
      },
      nr_running = 2,
      throttled = 0,
    
    while its sched_entity has on_rq = 0.
    
    The only way, where not-empty cfs_rq may be deqeued,
    is when it was throttled. But now it's not throttled,
    this means that unthrottle_cfs_rq() earlier skipped
    its queueing. This may happen only
    if cfs_rq->load.weight is 0, and we see that.
    
    cfs_rq.load.weight is assigned by update_load_set(),
    and it's possible in two places:
    
    update_cfs_shares()
      shares = calc_cfs_shares(cfs_rq, tg); /* returns shares > 0 */
      reweight_entity(shares);
         update_load_set(shares);
    and
    
    init_tg_cfs_entry()
      update_load_set(0);
    
    As it's clearly seen, assignment of 0 is possible in only place,
    and there is a race, which has already fixed in mainstream.
    So, let's backport their patch to fix the problem:
    
    ms commit 0ac9b1c21874 from Paul Turner <pjt at google.com>
    "sched: Guarantee new group-entities always have weight"
    
    Currently, group entity load-weights are initialized to zero. This
    admits some races with respect to the first time they are re-weighted in
    earlty use. ( Let g[x] denote the se for "g" on cpu "x". )
    
    Suppose that we have root->a and that a enters a throttled state,
    immediately followed by a[0]->t1 (the only task running on cpu[0])
    blocking:
    
      put_prev_task(group_cfs_rq(a[0]), t1)
      put_prev_entity(..., t1)
      check_cfs_rq_runtime(group_cfs_rq(a[0]))
      throttle_cfs_rq(group_cfs_rq(a[0]))
    
    Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
    time:
    
      enqueue_task_fair(rq[0], t2)
      enqueue_entity(group_cfs_rq(b[0]), t2)
      enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
      account_entity_enqueue(group_cfs_ra(b[0]), t2)
      update_cfs_shares(group_cfs_rq(b[0]))
      < skipped because b is part of a throttled hierarchy >
      enqueue_entity(group_cfs_rq(a[0]), b[0])
      ...
    
    We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
    which violates invariants in several code-paths. Eliminate the
    possibility of this by initializing group entity weight.
    
    Signed-off-by: Paul Turner <pjt at google.com>
    Signed-off-by: Peter Zijlstra <peterz at infradead.org>
    Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar <mingo at kernel.org>
    
    https://jira.sw.ru/browse/PSBM-53136
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    Acked-by: Cyrill Gorcunov <gorcunov at openvz.org>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be0c3dd..96fe1d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8039,7 +8039,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	}
 
 	se->my_q = cfs_rq;
-	update_load_set(&se->load, 0);
+	/* guarantee group entities always have weight */
+	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
 
 #ifdef CONFIG_SCHEDSTATS