[Devel] [PATCH RHEL7 COMMIT] ms/sched: Port diff-sched-increase-SCHED_LOAD_SCALE-resolution

Thu Jun 4 04:53:25 PDT 2015

The commit is pushed to "branch-rh7-3.10.0-123.1.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.9
------>
commit a8f671c42b1ce4ab57e099c1d6ddc388404c4082
Author: Vladimir Davydov <vdavydov at parallels.com>
Date:   Thu Jun 4 15:53:25 2015 +0400

    ms/sched: Port diff-sched-increase-SCHED_LOAD_SCALE-resolution
    
    This patch is already upstream, but the feature is disabled by default.
    This patch only enables it.
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: Increase SCHED_LOAD_SCALE resolution
    Date: Tue, 25 Dec 2012 13:33:21 +0400
    
    Mainstream commit c8b281161dfa4bb5d5be63fb036ce19347b88c63
    
    Introduce SCHED_LOAD_RESOLUTION, which is added to
    SCHED_LOAD_SHIFT and increases the resolution of
    SCHED_LOAD_SCALE. This patch sets the value of
    SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all
    sched entities by a factor of 1024. With this extra resolution,
    we can handle deeper cgroup hierarchies and the scheduler can do
    better shares distribution and load balancing on larger
    systems (especially for low weight task groups).
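
    The effect of the extra resolution on a low-weight group can be
    sketched as follows. This is a simplified illustration, not the
    kernel's code: per_cpu_share() is a hypothetical helper that splits
    a group's shares among CPUs in proportion to load using integer
    arithmetic, and the 16-CPU system is an assumed example.

```c
#include <assert.h>

/* Macros as defined in kernel/sched/sched.h when the feature is enabled. */
#define SCHED_LOAD_RESOLUTION	10
#define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
#define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

/*
 * Hypothetical helper (not a kernel function): the portion of a group's
 * shares that lands on one CPU, proportional to that CPU's part of the
 * group's total load. Integer division truncates the result.
 */
static unsigned long per_cpu_share(unsigned long group_shares,
				   unsigned long cpu_load,
				   unsigned long total_load)
{
	assert(total_load != 0);
	return group_shares * cpu_load / total_load;
}
```

    With cpu.shares = 2 spread evenly over 16 CPUs,
    per_cpu_share(2UL, 1, 16) truncates to 0, whereas
    per_cpu_share(scale_load(2UL), 1, 16) gives 128, so the per-CPU
    weights stay distinguishable.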
    
    This does not change the existing user interface, the scaled
    weights are only used internally. We do not modify
    prio_to_weight values or inverses, but use the original weights
    when calculating the inverse which is used to scale execution
    time delta in calc_delta_mine(). This ensures we do not lose
    accuracy when accounting time to the sched entities. Thanks to
    Nikunj Dadhania for fixing a bug in c_d_m() that broke fairness.
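
    The idea of computing the inverse from the original (unscaled)
    weights can be sketched roughly as below. This is a simplified
    model under stated assumptions, not the actual calc_delta_mine():
    calc_delta_sketch(), its parameter names and the 32-bit fixed-point
    constants are illustrative, and overflow handling is omitted.

```c
#include <stdint.h>

#define SCHED_LOAD_RESOLUTION	10
#define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
#define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

/* Illustrative fixed-point constants for the inverse weight (assumed). */
#define WMULT_SHIFT	32
#define WMULT_CONST	(1ULL << WMULT_SHIFT)

/*
 * Simplified model: delta_exec * se_weight / rq_weight, where both
 * weights arrive scaled up and are scaled back down before the inverse
 * is taken, so the inverse matches the original weight values.
 * No overflow protection; inputs are assumed to be small enough.
 */
static uint64_t calc_delta_sketch(uint64_t delta_exec,
				  uint64_t se_weight,
				  uint64_t rq_weight)
{
	uint64_t w = scale_load_down(se_weight);
	uint64_t lw = scale_load_down(rq_weight);
	uint64_t inv;

	/* Weights below 2^SCHED_LOAD_RESOLUTION scale down to 0; use 1. */
	if (w == 0)
		w = 1;
	if (lw == 0)
		lw = 1;

	inv = WMULT_CONST / lw;		/* inverse of the unscaled weight */
	return (delta_exec * w * inv) >> WMULT_SHIFT;
}
```

    For example, an entity of weight 1024 on a queue of total weight
    2048 (both passed through scale_load()) is charged 500000 for a
    delta of 1000000, the same result as with unscaled weights.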
    
    Below is some analysis of the performance costs/improvements of
    this patch.
    
    1. Micro-arch performance costs:
    
    Experiment was to run Ingo's pipe_test_100k 200 times with the
    task pinned to one cpu. I measured instructions, cycles and
    stalled-cycles for the runs. See:
    
       http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389
    
    for more info.
    
    -tip (baseline):
    
     Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):
    
           964,991,769 instructions             #    0.82  insns per cycle
                                                #    0.33  stalled cycles per insn
                                                #    ( +-  0.05% )
         1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
           306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
           314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )
    
            1.122405684  seconds time elapsed  ( +-  0.05% )
    
    -tip+patches:
    
     Performance counter stats for './load-scale/pipe-test-100k' (200 runs):
    
           963,624,821 instructions             #    0.82  insns per cycle
                                                #    0.33  stalled cycles per insn
                                                #    ( +-  0.04% )
         1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
           315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
           316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )
    
            1.122238659  seconds time elapsed  ( +-  0.06% )
    
    With this patch, instructions decrease by ~0.10% and cycles
    increase by 0.27%. This doesn't look statistically significant.
    The number of stalled cycles in the backend increased from
    26.16% to 26.83%. This can be attributed to the shifts we do in
    c_d_m() and other places. The fraction of stalled cycles in the
    frontend remains about the same, at 26.96% compared to 26.89% in -tip.
    
    2. Balancing low-weight task groups
    
    Test setup: run 50 tasks with random sleep/busy times (biased
    around 100ms) in a low weight container (with cpu.shares = 2).
    Measure %idle as reported by mpstat over a 10s window.
    
    -tip (baseline):
    
    06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
    06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
    06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
    06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
    06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
    06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
    06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
    06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
    06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
    06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
    06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
    Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60
    
    -tip+patches:
    
    06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
    06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
    06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
    06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
    06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
    06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
    06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
    06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
    06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
    06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
    06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
    Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58
    
    We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06%
    with the patches).
    
    Signed-off-by: Nikhil Rao <ncrao at google.com>
    Acked-by: Peter Zijlstra <peterz at infradead.org>
    Cc: Nikunj A. Dadhania <nikunj at linux.vnet.ibm.com>
    Cc: Srivatsa Vaddagiri <vatsa at linux.vnet.ibm.com>
    Cc: Stephan Barwolf <stephan.baerwolf at tu-ilmenau.de>
    Cc: Mike Galbraith <efault at gmx.de>
    Cc: Linus Torvalds <torvalds at linux-foundation.org>
    Cc: Andrew Morton <akpm at linux-foundation.org>
    Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar <mingo at elte.hu>
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    =============================================================================
    
    Related to https://jira.sw.ru/browse/PSBM-33642
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d4053c6..e4f92a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -48,7 +48,7 @@ extern __read_mostly int scheduler_running;
  * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
  * increased costs.
  */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
+#if BITS_PER_LONG > 32
 # define SCHED_LOAD_RESOLUTION	10
 # define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
 # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)