[Devel] [PATCH RHEL8 COMMIT] ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups

Konstantin Khorenko khorenko at virtuozzo.com
Wed Oct 28 19:20:32 MSK 2020


The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.14
------>
commit f52cf2752bca3141c1cfd2585d0ff1ad9d2dffac
Author: Konstantin Khorenko <khorenko at virtuozzo.com>
Date:   Mon Dec 11 23:40:07 2017 +0300

    ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups
    
    This patch is part of the vz7 commit (only the avenrun part)
    34a1dc1e4e3d ("sched: Account task_group::cpustat,taskstats,avenrun")
    
      Extracted from "Initial patch".
    
      Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
      +++
      ve/sched: Do not use kstat_glb_lock to update kstat_glob::nr_unint_avg
    
      kstat_glob::nr_unint_avg is never updated in parallel on two or
      more cpus, so on modification we only have to protect against
      readers.

      So, avoid taking the global kstat_glb_lock here, to minimize its
      sharing with the other counters it protects.
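
      A minimal sketch of this single-writer pattern (the seqcount name
      and helper below are hypothetical; the actual vz7 code may
      differ):

          static seqcount_t nr_unint_seq;  /* init with seqcount_init() */

          /*
           * Only one cpu ever modifies nr_unint_avg, so no writer-vs-
           * writer exclusion is needed; the seqcount only fences off
           * readers.
           */
          static void update_nr_unint_avg(unsigned long *avg,
                                          unsigned long active)
          {
                  write_seqcount_begin(&nr_unint_seq);
                  avg[0] = calc_load(avg[0], EXP_1, active);
                  avg[1] = calc_load(avg[1], EXP_5, active);
                  avg[2] = calc_load(avg[2], EXP_15, active);
                  write_seqcount_end(&nr_unint_seq);
          }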
    
      Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    (cherry picked from commit 715f311fdb4ab0b7922f9e53617c5821ae36bfaf)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched/ve: Use cfs_rq::h_nr_running to count loadavg
    
    cfs_rq::nr_running contains the number of child entities
    one level below: tasks and child cfs_rqs, but it does not
    include tasks from deeper levels.

    Use cfs_rq::h_nr_running instead, as it counts the tasks
    of the whole child hierarchy.
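
    For illustration, a hypothetical helper (the field semantics are as
    in kernel/sched; the function itself is not part of this patch):

        /*
         * If a VE root group runs one task directly plus a child
         * cgroup with three runnable tasks on this cpu, nr_running
         * is 2 (the task and the child group's sched entity), while
         * h_nr_running is 4 (all tasks in the subtree).
         */
        static unsigned long tg_nr_tasks_on_cpu(struct task_group *tg, int cpu)
        {
                return tg->cfs_rq[cpu]->h_nr_running;
        }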
    
    https://jira.sw.ru/browse/PSBM-81572
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    mFixes: 028c54e613a3 ("sched: Account task_group::avenrun")
    
    (cherry picked from vz7 commit 5f2a49a05629bd709ad6bfce83bfacc58a4db3d9)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched/ve: Iterate only VE root cpu cgroups to count loadavg
    
    When counting loadavg we are interested in VE root cpu cgroups
    only, as a VE's loadavg is the analogue of a node's loadavg.

    So, this patch makes us iterate over only these cpu cgroups
    when we calculate loadavg.

    Since this code is called from interrupt context, this may give
    positive performance results.
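
    For reference, a sketch of how a VE root cpu cgroup could be put
    on the dedicated list (ve_root_list and load_ve_lock are from the
    diff below; the attach helper itself is hypothetical):

        static void ve_root_tg_attach(struct task_group *tg)
        {
                unsigned long flags;

                /*
                 * irqsave: calc_load_ve() takes this lock from the
                 * timer interrupt path.
                 */
                spin_lock_irqsave(&load_ve_lock, flags);
                list_add(&tg->ve_root_list, &ve_root_list);
                spin_unlock_irqrestore(&load_ve_lock, flags);
        }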
    
    https://jira.sw.ru/browse/PSBM-81572
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    (cherry picked from vz7 commit 4140a241e5ec2230105f5c4513400a6b5ecea92f)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched: Export calc_load_ve()
    
    This will be used in next patch.
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    =========================
    Patchset description:
    Make calc_load_ve() be executed out of jiffies_lock
    
    https://jira.sw.ru/browse/PSBM-84967
    
    Kirill Tkhai (3):
          sched: Make calc_global_load() return true when it's need to update ve statistic
          sched: Export calc_load_ve()
          sched: Call calc_load_ve() out of jiffies_lock
    
    (cherry picked from vz7 commit 738b92fb2cdd6577925a6b7019925f320cd379df)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched: Call calc_load_ve() out of jiffies_lock
    
    jiffies_lock is a big global seqlock, which is used in many
    places. In combination with other actions like smp call
    functions and readers of this seqlock, the system may hang for
    a long time. We have already seen a pair of hard lockups caused
    by the long iteration in calc_load_ve() with jiffies_lock held,
    which made readers of this seqlock spin for a long time.

    This patch makes calc_load_ve() use a separate lock,
    which relaxes jiffies_lock. I think this should be enough
    to resolve the problem, since both crashes I saw contain
    readers of the seqlock on parallel cpus, and we won't have
    to relax things further (say, by moving calc_load_ve() to softirq).

    Note that the principal change this patch makes is that
    jiffies_lock readers on parallel cpus won't wait until calc_load_ve()
    finishes, so instead of (n_readers + 1) cpus waiting for
    this function to complete, there will be only one cpu doing that.
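
    Schematically (matching the tick-path hunks in the diff below):

        /* before: every jiffies_lock reader spins while
         * calc_load_ve() runs */
        write_seqlock(&jiffies_lock);
        do_timer(1);            /* used to call calc_load_ve() inside */
        write_sequnlock(&jiffies_lock);

        /* after: only the tick cpu runs it, under its own
         * load_ve_lock */
        bool calc_ve;

        write_seqlock(&jiffies_lock);
        calc_ve = do_timer(1);
        write_sequnlock(&jiffies_lock);
        if (calc_ve)
                calc_load_ve();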
    
    https://jira.sw.ru/browse/PSBM-84967
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    +++
    sched: really don't call calc_load_ve() under jiffies_lock
    
    Previously we did all the preparation work for calc_load_ve() not to be
    executed under jiffies_lock, and thus not to be called from
    calc_global_load(), but we forgot to drop the call in calc_global_load().
    So we still call the expensive calc_load_ve() under jiffies_lock and
    get an NMI.
    
    Fix that.
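
    For context, a sketch of calc_global_load() after the series, based
    on the upstream function of this kernel generation (assumption: only
    the bool return type and the dropped calc_load_ve() call differ
    from upstream):

        bool calc_global_load(unsigned long ticks)
        {
                unsigned long sample_window;
                long active, delta;

                sample_window = READ_ONCE(calc_load_update);
                if (time_before(jiffies, sample_window + 10))
                        return false;

                /* fold any contributions accumulated on idle cpus */
                delta = calc_load_nohz_read();
                if (delta)
                        atomic_long_add(delta, &calc_load_tasks);

                active = atomic_long_read(&calc_load_tasks);
                active = active > 0 ? active * FIXED_1 : 0;

                avenrun[0] = calc_load(avenrun[0], EXP_1, active);
                avenrun[1] = calc_load(avenrun[1], EXP_5, active);
                avenrun[2] = calc_load(avenrun[2], EXP_15, active);

                WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);

                calc_global_nohz();

                /* caller runs calc_load_ve() after dropping
                 * jiffies_lock */
                return true;
        }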
    
    mFixes: 19bc294a5691d ("sched: Call calc_load_ve() out of jiffies_lock")
    
    https://jira.sw.ru/browse/PSBM-102573
    
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
    
    (cherry picked from vz7 commit 0610b98e5b6537d2ecd99522c3cbd1aa939565e7)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
 include/linux/sched/loadavg.h |  8 +++++++
 kernel/sched/loadavg.c        | 50 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h          |  1 +
 kernel/time/tick-common.c     |  9 +++++++-
 kernel/time/tick-sched.c      |  6 +++++-
 kernel/time/timekeeping.c     |  5 ++++-
 6 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 34061919f880..1da5768389b7 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -16,6 +16,8 @@
  */
 extern unsigned long avenrun[];		/* Load averages */
 extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
+extern void get_avenrun_ve(unsigned long *loads,
+			   unsigned long offset, int shift);
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -47,4 +49,10 @@ extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
 
 extern bool calc_global_load(unsigned long ticks);
 
+#ifdef CONFIG_VE
+extern void calc_load_ve(void);
+#else
+#define calc_load_ve() do { } while (0)
+#endif
+
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a7b373053dc4..c62f34033112 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -76,6 +76,14 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
 	loads[2] = (avenrun[2] + offset) << shift;
 }
 
+void get_avenrun_ve(unsigned long *loads, unsigned long offset, int shift)
+{
+	struct task_group *tg = task_group(current);
+	loads[0] = (tg->avenrun[0] + offset) << shift;
+	loads[1] = (tg->avenrun[1] + offset) << shift;
+	loads[2] = (tg->avenrun[2] + offset) << shift;
+}
+
 long calc_load_fold_active(struct rq *this_rq, long adjust)
 {
 	long nr_active, delta = 0;
@@ -91,6 +99,48 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+#ifdef CONFIG_VE
+extern struct list_head ve_root_list;
+extern spinlock_t load_ve_lock;
+
+void calc_load_ve(void)
+{
+	unsigned long nr_active;
+	struct task_group *tg;
+	int i;
+
+	/*
+	 * This is called without jiffies_lock, and here we protect
+	 * against very rare parallel execution on two or more cpus.
+	 */
+	spin_lock(&load_ve_lock);
+	list_for_each_entry(tg, &ve_root_list, ve_root_list) {
+		nr_active = 0;
+		for_each_possible_cpu(i) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+			nr_active += tg->cfs_rq[i]->h_nr_running;
+			/*
+			 * We do not export nr_unint to parent task groups
+			 * like we do for h_nr_running, as it gives additional
+			 * overhead for activate/deactivate operations. So, we
+			 * don't account child cgroup unint tasks here.
+			 */
+			nr_active += tg->cfs_rq[i]->nr_unint;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+			nr_active += tg->rt_rq[i]->rt_nr_running;
+#endif
+		}
+		nr_active *= FIXED_1;
+
+		tg->avenrun[0] = calc_load(tg->avenrun[0], EXP_1, nr_active);
+		tg->avenrun[1] = calc_load(tg->avenrun[1], EXP_5, nr_active);
+		tg->avenrun[2] = calc_load(tg->avenrun[2], EXP_15, nr_active);
+	}
+	spin_unlock(&load_ve_lock);
+}
+#endif /* CONFIG_VE */
+
 /**
  * fixed_power_int - compute: x^n, in O(log n) time
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93bf1d78c27d..3f1e5ba43910 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -408,6 +408,7 @@ struct task_group {
 	struct list_head ve_root_list;
 #endif
 
+	unsigned long avenrun[3];	/* loadavg data */
 	/* Monotonic time in nsecs: */
 	u64			start_time;
 
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 61ce3505c195..47a9e0719ee8 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/sched.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <trace/events/power.h>
 
@@ -87,13 +88,19 @@ int tick_is_oneshot_available(void)
 static void tick_periodic(int cpu)
 {
 	if (tick_do_timer_cpu == cpu) {
+		bool calc_ve;
+
 		write_seqlock(&jiffies_lock);
 
 		/* Keep track of the next tick event */
 		tick_next_period = ktime_add(tick_next_period, tick_period);
 
-		do_timer(1);
+		calc_ve = do_timer(1);
 		write_sequnlock(&jiffies_lock);
+
+		if (calc_ve)
+			calc_load_ve();
+
 		update_wall_time();
 	}
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4380af8ac923..5f265f7cce76 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -57,6 +58,7 @@ static ktime_t last_jiffies_update;
 static void tick_do_update_jiffies64(ktime_t now)
 {
 	unsigned long ticks = 0;
+	bool calc_ve = false;
 	ktime_t delta;
 
 	/*
@@ -85,7 +87,7 @@ static void tick_do_update_jiffies64(ktime_t now)
 			last_jiffies_update = ktime_add_ns(last_jiffies_update,
 							   incr * ticks);
 		}
-		do_timer(++ticks);
+		calc_ve = do_timer(++ticks);
 
 		/* Keep the tick_next_period variable up to date */
 		tick_next_period = ktime_add(last_jiffies_update, tick_period);
@@ -94,6 +96,8 @@ static void tick_do_update_jiffies64(ktime_t now)
 		return;
 	}
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index bce92a9952f4..3b6500c5a357 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2398,8 +2398,11 @@ EXPORT_SYMBOL(hardpps);
  */
 void xtime_update(unsigned long ticks)
 {
+	bool calc_ve;
 	write_seqlock(&jiffies_lock);
-	do_timer(ticks);
+	calc_ve = do_timer(ticks);
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }

