[Devel] [PATCH RHEL COMMIT] ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups

Konstantin Khorenko khorenko at virtuozzo.com
Fri Oct 1 19:38:43 MSK 2021


The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit 79f774543c4b6d6979a06980d927b7054c473182
Author: Konstantin Khorenko <khorenko at virtuozzo.com>
Date:   Fri Oct 1 19:38:43 2021 +0300

    ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups
    
    This patch is a part of vz7 commit (only avenrun part)
    34a1dc1e4e3d ("sched: Account task_group::cpustat,taskstats,avenrun")
    
      Extracted from "Initial patch".
    
      Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
      +++
      ve/sched: Do not use kstat_glb_lock to update kstat_glob::nr_unint_avg
    
      kstat_glob::nr_unint_avg can't be updated in parallel on two or
      more cpus, so on modifications we have to protect against readers
      only.
    
      So, avoid using the global kstat_glb_lock here, to minimize
      its sharing with the other counters it protects.
    
      Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    (cherry picked from commit 715f311fdb4ab0b7922f9e53617c5821ae36bfaf)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched/ve: Use cfs_rq::h_nr_running to count loadavg
    
    cfs_rq::nr_running contains the number of child entities
    one level below: tasks and child cfs_rqs, but it does not
    count tasks from deeper levels.
    
    Use cfs_rq::h_nr_running instead, as it counts the tasks
    in the whole child hierarchy.
    
    https://jira.sw.ru/browse/PSBM-81572
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    mFixes: 028c54e613a3 ("sched: Account task_group::avenrun")
    
    (cherry picked from vz7 commit 5f2a49a05629bd709ad6bfce83bfacc58a4db3d9)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched/ve: Iterate only VE root cpu cgroups to count loadavg
    
    When counting loadavg we are interested in VE root cpu
    cgroups only, as they are the analogue of the node's loadavg.
    
    So, this patch makes us iterate only such cpu cgroups
    when we calculate loadavg.
    
    Since this code is called from interrupt context, this may
    give a positive performance result.
    
    https://jira.sw.ru/browse/PSBM-81572
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    (cherry picked from vz7 commit 4140a241e5ec2230105f5c4513400a6b5ecea92f)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched: Export calc_load_ve()
    
    This will be used in next patch.
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    =========================
    Patchset description:
    Make calc_load_ve() be executed out of jiffies_lock
    
    https://jira.sw.ru/browse/PSBM-84967
    
    Kirill Tkhai (3):
          sched: Make calc_global_load() return true when it's need to update ve statistic
          sched: Export calc_load_ve()
          sched: Call calc_load_ve() out of jiffies_lock
    
    (cherry picked from vz7 commit 738b92fb2cdd6577925a6b7019925f320cd379df)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    sched: Call calc_load_ve() out of jiffies_lock
    
    jiffies_lock is a big global seqlock, which is used in many
    places. In combination with other actions like smp call
    functions and readers of this seqlock, the system may hang
    for a long time. There has already been a pair of hard
    lockups caused by the long iteration in calc_load_ve() with
    jiffies_lock held, which made readers of this seqlock spin
    for a long time.
    
    This patch makes calc_load_ve() use a separate lock, which
    relaxes jiffies_lock. I think this should be enough to
    resolve the problem, since both of the crashes I saw contain
    readers of the seqlock on parallel cpus, and we won't have
    to relax further (say, by moving calc_load_ve() to a softirq).
    
    Note that the principal change this patch makes is that
    jiffies_lock readers on parallel cpus won't wait till
    calc_load_ve() finishes, so instead of (n_readers + 1) cpus
    waiting till this function completes, there will be only
    1 cpu doing that.
    
    https://jira.sw.ru/browse/PSBM-84967
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    
    +++
    sched: really don't call calc_load_ve() under jiffies_lock
    
    Previously we did all the preparation work for calc_load_ve()
    not to be executed under jiffies_lock, and thus not to be
    called from calc_global_load(), but forgot to drop the call
    in calc_global_load(). So we still call the expensive
    calc_load_ve() under jiffies_lock and get an NMI.
    
    Fix that.
    
    mFixes: 19bc294a5691d ("sched: Call calc_load_ve() out of jiffies_lock")
    
    https://jira.sw.ru/browse/PSBM-102573
    
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>
    
    (cherry picked from vz7 commit 0610b98e5b6537d2ecd99522c3cbd1aa939565e7)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()
    
    Rename get_avenrun_ve() to get_avenrun_tg() and provide it
    with a task_group argument, so that it can later be used for
    any VE, not only for the current one.
    
    mFixes: f52cf2752bca ("ve/sched/loadavg: Calculate avenrun for Containers
    root cpu cgroups")
    
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    Cherry-picked from vz8 commit 28ee125cda80 ("ve/sched/loadavg:
    Calculate avenrun for Containers root cpu cgroups")
    
    Followed changes in tick_do_update_jiffies64().
    
    xtime_update() was folded into legacy_timer_tick(), which is
    used only on legacy targets, and none of those can be used
    with CONFIG_VE. So don't add a call to calc_load_ve() there.
    
    Signed-off-by: Nikita Yushchenko <nikita.yushchenko at virtuozzo.com>
---
 include/linux/sched/loadavg.h |  6 +++++
 kernel/sched/loadavg.c        | 58 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h          |  1 +
 kernel/time/tick-common.c     |  9 ++++++-
 kernel/time/tick-sched.c      |  5 +++-
 5 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 58cca1cef579..bcde43cda5b2 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -47,4 +47,10 @@ extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
 
 extern bool calc_global_load(void);
 
+#ifdef CONFIG_VE
+extern void calc_load_ve(void);
+#else
+#define calc_load_ve() do { } while (0)
+#endif
+
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 62a463cb5cab..c26db6cef1b5 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -76,6 +76,22 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
 	loads[2] = (avenrun[2] + offset) << shift;
 }
 
+int get_avenrun_tg(struct task_group *tg, unsigned long *loads,
+		   unsigned long offset, int shift)
+{
+	/* Get current tg if not provided. */
+	tg = tg ? tg : task_group(current);
+
+	if (tg == &root_task_group)
+		return -ENOSYS;
+
+	loads[0] = (tg->avenrun[0] + offset) << shift;
+	loads[1] = (tg->avenrun[1] + offset) << shift;
+	loads[2] = (tg->avenrun[2] + offset) << shift;
+
+	return 0;
+}
+
 long calc_load_fold_active(struct rq *this_rq, long adjust)
 {
 	long nr_active, delta = 0;
@@ -91,6 +107,48 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+#ifdef CONFIG_VE
+extern struct list_head ve_root_list;
+extern spinlock_t load_ve_lock;
+
+void calc_load_ve(void)
+{
+	unsigned long nr_active;
+	struct task_group *tg;
+	int i;
+
+	/*
+	 * This is called without jiffies_lock, and here we protect
+	 * against very rare parallel execution on two or more cpus.
+	 */
+	spin_lock(&load_ve_lock);
+	list_for_each_entry(tg, &ve_root_list, ve_root_list) {
+		nr_active = 0;
+		for_each_possible_cpu(i) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+			nr_active += tg->cfs_rq[i]->h_nr_running;
+			/*
+			 * We do not export nr_unint to parent task groups
+			 * like we do for h_nr_running, as it gives additional
+			 * overhead for activate/deactivate operations. So, we
+			 * don't account child cgroup unint tasks here.
+			 */
+			nr_active += tg->cfs_rq[i]->nr_unint;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+			nr_active += tg->rt_rq[i]->rt_nr_running;
+#endif
+		}
+		nr_active *= FIXED_1;
+
+		tg->avenrun[0] = calc_load(tg->avenrun[0], EXP_1, nr_active);
+		tg->avenrun[1] = calc_load(tg->avenrun[1], EXP_5, nr_active);
+		tg->avenrun[2] = calc_load(tg->avenrun[2], EXP_15, nr_active);
+	}
+	spin_unlock(&load_ve_lock);
+}
+#endif /* CONFIG_VE */
+
 /**
  * fixed_power_int - compute: x^n, in O(log n) time
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5ea3a045e6d1..88eec808ae42 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -434,6 +434,7 @@ struct task_group {
 	struct list_head ve_root_list;
 #endif
 
+	unsigned long avenrun[3];	/* loadavg data */
 	/* Monotonic time in nsecs: */
 	u64			start_time;
 
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d663249652ef..8c0b27efac0c 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -15,6 +15,7 @@
 #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/sched.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <trace/events/power.h>
 
@@ -85,15 +86,21 @@ int tick_is_oneshot_available(void)
 static void tick_periodic(int cpu)
 {
 	if (tick_do_timer_cpu == cpu) {
+		bool calc_ve;
+
 		raw_spin_lock(&jiffies_lock);
 		write_seqcount_begin(&jiffies_seq);
 
 		/* Keep track of the next tick event */
 		tick_next_period = ktime_add_ns(tick_next_period, TICK_NSEC);
 
-		do_timer(1);
+		calc_ve = do_timer(1);
 		write_seqcount_end(&jiffies_seq);
 		raw_spin_unlock(&jiffies_lock);
+
+		if (calc_ve)
+			calc_load_ve();
+
 		update_wall_time();
 	}
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5af8cb1..3c37942befdc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -58,6 +58,7 @@ static void tick_do_update_jiffies64(ktime_t now)
 {
 	unsigned long ticks = 1;
 	ktime_t delta, nextp;
+	bool calc_ve;
 
 	/*
 	 * 64bit can do a quick check without holding jiffies lock and
@@ -145,9 +146,11 @@ static void tick_do_update_jiffies64(ktime_t now)
 	 */
 	write_seqcount_end(&jiffies_seq);
 
-	calc_global_load();
+	calc_ve = calc_global_load();
 
 	raw_spin_unlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
 

