[Devel] [PATCH rh8 3/6] ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups

Konstantin Khorenko khorenko at virtuozzo.com
Thu Oct 22 15:54:49 MSK 2020


This patch is a part of vz7 commit (only avenrun part)
34a1dc1e4e3d ("sched: Account task_group::cpustat,taskstats,avenrun")

  Extracted from "Initial patch".

  Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

  +++
  ve/sched: Do not use kstat_glb_lock to update kstat_glob::nr_unint_avg

  kstat_glob::nr_unint_avg can't be updated in parallel on two or
  more cpus, so on modifications we only have to protect against
  readers.

  So, avoid using the global kstat_glb_lock here, to minimize its
  sharing with the other counters it protects.

  Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

(cherry picked from commit 715f311fdb4ab0b7922f9e53617c5821ae36bfaf)
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

+++
sched/ve: Use cfs_rq::h_nr_running to count loadavg

cfs_rq::nr_running contains the number of child entities
one level below: tasks and cfs_rqs, but it does not
include tasks from deeper levels.

Use cfs_rq::h_nr_running instead, as it contains the number
of tasks across the whole child hierarchy.

https://jira.sw.ru/browse/PSBM-81572

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>

Fixes: 028c54e613a3 ("sched: Account task_group::avenrun")

(cherry picked from vz7 commit 5f2a49a05629bd709ad6bfce83bfacc58a4db3d9)
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

+++
sched/ve: Iterate only VE root cpu cgroups to count loadavg

When counting loadavg we are interested in VE root cpu cgroups
only, as they are the analogue of the node's loadavg.

So, this patch iterates only over such cpu cgroups when
calculating loadavg.

Since this code is called from interrupt context, this may give
a positive performance result.

https://jira.sw.ru/browse/PSBM-81572

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>

(cherry picked from vz7 commit 4140a241e5ec2230105f5c4513400a6b5ecea92f)
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

+++
sched: Export calc_load_ve()

This will be used in the next patch.

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

=========================
Patchset description:
Make calc_load_ve() be executed out of jiffies_lock

https://jira.sw.ru/browse/PSBM-84967

Kirill Tkhai (3):
      sched: Make calc_global_load() return true when it's need to update ve statistic
      sched: Export calc_load_ve()
      sched: Call calc_load_ve() out of jiffies_lock

(cherry picked from vz7 commit 738b92fb2cdd6577925a6b7019925f320cd379df)
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

+++
sched: Call calc_load_ve() out of jiffies_lock

jiffies_lock is a big global seqlock, which is used in many
places. In combination with other actions like SMP function
calls and readers of this seqlock, the system may hang for
a long time. There has already been a pair of hard lockups
caused by the long iteration in calc_load_ve() with jiffies_lock
held, which made readers of this seqlock spin for a long time.

This patch makes calc_load_ve() use a separate lock,
and this relaxes jiffies_lock. I think this should be enough
to resolve the problem, since both of the crashes I saw contain
readers of the seqlock on parallel cpus, and we won't have
to relax it further (say, by moving calc_load_ve() to softirq).

Note that the principal change this patch makes is that
jiffies_lock readers on parallel cpus won't wait till calc_load_ve()
finishes, so instead of (n_readers + 1) cpus waiting till
this function completes, there will be only 1 cpu doing that.

https://jira.sw.ru/browse/PSBM-84967

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

=========================
Patchset description:
Make calc_load_ve() be executed out of jiffies_lock

https://jira.sw.ru/browse/PSBM-84967

Kirill Tkhai (3):
      sched: Make calc_global_load() return true when it's need to update ve statistic
      sched: Export calc_load_ve()
      sched: Call calc_load_ve() out of jiffies_lock

+++
sched: really don't call calc_load_ve() under jiffies_lock

Previously we did all the preparation work for calc_load_ve() not
being executed under jiffies_lock, and thus not called from
calc_global_load(), but forgot to drop the call in calc_global_load().
So we still call the expensive calc_load_ve() under jiffies_lock and
get an NMI.

Fix that.

Fixes: 19bc294a5691d ("sched: Call calc_load_ve() out of jiffies_lock")

https://jira.sw.ru/browse/PSBM-102573

Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
Signed-off-by: Valeriy Vdovin <valeriy.vdovin at virtuozzo.com>

(cherry picked from vz7 commit 0610b98e5b6537d2ecd99522c3cbd1aa939565e7)
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
 include/linux/sched/loadavg.h |  8 ++++++
 kernel/sched/loadavg.c        | 50 +++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h          |  1 +
 kernel/time/tick-common.c     |  9 ++++++-
 kernel/time/tick-sched.c      |  6 ++++-
 kernel/time/timekeeping.c     |  5 +++-
 6 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 34061919f880..1da5768389b7 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -16,6 +16,8 @@
  */
 extern unsigned long avenrun[];		/* Load averages */
 extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
+extern void get_avenrun_ve(unsigned long *loads,
+			   unsigned long offset, int shift);
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -47,4 +49,10 @@ extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
 
 extern bool calc_global_load(unsigned long ticks);
 
+#ifdef CONFIG_VE
+extern void calc_load_ve(void);
+#else
+#define calc_load_ve() do { } while (0)
+#endif
+
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a7b373053dc4..c62f34033112 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -76,6 +76,14 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
 	loads[2] = (avenrun[2] + offset) << shift;
 }
 
+void get_avenrun_ve(unsigned long *loads, unsigned long offset, int shift)
+{
+	struct task_group *tg = task_group(current);
+	loads[0] = (tg->avenrun[0] + offset) << shift;
+	loads[1] = (tg->avenrun[1] + offset) << shift;
+	loads[2] = (tg->avenrun[2] + offset) << shift;
+}
+
 long calc_load_fold_active(struct rq *this_rq, long adjust)
 {
 	long nr_active, delta = 0;
@@ -91,6 +99,48 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+#ifdef CONFIG_VE
+extern struct list_head ve_root_list;
+extern spinlock_t load_ve_lock;
+
+void calc_load_ve(void)
+{
+	unsigned long nr_active;
+	struct task_group *tg;
+	int i;
+
+	/*
+	 * This is called without jiffies_lock, and here we protect
+	 * against very rare parallel execution on two or more cpus.
+	 */
+	spin_lock(&load_ve_lock);
+	list_for_each_entry(tg, &ve_root_list, ve_root_list) {
+		nr_active = 0;
+		for_each_possible_cpu(i) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+			nr_active += tg->cfs_rq[i]->h_nr_running;
+			/*
+			 * We do not export nr_unint to parent task groups
+			 * like we do for h_nr_running, as it gives additional
+			 * overhead for activate/deactivate operations. So, we
+			 * don't account child cgroup unint tasks here.
+			 */
+			nr_active += tg->cfs_rq[i]->nr_unint;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+			nr_active += tg->rt_rq[i]->rt_nr_running;
+#endif
+		}
+		nr_active *= FIXED_1;
+
+		tg->avenrun[0] = calc_load(tg->avenrun[0], EXP_1, nr_active);
+		tg->avenrun[1] = calc_load(tg->avenrun[1], EXP_5, nr_active);
+		tg->avenrun[2] = calc_load(tg->avenrun[2], EXP_15, nr_active);
+	}
+	spin_unlock(&load_ve_lock);
+}
+#endif /* CONFIG_VE */
+
 /**
  * fixed_power_int - compute: x^n, in O(log n) time
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93bf1d78c27d..3f1e5ba43910 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -408,6 +408,7 @@ struct task_group {
 	struct list_head ve_root_list;
 #endif
 
+	unsigned long avenrun[3];	/* loadavg data */
 	/* Monotonic time in nsecs: */
 	u64			start_time;
 
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 61ce3505c195..47a9e0719ee8 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/sched.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <trace/events/power.h>
 
@@ -87,13 +88,19 @@ int tick_is_oneshot_available(void)
 static void tick_periodic(int cpu)
 {
 	if (tick_do_timer_cpu == cpu) {
+		bool calc_ve;
+
 		write_seqlock(&jiffies_lock);
 
 		/* Keep track of the next tick event */
 		tick_next_period = ktime_add(tick_next_period, tick_period);
 
-		do_timer(1);
+		calc_ve = do_timer(1);
 		write_sequnlock(&jiffies_lock);
+
+		if (calc_ve)
+			calc_load_ve();
+
 		update_wall_time();
 	}
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4380af8ac923..5f265f7cce76 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -57,6 +58,7 @@ static ktime_t last_jiffies_update;
 static void tick_do_update_jiffies64(ktime_t now)
 {
 	unsigned long ticks = 0;
+	bool calc_ve = false;
 	ktime_t delta;
 
 	/*
@@ -85,7 +87,7 @@ static void tick_do_update_jiffies64(ktime_t now)
 			last_jiffies_update = ktime_add_ns(last_jiffies_update,
 							   incr * ticks);
 		}
-		do_timer(++ticks);
+		calc_ve = do_timer(++ticks);
 
 		/* Keep the tick_next_period variable up to date */
 		tick_next_period = ktime_add(last_jiffies_update, tick_period);
@@ -94,6 +96,8 @@ static void tick_do_update_jiffies64(ktime_t now)
 		return;
 	}
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index bce92a9952f4..3b6500c5a357 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2398,8 +2398,11 @@ EXPORT_SYMBOL(hardpps);
  */
 void xtime_update(unsigned long ticks)
 {
+	bool calc_ve;
 	write_seqlock(&jiffies_lock);
-	do_timer(ticks);
+	calc_ve = do_timer(ticks);
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
-- 
2.28.0


