[Devel] [PATCH RHEL7 COMMIT] sched: Port cpustat related patches

Konstantin Khorenko khorenko at virtuozzo.com
Thu Jun 4 05:58:50 PDT 2015


The commit is pushed to "branch-rh7-3.10.0-123.1.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.9
------>
commit f1354eb35858c7229b4922bdeef009f744f10be0
Author: Vladimir Davydov <vdavydov at parallels.com>
Date:   Thu Jun 4 16:58:50 2015 +0400

    sched: Port cpustat related patches
    
    This patch ports:
    
    diff-sched-rework-_proc_stat-output
    diff-sched-fix-output-of-vestat-idle
    diff-sched-make-allowance-for-vcpu-rate-in-_proc_stat
    diff-sched-hide-steal-time-from-inside-CT
    diff-sched-cpu.proc.stat-always-count-nr_running-and-co-on-all-cpus
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: rework /proc/stat output
    Date: Thu, 29 May 2014 11:40:50 +0400
    
    Initially we mapped the usage pct of physical cpu i to vcpu
    (i % nr_vcpus). Obviously, if a CT is running on physical cpus that
    are equal modulo nr_vcpus, we'll miss usage on one or more vcpus.
    E.g., if there is a 2-vcpu CT with several cpu-eaters running on
    physical cpus 0 and 2, we'll get vcpu 0 200%-busy and vcpu 1 idling.
    
    To fix that, we changed the behavior so that vcpu i's usage equals the
    total cpu usage divided by nr_vcpus. That led to customers'
    dissatisfaction, because such an algorithm makes it obvious that the
    per-vcpu numbers are fake.
    
    So now we're going back to the first algorithm, but if the usage of
    one of the vcpus turns out to be greater than the absolute time delta,
    we "move" the usage excess to other vcpus, so that one vcpu never
    consumes more than one pcpu. E.g., in the situation described above,
    we move 100% of vcpu 0's time to vcpu 1, so that both vcpus end up
    100%-busy.
    
    To achieve that, we serialize access to /proc/stat and make readers
    update the stats based on the pcpu usage delta, so that per-vcpu usage
    can be fixed up to be <= 100% and idle time calculated accordingly.
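    
    A rough, simplified sketch of the capping idea (identifiers are made
    up; the real logic is the two-pass loop in cpu_cgroup_update_vcpustat
    and fixup_vcpustat_delta in the diff below, which also spreads the
    excess over all under-loaded vcpus and splits it between user, nice
    and system time):
    
            /* cap each vcpu's usage delta at the wall-clock delta,
             * carrying the excess over to the remaining vcpus */
            u64 excess = 0;
            for (i = 0; i < nr_vcpus; i++) {
                    u64 usage = vcpu_delta[i] + excess;
    
                    excess = usage > abs_delta ? usage - abs_delta : 0;
                    vcpu_delta[i] = min(usage, abs_delta);
            }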
    
    https://jira.sw.ru/browse/PSBM-26714
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Acked-by: Kirill Tkhai <ktkhai at parallels.com>
    =============================================================================
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: fix output of vestat:idle
    Date: Tue, 22 Jul 2014 12:25:24 +0400
    
    /proc/vz/vestat must report the virtualized idle time, but since
    commit diff-sched-rework-_proc_stat-output it has been showing the
    total time CTs have been idling on all physical cpus. This is because
    in cpu_cgroup_get_stat we use task_group->cpustat instead of vcpustat.
    Fix it.
    
    https://jira.sw.ru/browse/PSBM-28403
    https://bugzilla.openvz.org/show_bug.cgi?id=3035
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Acked-by: Kirill Tkhai <ktkhai at parallels.com>
    =============================================================================
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: make allowance for vcpu rate in /proc/stat
    Date: Wed, 13 Aug 2014 15:44:36 +0400
    
    Currently, if cpulimit < cpus * 100 for a CT, we can't get 100% cpu
    usage as reported by the CT's /proc/stat. This is because when
    reporting cpu usage statistics we consider the maximal cpu time a CT
    can get to be equal to cpus * 100, so in the above-mentioned setup
    there will always be idle time. This confuses our customers, because
    before commit diff-sched-rework-_proc_stat-output it was possible to
    get 100% usage inside a CT irrespective of its cpulimit/cpus setup.
    So let's fix it by defining the maximal cpu usage a CT can get to be
    equal to cpulimit.
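    
    A rough sketch of the resulting per-vcpu cap (this mirrors the
    vcpu_rate scaling in cpu_cgroup_update_vcpustat in the diff below;
    units and conversions are simplified, and the numbers in the comment
    are hypothetical):
    
            /* e.g. a 1.5-pcpu limit split across 2 vcpus caps each
             * vcpu at 0.75 of the wall-clock delta */
            vcpu_rate = DIV_ROUND_UP(tg->cpu_rate, nr_vcpus);
            max_usage = div_u64(abs_delta * vcpu_rate, MAX_CPU_RATE);
    
    so /proc/stat inside the CT reaches 100% busy exactly when the CT
    consumes its full cpulimit.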
    
    https://jira.sw.ru/browse/PSBM-28500
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Acked-by: Kirill Tkhai <ktkhai at parallels.com>
    =============================================================================
    
    Author: Kirill Tkhai
    Email: ktkhai at parallels.com
    Subject: sched: Hide steal time from inside CT
    Date: Tue, 6 May 2014 18:27:40 +0400
    
    https://jira.sw.ru/browse/PSBM-26587
    
    From the BUG by Khorenko Konstantin:
    
            "FastVPS complains on incorrect idle time calculation
            (PSBM-23431) and about high _steal_ time reported inside a CT.
            Steal time is a time when a CT was ready to run on a physical CPU,
            but the CPU was busy with processes which belong to another CT.
            => in case we have 10 CTs which eat cpu time as much as possible,
            then steal time in each CT will be 90% and execution time 10% only.
    
    	i suggest to make steal time always shown as 0 inside Containers in
    	order not to confuse end users.  At the same time steal time for
    	Containers should be visible from the HN (host), this is useful for
    	development.
    
            No objections neither from Pasha Emelyanov nor from Maik".
    
    So we do this in order not to scare users.
    
    Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
    
    Acked-by: Vladimir Davydov <vdavydov at parallels.com>
    =============================================================================
    
    Author: Vladimir Davydov
    Email: vdavydov at parallels.com
    Subject: sched: cpu.proc.stat: always count nr_running and co on all cpus
    Date: Wed, 30 Jul 2014 13:32:20 +0400
    
    Currently we count them only on cpus 0..nr_vcpus, which is obviously
    wrong, because those numbers are kept in a per-pcpu structure.
    
    https://jira.sw.ru/browse/PSBM-28277
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    =============================================================================
    
    Related to https://jira.sw.ru/browse/PSBM-33642
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 fs/proc/uptime.c            |   2 +-
 include/linux/kernel_stat.h |  36 +++++
 kernel/sched/core.c         | 313 ++++++++++++++++++++++++++++++++++----------
 kernel/sched/fair.c         |  61 +--------
 kernel/sched/sched.h        |  25 +---
 kernel/ve/vecalls.c         |  13 +-
 6 files changed, 298 insertions(+), 152 deletions(-)

diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index dd80f1b..3c49c19 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -30,7 +30,7 @@ static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
 	struct kernel_cpustat kstat;
 
 	cpu_cgroup_get_stat(cgrp, &kstat);
-	*idle = ns_to_timespec(kstat.cpustat[CPUTIME_IDLE]);
+	cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
 }
 
 static int uptime_proc_show(struct seq_file *m, void *v)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index d105ab3..0086f43 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -36,6 +36,42 @@ struct kernel_cpustat {
 	u64 cpustat[NR_STATS];
 };
 
+static inline u64 kernel_cpustat_total_usage(const struct kernel_cpustat *p)
+{
+	return p->cpustat[CPUTIME_USER] + p->cpustat[CPUTIME_NICE] +
+		p->cpustat[CPUTIME_SYSTEM];
+}
+
+static inline u64 kernel_cpustat_total_idle(const struct kernel_cpustat *p)
+{
+	return p->cpustat[CPUTIME_IDLE] + p->cpustat[CPUTIME_IOWAIT];
+}
+
+static inline void kernel_cpustat_zero(struct kernel_cpustat *p)
+{
+	memset(p, 0, sizeof(*p));
+}
+
+static inline void kernel_cpustat_add(const struct kernel_cpustat *lhs,
+				      const struct kernel_cpustat *rhs,
+				      struct kernel_cpustat *res)
+{
+	int i;
+
+	for (i = 0; i < NR_STATS; i++)
+		res->cpustat[i] = lhs->cpustat[i] + rhs->cpustat[i];
+}
+
+static inline void kernel_cpustat_sub(const struct kernel_cpustat *lhs,
+				      const struct kernel_cpustat *rhs,
+				      struct kernel_cpustat *res)
+{
+	int i;
+
+	for (i = 0; i < NR_STATS; i++)
+		res->cpustat[i] = lhs->cpustat[i] - rhs->cpustat[i];
+}
+
 struct kernel_stat {
 #ifndef CONFIG_GENERIC_HARDIRQS
        unsigned int irqs[NR_IRQS];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d562f64..5a38d1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7695,6 +7695,8 @@ static void free_sched_group(struct task_group *tg)
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
 	free_percpu(tg->taskstats);
+	kfree(tg->cpustat_last);
+	kfree(tg->vcpustat);
 	kfree(tg);
 }
 
@@ -7717,6 +7719,19 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!tg->taskstats)
 		goto err;
 
+	tg->cpustat_last = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+				   GFP_KERNEL);
+	if (!tg->cpustat_last)
+		goto err;
+
+	tg->vcpustat = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+			       GFP_KERNEL);
+	if (!tg->vcpustat)
+		goto err;
+
+	tg->vcpustat_last_update = ktime_set(0, 0);
+	spin_lock_init(&tg->vcpustat_lock);
+
 	/* start_timespec is saved CT0 uptime */
 	do_posix_clock_monotonic_gettime(&tg->start_time);
 	monotonic_to_bootbased(&tg->start_time);
@@ -8333,7 +8348,6 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	cfs_b->quota = quota;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
-	update_cfs_bandwidth_idle_scale(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */
@@ -8661,19 +8675,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static u64 cpu_cgroup_usage_cpu(struct task_group *tg, int i)
-{
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
-	/* root_task_group has not sched entities */
-	if (tg == &root_task_group)
-		return cpu_rq(i)->rq_cpu_time;
-
-	return tg->se[i]->sum_exec_runtime;
-#else
-	return 0;
-#endif
-}
-
 static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
 {
 #if defined(CONFIG_SCHEDSTATS) && defined(CONFIG_FAIR_GROUP_SCHED)
@@ -8681,7 +8682,7 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
 	struct sched_entity *se = tg->se[i];
 	struct kernel_cpustat *kcpustat = cpuacct_cpustat(cgrp, i);
 	u64 now = cpu_clock(i);
-	u64 delta, idle, iowait;
+	u64 delta, idle, iowait, steal, used;
 
 	/* root_task_group has not sched entities */
 	if (tg == &root_task_group)
@@ -8689,7 +8690,8 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
 
 	iowait = se->statistics.iowait_sum;
 	idle = se->statistics.sum_sleep_runtime;
-	kcpustat->cpustat[CPUTIME_STEAL] = se->statistics.wait_sum;
+	steal = se->statistics.wait_sum;
+	used = se->sum_exec_runtime;
 
 	if (idle > iowait)
 		idle -= iowait;
@@ -8699,35 +8701,210 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
 	if (se->statistics.sleep_start) {
 		delta = now - se->statistics.sleep_start;
 		if ((s64)delta > 0)
-			idle += SCALE_IDLE_TIME(delta, se);
+			idle += delta;
 	} else if (se->statistics.block_start) {
 		delta = now - se->statistics.block_start;
 		if ((s64)delta > 0)
-			iowait += SCALE_IDLE_TIME(delta, se);
+			iowait += delta;
 	} else if (se->statistics.wait_start) {
 		delta = now - se->statistics.wait_start;
 		if ((s64)delta > 0)
-			kcpustat->cpustat[CPUTIME_STEAL] += delta;
+			steal += delta;
 	}
 
 	kcpustat->cpustat[CPUTIME_IDLE] =
-		max(kcpustat->cpustat[CPUTIME_IDLE], idle);
+			max(kcpustat->cpustat[CPUTIME_IDLE],
+			    nsecs_to_cputime(idle));
 	kcpustat->cpustat[CPUTIME_IOWAIT] =
-		max(kcpustat->cpustat[CPUTIME_IOWAIT], iowait);
-
-	kcpustat->cpustat[CPUTIME_USED] = cpu_cgroup_usage_cpu(tg, i);
+			max(kcpustat->cpustat[CPUTIME_IOWAIT],
+			    nsecs_to_cputime(iowait));
+	kcpustat->cpustat[CPUTIME_STEAL] = nsecs_to_cputime(steal);
+	kcpustat->cpustat[CPUTIME_USED] = nsecs_to_cputime(used);
 #endif
 }
 
+static void fixup_vcpustat_delta_usage(struct kernel_cpustat *cur,
+				       struct kernel_cpustat *rem, int ind,
+				       u64 cur_usage, u64 target_usage,
+				       u64 rem_usage)
+{
+	s64 scaled_val;
+	u32 scale_pct = 0;
+
+	/* distribute the delta among USER, NICE, and SYSTEM proportionally */
+	if (cur_usage < target_usage) {
+		if ((s64)rem_usage > 0) /* sanity check to avoid div/0 */
+			scale_pct = div64_u64(100 * rem->cpustat[ind],
+					      rem_usage);
+	} else {
+		if ((s64)cur_usage > 0) /* sanity check to avoid div/0 */
+			scale_pct = div64_u64(100 * cur->cpustat[ind],
+					      cur_usage);
+	}
+
+	scaled_val = div_s64(scale_pct * (target_usage - cur_usage), 100);
+
+	cur->cpustat[ind] += scaled_val;
+	if ((s64)cur->cpustat[ind] < 0)
+		cur->cpustat[ind] = 0;
+
+	rem->cpustat[ind] -= scaled_val;
+	if ((s64)rem->cpustat[ind] < 0)
+		rem->cpustat[ind] = 0;
+}
+
+static void calc_vcpustat_delta_idle(struct kernel_cpustat *cur,
+				     int ind, u64 cur_idle, u64 target_idle)
+{
+	/* distribute target_idle between IDLE and IOWAIT proportionally to
+	 * what we initially had on this vcpu */
+	if ((s64)cur_idle > 0) {
+		u32 scale_pct = div64_u64(100 * cur->cpustat[ind], cur_idle);
+		cur->cpustat[ind] = div_u64(scale_pct * target_idle, 100);
+	} else {
+		cur->cpustat[ind] = ind == CPUTIME_IDLE ? target_idle : 0;
+	}
+}
+
+static void fixup_vcpustat_delta(struct kernel_cpustat *cur,
+				 struct kernel_cpustat *rem,
+				 u64 max_usage)
+{
+	u64 cur_usage, target_usage, rem_usage;
+	u64 cur_idle, target_idle;
+
+	cur_usage = kernel_cpustat_total_usage(cur);
+	rem_usage = kernel_cpustat_total_usage(rem);
+
+	target_usage = min(cur_usage + rem_usage,
+			   max_usage);
+
+	if (cur_usage != target_usage) {
+		fixup_vcpustat_delta_usage(cur, rem, CPUTIME_USER,
+				cur_usage, target_usage, rem_usage);
+		fixup_vcpustat_delta_usage(cur, rem, CPUTIME_NICE,
+				cur_usage, target_usage, rem_usage);
+		fixup_vcpustat_delta_usage(cur, rem, CPUTIME_SYSTEM,
+				cur_usage, target_usage, rem_usage);
+	}
+
+	cur_idle = kernel_cpustat_total_idle(cur);
+	target_idle = max_usage - target_usage;
+
+	if (cur_idle != target_idle) {
+		calc_vcpustat_delta_idle(cur, CPUTIME_IDLE,
+					 cur_idle, target_idle);
+		calc_vcpustat_delta_idle(cur, CPUTIME_IOWAIT,
+					 cur_idle, target_idle);
+	}
+
+	cur->cpustat[CPUTIME_USED] = target_usage;
+
+	/* do not show steal time inside ve */
+	cur->cpustat[CPUTIME_STEAL] = 0;
+}
+
+static void cpu_cgroup_update_vcpustat(struct cgroup *cgrp)
+{
+	int i, j;
+	int nr_vcpus;
+	int vcpu_rate;
+	ktime_t now;
+	u64 abs_delta_ns, max_usage;
+	struct kernel_cpustat stat_delta, stat_rem;
+	struct task_group *tg = cgroup_tg(cgrp);
+	int first_pass = 1;
+
+	spin_lock(&tg->vcpustat_lock);
+
+	now = ktime_get();
+	nr_vcpus = tg->nr_cpus ?: num_online_cpus();
+	vcpu_rate = DIV_ROUND_UP(tg->cpu_rate, nr_vcpus);
+	if (!vcpu_rate || vcpu_rate > MAX_CPU_RATE)
+		vcpu_rate = MAX_CPU_RATE;
+
+	if (!ktime_to_ns(tg->vcpustat_last_update)) {
+		/* on the first read initialize vcpu i stat as a sum of stats
+		 * over pcpus j such that j % nr_vcpus == i */
+		for (i = 0; i < nr_vcpus; i++) {
+			for (j = i; j < nr_cpu_ids; j += nr_vcpus) {
+				if (!cpu_possible(j))
+					continue;
+				kernel_cpustat_add(tg->vcpustat + i,
+						   cpuacct_cpustat(cgrp, j),
+						   tg->vcpustat + i);
+			}
+		}
+		goto out_update_last;
+	}
+
+	abs_delta_ns = ktime_to_ns(ktime_sub(now, tg->vcpustat_last_update));
+	max_usage = nsecs_to_cputime(abs_delta_ns);
+	max_usage = div_u64(max_usage * vcpu_rate, MAX_CPU_RATE);
+	/* don't allow to update stats too often to avoid calculation errors */
+	if (max_usage < 10)
+		goto out_unlock;
+
+	/* temporarily copy per cpu usage delta to tg->cpustat_last */
+	for_each_possible_cpu(i)
+		kernel_cpustat_sub(cpuacct_cpustat(cgrp, i),
+				   tg->cpustat_last + i,
+				   tg->cpustat_last + i);
+
+	/* proceed to calculating per vcpu delta */
+	kernel_cpustat_zero(&stat_rem);
+
+again:
+	for (i = 0; i < nr_vcpus; i++) {
+		int exceeds_max;
+
+		kernel_cpustat_zero(&stat_delta);
+		for (j = i; j < nr_cpu_ids; j += nr_vcpus) {
+			if (!cpu_possible(j))
+				continue;
+			kernel_cpustat_add(&stat_delta,
+					   tg->cpustat_last + j, &stat_delta);
+		}
+
+		exceeds_max = kernel_cpustat_total_usage(&stat_delta) >=
+								max_usage;
+		/*
+		 * On the first pass calculate delta for vcpus with usage >
+		 * max_usage in order to accumulate excess in stat_rem.
+		 *
+		 * Once the remainder is accumulated, proceed to the rest of
+		 * vcpus so that it will be distributed among them.
+		*/
+		if (exceeds_max != first_pass)
+			continue;
+
+		fixup_vcpustat_delta(&stat_delta, &stat_rem, max_usage);
+		kernel_cpustat_add(tg->vcpustat + i, &stat_delta,
+				   tg->vcpustat + i);
+	}
+
+	if (first_pass) {
+		first_pass = 0;
+		goto again;
+	}
+out_update_last:
+	for_each_possible_cpu(i)
+		tg->cpustat_last[i] = *cpuacct_cpustat(cgrp, i);
+	tg->vcpustat_last_update = now;
+out_unlock:
+	spin_unlock(&tg->vcpustat_lock);
+}
+
 int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 				struct seq_file *p)
 {
-	int i, j;
-	unsigned int nr_ve_vcpus = num_online_vcpus();
+	int i;
 	unsigned long jif;
 	u64 user, nice, system, idle, iowait, steal;
 	struct timespec boottime;
 	struct task_group *tg = cgroup_tg(cgrp);
+	bool virt = !ve_is_super(get_exec_env()) && tg != &root_task_group;
+	int nr_vcpus = tg->nr_cpus ?: num_online_cpus();
 	struct kernel_cpustat *kcpustat;
 	unsigned long tg_nr_running = 0;
 	unsigned long tg_nr_iowait = 0;
@@ -8737,20 +8914,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	getboottime(&boottime);
 	jif = boottime.tv_sec + tg->start_time.tv_sec;
 
-	user = nice = system = idle = iowait = steal = 0;
-
 	for_each_possible_cpu(i) {
-		kcpustat = cpuacct_cpustat(cgrp, i);
-
 		cpu_cgroup_update_stat(cgrp, i);
 
-		user += kcpustat->cpustat[CPUTIME_USER];
-		nice += kcpustat->cpustat[CPUTIME_NICE];
-		system += kcpustat->cpustat[CPUTIME_SYSTEM];
-		idle += kcpustat->cpustat[CPUTIME_IDLE];
-		iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
-		steal += kcpustat->cpustat[CPUTIME_STEAL];
-
 		/* root task group has autogrouping, so this doesn't hold */
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		tg_nr_running += tg->cfs_rq[i]->nr_running;
@@ -8763,37 +8929,52 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 #endif
 	}
 
+	if (virt)
+		cpu_cgroup_update_vcpustat(cgrp);
+
+	user = nice = system = idle = iowait = steal = 0;
+
+	for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+		if (!virt && !cpu_possible(i))
+			continue;
+		kcpustat = virt ? tg->vcpustat + i : cpuacct_cpustat(cgrp, i);
+		user += kcpustat->cpustat[CPUTIME_USER];
+		nice += kcpustat->cpustat[CPUTIME_NICE];
+		system += kcpustat->cpustat[CPUTIME_SYSTEM];
+		idle += kcpustat->cpustat[CPUTIME_IDLE];
+		iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
+		steal += kcpustat->cpustat[CPUTIME_STEAL];
+	}
+
 	seq_printf(p, "cpu  %llu %llu %llu %llu %llu 0 0 %llu\n",
 		(unsigned long long)cputime64_to_clock_t(user),
 		(unsigned long long)cputime64_to_clock_t(nice),
 		(unsigned long long)cputime64_to_clock_t(system),
-		(unsigned long long)nsec_to_clock_t(idle),
-		(unsigned long long)nsec_to_clock_t(iowait),
-		(unsigned long long)nsec_to_clock_t(steal));
-
-	for (i = 0; i < nr_ve_vcpus; i++) {
-		user = nice = system = idle = iowait = steal = 0;
-		for_each_online_cpu(j) {
-			if (j % nr_ve_vcpus != i)
-				continue;
-			kcpustat = cpuacct_cpustat(cgrp, j);
-
-			user += kcpustat->cpustat[CPUTIME_USER];
-			nice += kcpustat->cpustat[CPUTIME_NICE];
-			system += kcpustat->cpustat[CPUTIME_SYSTEM];
-			idle += kcpustat->cpustat[CPUTIME_IDLE];
-			iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
-			steal += kcpustat->cpustat[CPUTIME_STEAL];
-		}
+		(unsigned long long)cputime64_to_clock_t(idle),
+		(unsigned long long)cputime64_to_clock_t(iowait),
+		virt ? 0ULL :
+		(unsigned long long)cputime64_to_clock_t(steal));
+
+	for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+		if (!virt && !cpu_online(i))
+			continue;
+		kcpustat = virt ? tg->vcpustat + i : cpuacct_cpustat(cgrp, i);
+		user = kcpustat->cpustat[CPUTIME_USER];
+		nice = kcpustat->cpustat[CPUTIME_NICE];
+		system = kcpustat->cpustat[CPUTIME_SYSTEM];
+		idle = kcpustat->cpustat[CPUTIME_IDLE];
+		iowait = kcpustat->cpustat[CPUTIME_IOWAIT];
+		steal = kcpustat->cpustat[CPUTIME_STEAL];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu 0 0 %llu\n",
 			i,
 			(unsigned long long)cputime64_to_clock_t(user),
 			(unsigned long long)cputime64_to_clock_t(nice),
 			(unsigned long long)cputime64_to_clock_t(system),
-			(unsigned long long)nsec_to_clock_t(idle),
-			(unsigned long long)nsec_to_clock_t(iowait),
-			(unsigned long long)nsec_to_clock_t(steal));
+			(unsigned long long)cputime64_to_clock_t(idle),
+			(unsigned long long)cputime64_to_clock_t(iowait),
+			virt ? 0ULL :
+			(unsigned long long)cputime64_to_clock_t(steal));
 	}
 	seq_printf(p, "intr 0\nswap 0 0\n");
 
@@ -8844,18 +9025,18 @@ int cpu_cgroup_proc_loadavg(struct cgroup *cgrp, struct cftype *cft,
 
 void cpu_cgroup_get_stat(struct cgroup *cgrp, struct kernel_cpustat *kstat)
 {
-	int i, j;
-
-	memset(kstat, 0, sizeof(struct kernel_cpustat));
-
-	for_each_possible_cpu(i) {
-		struct kernel_cpustat *st = cpuacct_cpustat(cgrp, i);
+	struct task_group *tg = cgroup_tg(cgrp);
+	int nr_vcpus = tg->nr_cpus ?: num_online_cpus();
+	int i;
 
+	for_each_possible_cpu(i)
 		cpu_cgroup_update_stat(cgrp, i);
 
-		for (j = 0; j < NR_STATS; j++)
-			kstat->cpustat[j] += st->cpustat[j];
-	}
+	cpu_cgroup_update_vcpustat(cgrp);
+
+	kernel_cpustat_zero(kstat);
+	for (i = 0; i < nr_vcpus; i++)
+		kernel_cpustat_add(tg->vcpustat + i, kstat, kstat);
 }
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f1ff1d..ecac940 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2868,8 +2868,7 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (tsk) {
 			account_scheduler_latency(tsk, delta >> 10, 1);
 			trace_sched_stat_sleep(tsk, delta);
-		} else
-			delta = SCALE_IDLE_TIME(delta, se);
+		}
 
 		se->statistics.sum_sleep_runtime += delta;
 	}
@@ -2904,10 +2903,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 						delta >> 20);
 			}
 			account_scheduler_latency(tsk, delta >> 10, 0);
-		} else {
-			delta = SCALE_IDLE_TIME(delta, se);
+		} else
 			se->statistics.iowait_sum += delta;
-		}
 
 		se->statistics.sum_sleep_runtime += delta;
 	}
@@ -3348,55 +3345,6 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
-static void restart_tg_idle_time_accounting(struct task_group *tg)
-{
-#ifdef CONFIG_SCHEDSTATS
-	int cpu;
-
-	if (tg == &root_task_group)
-		return;
-
-	/*
-	 * XXX: We call enqueue_sleeper/dequeue_sleeper without rq lock for
-	 * the sake of performance, because in the worst case this can only
-	 * lead to an idle/iowait period lost in stats.
-	 */
-	for_each_online_cpu(cpu) {
-		struct sched_entity *se = tg->se[cpu];
-		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
-
-		if (!cfs_rq->load.weight) {
-			enqueue_sleeper(cfs_rq_of(se), se);
-			dequeue_sleeper(cfs_rq_of(se), se);
-		}
-	}
-#endif
-}
-
-void update_cfs_bandwidth_idle_scale(struct cfs_bandwidth *cfs_b)
-{
-	u64 runtime = cfs_b->runtime;
-	u64 quota = cfs_b->quota;
-	u64 max_quota = ktime_to_ns(cfs_b->period) * num_online_cpus();
-	struct task_group *tg =
-		container_of(cfs_b, struct task_group, cfs_bandwidth);
-
-	restart_tg_idle_time_accounting(tg);
-
-	/*
-	 * idle_scale = quota_left / (period * nr_idle_cpus)
-	 * nr_idle_cpus = nr_cpus - nr_busy_cpus
-	 * nr_busy_cpus = (quota - quota_left) / period
-	 */
-	if (quota == RUNTIME_INF || quota >= max_quota)
-		cfs_b->idle_scale_inv = CFS_IDLE_SCALE;
-	else if (runtime)
-		cfs_b->idle_scale_inv = div64_u64(CFS_IDLE_SCALE *
-				(max_quota - quota + runtime), runtime);
-	else
-		cfs_b->idle_scale_inv = 0;
-}
-
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -3408,8 +3356,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
 	u64 now;
 
-	update_cfs_bandwidth_idle_scale(cfs_b);
-
 	if (cfs_b->quota == RUNTIME_INF)
 		return;
 
@@ -3984,7 +3930,6 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
-	cfs_b->idle_scale_inv = CFS_IDLE_SCALE;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -8094,7 +8039,6 @@ static void nr_iowait_dec_fair(struct task_struct *p)
 		se->statistics.block_start = 0;
 		se->statistics.sleep_start = rq->clock;
 
-		delta = SCALE_IDLE_TIME(delta, se);
 		se->statistics.iowait_sum += delta;
 		se->statistics.sum_sleep_runtime += delta;
 	}
@@ -8126,7 +8070,6 @@ static void nr_iowait_inc_fair(struct task_struct *p)
 		se->statistics.sleep_start = 0;
 		se->statistics.block_start = rq->clock;
 
-		delta = SCALE_IDLE_TIME(delta, se);
 		se->statistics.sum_sleep_runtime += delta;
 	}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e4f92a5..c4f513b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -124,9 +124,6 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
-#define CFS_IDLE_SCALE 100
-	u64 idle_scale_inv;
-
 	/* statistics */
 	int nr_periods, nr_throttled;
 	u64 throttled_time;
@@ -171,6 +168,11 @@ struct task_group {
 	unsigned long avenrun[3];	/* loadavg data */
 	struct timespec start_time;
 
+	struct kernel_cpustat *cpustat_last;
+	struct kernel_cpustat *vcpustat;
+	ktime_t vcpustat_last_update;
+	spinlock_t vcpustat_lock;
+
 	struct cfs_bandwidth cfs_bandwidth;
 
 #ifdef CONFIG_CFS_CPULIMIT
@@ -1437,23 +1439,6 @@ static inline u64 default_cfs_period(void)
 {
 	return 100000000ULL;
 }
-
-static inline u64 SCALE_IDLE_TIME(u64 delta, struct sched_entity *se)
-{
-	struct cfs_bandwidth *cfs_b = &se->my_q->tg->cfs_bandwidth;
-	unsigned long idle_scale_inv = cfs_b->idle_scale_inv;
-
-	if (!idle_scale_inv)
-		delta = 0;
-	else if (idle_scale_inv != CFS_IDLE_SCALE)
-		delta = div64_u64(delta * CFS_IDLE_SCALE, idle_scale_inv);
-
-	return delta;
-}
-
-extern void update_cfs_bandwidth_idle_scale(struct cfs_bandwidth *cfs_b);
-#else
-#define SCALE_IDLE_TIME(delta, se) (delta)
 #endif
 
 extern void start_cfs_idle_time_accounting(int cpu);
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 32646b1..bff63b3 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -124,7 +124,7 @@ static int ve_get_cpu_stat(envid_t veid, struct vz_cpu_stat __user *buf)
 	vstat->user_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_USER]);
 	vstat->nice_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_NICE]);
 	vstat->system_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_SYSTEM]);
-	vstat->idle_clk += kstat.cpustat[CPUTIME_IDLE];
+	vstat->idle_clk += cputime_to_usecs(kstat.cpustat[CPUTIME_IDLE]) * NSEC_PER_USEC;
 
 	vstat->uptime_clk = ve_get_uptime(ve);
 
@@ -799,11 +799,12 @@ static int vestat_seq_show(struct seq_file *m, void *v)
 		return ret;
 
 	strv_time = 0;
-	user_ve = kstat.cpustat[CPUTIME_USER];
-	nice_ve = kstat.cpustat[CPUTIME_NICE];
-	system_ve = kstat.cpustat[CPUTIME_SYSTEM];
-	used = kstat.cpustat[CPUTIME_USED];
-	idle_time = kstat.cpustat[CPUTIME_IDLE];
+	user_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_USER]);
+	nice_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_NICE]);
+	system_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_SYSTEM]);
+	used = cputime_to_usecs(kstat.cpustat[CPUTIME_USED]) * NSEC_PER_USEC;
+	idle_time = cputime_to_usecs(kstat.cpustat[CPUTIME_IDLE]) *
+							NSEC_PER_USEC;
 
 	uptime_cycles = ve_get_uptime(ve);
 	uptime = get_jiffies_64() - ve->start_jiffies;


