[Devel] [PATCH rh7 3/3] Port cpustat related patches
Vladimir Davydov
vdavydov at parallels.com
Fri May 29 10:02:28 PDT 2015
This patch ports:
diff-sched-rework-_proc_stat-output
diff-sched-fix-output-of-vestat-idle
diff-sched-make-allowance-for-vcpu-rate-in-_proc_stat
diff-sched-hide-steal-time-from-inside-CT
diff-sched-cpu.proc.stat-always-count-nr_running-and-co-on-all-cpus
Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: rework /proc/stat output
Date: Thu, 29 May 2014 11:40:50 +0400
Initially we mapped the usage percentage of physical cpu i to vcpu
(i % nr_vcpus). Obviously, if a CT is running on physical cpus whose
indices are equal modulo nr_vcpus, we'll miss usage on one or more
vcpus. E.g., if there is a 2-vcpu CT with several cpu-eaters running on
physical cpus 0 and 2, we'll get vcpu 0 200%-busy and vcpu 1 idling.

To fix that, we changed the behavior so that the usage of vcpu i equals
the total cpu usage divided by nr_vcpus. That led to customers'
dissatisfaction, because such an algorithm makes it obvious that the
vcpus are fake.

So, now we're going to use the first algorithm again, but if the usage
of one of the vcpus turns out to be greater than the absolute time
delta, we "move" the usage excess to other vcpus, so that no vcpu ever
consumes more than one pcpu. E.g., in the situation described above,
we'll move 100% of vcpu 0's time to vcpu 1, so that both vcpus will be
100%-busy.

To achieve that, we serialize access to /proc/stat and make readers
update the stats based on the pcpu usage delta, so that they can fix up
per-vcpu usage to be <= 100% and calculate idle time accordingly.
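For illustration, here is a minimal userspace sketch of that two-pass
idea (hypothetical code, not part of the patch; the actual kernel side
is cpu_cgroup_update_vcpustat()/fixup_vcpustat_delta() below): cap each
vcpu's usage delta at the absolute time delta and spill the excess onto
vcpus that still have headroom.

#include <stdio.h>

#define NR_VCPUS 2

int main(void)
{
	/* hypothetical usage deltas gathered from pcpus j, j % NR_VCPUS == i */
	unsigned long long usage[NR_VCPUS] = { 200, 0 };  /* % of the interval */
	unsigned long long max_usage = 100;               /* absolute delta    */
	unsigned long long rem = 0;
	int i;

	/* pass 1: cap overcommitted vcpus and accumulate the excess */
	for (i = 0; i < NR_VCPUS; i++) {
		if (usage[i] > max_usage) {
			rem += usage[i] - max_usage;
			usage[i] = max_usage;
		}
	}

	/* pass 2: distribute the excess among vcpus with headroom */
	for (i = 0; i < NR_VCPUS && rem; i++) {
		unsigned long long room = max_usage - usage[i];
		unsigned long long add = rem < room ? rem : room;

		usage[i] += add;
		rem -= add;
	}

	for (i = 0; i < NR_VCPUS; i++)
		printf("vcpu%d: %llu%% busy, %llu%% idle\n",
		       i, usage[i], max_usage - usage[i]);
	return 0;
}

With the 2-vcpu example above this prints both vcpus as 100% busy
instead of 200%/0%.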
https://jira.sw.ru/browse/PSBM-26714
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================
Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: fix output of vestat:idle
Date: Tue, 22 Jul 2014 12:25:24 +0400
/proc/vz/vestat must report virtualized idle time, but since commit
diff-sched-rework-_proc_stat-output it has shown the total time CTs have
been idling on all physical cpus. This is because in cpu_cgroup_get_stat
we use task_group->cpustat instead of vcpustat. Fix it.
https://jira.sw.ru/browse/PSBM-28403
https://bugzilla.openvz.org/show_bug.cgi?id=3035
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================
Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: make allowance for vcpu rate in /proc/stat
Date: Wed, 13 Aug 2014 15:44:36 +0400
Currently, if cpulimit < cpus * 100 for a CT, we can't get 100% cpu
usage as reported by the CT's /proc/stat. This is because when reporting
cpu usage statistics we consider the maximal cpu time a CT can get to be
equal to cpus * 100, so in the above-mentioned setup there will always
be some idle time. This confuses our customers, because before commit
diff-sched-rework-_proc_stat-output it was possible to get 100% usage
inside a CT irrespective of its cpulimit/cpus setup. So let's fix that
by defining the maximal cpu usage a CT can get to be equal to cpulimit.
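As a rough sketch of the new ceiling (hypothetical userspace code
assuming the cpulimit/cpu_rate is given in percent; the in-kernel
variant in cpu_cgroup_update_vcpustat() uses MAX_CPU_RATE and cputime
units):

#include <stdio.h>

int main(void)
{
	unsigned long long delta = 1000;  /* wall-clock delta, in ticks   */
	unsigned int cpu_rate = 100;      /* cpulimit: 100% of one cpu    */
	unsigned int nr_vcpus = 2;        /* CT configured with 2 vcpus   */
	/* per-vcpu share of the limit, rounded up as with DIV_ROUND_UP() */
	unsigned int vcpu_rate = (cpu_rate + nr_vcpus - 1) / nr_vcpus;
	/* per-vcpu ceiling: was the full delta, now scaled by the limit  */
	unsigned long long max_usage = delta * vcpu_rate / 100;

	printf("per-vcpu ceiling: %llu of %llu ticks\n", max_usage, delta);
	return 0;
}

Summed over both vcpus the ceiling equals delta * cpulimit / 100, so a
CT that consumes its whole cpulimit now reports 100% usage with no
residual idle time.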
https://jira.sw.ru/browse/PSBM-28500
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================
Author: Kirill Tkhai
Email: ktkhai at parallels.com
Subject: sched: Hide steal time from inside CT
Date: Tue, 6 May 2014 18:27:40 +0400
https://jira.sw.ru/browse/PSBM-26587
From the bug report by Khorenko Konstantin:
"FastVPS complains on incorrect idle time calculation
(PSBM-23431) and about high _steal_ time reported inside a CT.
Steal time is a time when a CT was ready to run on a physical CPU,
but the CPU was busy with processes which belong to another CT.
=> in case we have 10 CTs which eat cpu time as much as possible,
then steal time in each CT will be 90% and execution time 10% only.
I suggest that steal time always be shown as 0 inside Containers in
order not to confuse end users. At the same time steal time for
Containers should be visible from the HN (host), this is useful for
development.
No objections from either Pasha Emelyanov or Maik".
So, we do this so as not to scare users.
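The numbers in the quote follow directly from fair sharing: with N
CPU-bound containers competing for the same cpus, each gets 1/N of the
time and the remaining (N-1)/N used to show up as steal. A trivial
illustration (hypothetical code, not part of the patch):

#include <stdio.h>

int main(void)
{
	int nr_ct = 10;                        /* CPU-bound containers       */
	double exec_pct = 100.0 / nr_ct;       /* fair share of cpu time     */
	double steal_pct = 100.0 - exec_pct;   /* what a CT used to report   */

	printf("each CT: %.0f%% exec, %.0f%% steal (now shown as 0 inside the CT)\n",
	       exec_pct, steal_pct);
	return 0;
}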
Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
Acked-by: Vladimir Davydov <vdavydov at parallels.com>
=============================================================================
Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: cpu.proc.stat: always count nr_running and co on all cpus
Date: Wed, 30 Jul 2014 13:32:20 +0400
Currently we count them only on cpus 0..nr_vcpus, which is obviously
wrong, because those numbers are kept in a per-pcpu structure.
https://jira.sw.ru/browse/PSBM-28277
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
=============================================================================
Related to https://jira.sw.ru/browse/PSBM-33642
Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
fs/proc/uptime.c | 2 +-
include/linux/kernel_stat.h | 36 +++++
kernel/sched/core.c | 313 ++++++++++++++++++++++++++++++++++----------
kernel/sched/fair.c | 61 +--------
kernel/sched/sched.h | 25 +---
kernel/ve/vecalls.c | 13 +-
6 files changed, 298 insertions(+), 152 deletions(-)
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index dd80f1bf831d..3c49c19960b2 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -30,7 +30,7 @@ static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
struct kernel_cpustat kstat;
cpu_cgroup_get_stat(cgrp, &kstat);
- *idle = ns_to_timespec(kstat.cpustat[CPUTIME_IDLE]);
+ cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
}
static int uptime_proc_show(struct seq_file *m, void *v)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index d105ab3f0b8b..0086f43d5590 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -36,6 +36,42 @@ struct kernel_cpustat {
u64 cpustat[NR_STATS];
};
+static inline u64 kernel_cpustat_total_usage(const struct kernel_cpustat *p)
+{
+ return p->cpustat[CPUTIME_USER] + p->cpustat[CPUTIME_NICE] +
+ p->cpustat[CPUTIME_SYSTEM];
+}
+
+static inline u64 kernel_cpustat_total_idle(const struct kernel_cpustat *p)
+{
+ return p->cpustat[CPUTIME_IDLE] + p->cpustat[CPUTIME_IOWAIT];
+}
+
+static inline void kernel_cpustat_zero(struct kernel_cpustat *p)
+{
+ memset(p, 0, sizeof(*p));
+}
+
+static inline void kernel_cpustat_add(const struct kernel_cpustat *lhs,
+ const struct kernel_cpustat *rhs,
+ struct kernel_cpustat *res)
+{
+ int i;
+
+ for (i = 0; i < NR_STATS; i++)
+ res->cpustat[i] = lhs->cpustat[i] + rhs->cpustat[i];
+}
+
+static inline void kernel_cpustat_sub(const struct kernel_cpustat *lhs,
+ const struct kernel_cpustat *rhs,
+ struct kernel_cpustat *res)
+{
+ int i;
+
+ for (i = 0; i < NR_STATS; i++)
+ res->cpustat[i] = lhs->cpustat[i] - rhs->cpustat[i];
+}
+
struct kernel_stat {
#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d562f6430c67..5a38d1aa027e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7695,6 +7695,8 @@ static void free_sched_group(struct task_group *tg)
free_rt_sched_group(tg);
autogroup_free(tg);
free_percpu(tg->taskstats);
+ kfree(tg->cpustat_last);
+ kfree(tg->vcpustat);
kfree(tg);
}
@@ -7717,6 +7719,19 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!tg->taskstats)
goto err;
+ tg->cpustat_last = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+ GFP_KERNEL);
+ if (!tg->cpustat_last)
+ goto err;
+
+ tg->vcpustat = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+ GFP_KERNEL);
+ if (!tg->vcpustat)
+ goto err;
+
+ tg->vcpustat_last_update = ktime_set(0, 0);
+ spin_lock_init(&tg->vcpustat_lock);
+
/* start_timespec is saved CT0 uptime */
do_posix_clock_monotonic_gettime(&tg->start_time);
monotonic_to_bootbased(&tg->start_time);
@@ -8333,7 +8348,6 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
cfs_b->quota = quota;
__refill_cfs_bandwidth_runtime(cfs_b);
- update_cfs_bandwidth_idle_scale(cfs_b);
/* restart the period timer (if active) to handle new period expiry */
if (runtime_enabled && cfs_b->timer_active) {
/* force a reprogram */
@@ -8661,19 +8675,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
}
#endif /* CONFIG_RT_GROUP_SCHED */
-static u64 cpu_cgroup_usage_cpu(struct task_group *tg, int i)
-{
-#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
- /* root_task_group has not sched entities */
- if (tg == &root_task_group)
- return cpu_rq(i)->rq_cpu_time;
-
- return tg->se[i]->sum_exec_runtime;
-#else
- return 0;
-#endif
-}
-
static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
{
#if defined(CONFIG_SCHEDSTATS) && defined(CONFIG_FAIR_GROUP_SCHED)
@@ -8681,7 +8682,7 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
struct sched_entity *se = tg->se[i];
struct kernel_cpustat *kcpustat = cpuacct_cpustat(cgrp, i);
u64 now = cpu_clock(i);
- u64 delta, idle, iowait;
+ u64 delta, idle, iowait, steal, used;
/* root_task_group has not sched entities */
if (tg == &root_task_group)
@@ -8689,7 +8690,8 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
iowait = se->statistics.iowait_sum;
idle = se->statistics.sum_sleep_runtime;
- kcpustat->cpustat[CPUTIME_STEAL] = se->statistics.wait_sum;
+ steal = se->statistics.wait_sum;
+ used = se->sum_exec_runtime;
if (idle > iowait)
idle -= iowait;
@@ -8699,35 +8701,210 @@ static void cpu_cgroup_update_stat(struct cgroup *cgrp, int i)
if (se->statistics.sleep_start) {
delta = now - se->statistics.sleep_start;
if ((s64)delta > 0)
- idle += SCALE_IDLE_TIME(delta, se);
+ idle += delta;
} else if (se->statistics.block_start) {
delta = now - se->statistics.block_start;
if ((s64)delta > 0)
- iowait += SCALE_IDLE_TIME(delta, se);
+ iowait += delta;
} else if (se->statistics.wait_start) {
delta = now - se->statistics.wait_start;
if ((s64)delta > 0)
- kcpustat->cpustat[CPUTIME_STEAL] += delta;
+ steal += delta;
}
kcpustat->cpustat[CPUTIME_IDLE] =
- max(kcpustat->cpustat[CPUTIME_IDLE], idle);
+ max(kcpustat->cpustat[CPUTIME_IDLE],
+ nsecs_to_cputime(idle));
kcpustat->cpustat[CPUTIME_IOWAIT] =
- max(kcpustat->cpustat[CPUTIME_IOWAIT], iowait);
-
- kcpustat->cpustat[CPUTIME_USED] = cpu_cgroup_usage_cpu(tg, i);
+ max(kcpustat->cpustat[CPUTIME_IOWAIT],
+ nsecs_to_cputime(iowait));
+ kcpustat->cpustat[CPUTIME_STEAL] = nsecs_to_cputime(steal);
+ kcpustat->cpustat[CPUTIME_USED] = nsecs_to_cputime(used);
#endif
}
+static void fixup_vcpustat_delta_usage(struct kernel_cpustat *cur,
+ struct kernel_cpustat *rem, int ind,
+ u64 cur_usage, u64 target_usage,
+ u64 rem_usage)
+{
+ s64 scaled_val;
+ u32 scale_pct = 0;
+
+ /* distribute the delta among USER, NICE, and SYSTEM proportionally */
+ if (cur_usage < target_usage) {
+ if ((s64)rem_usage > 0) /* sanity check to avoid div/0 */
+ scale_pct = div64_u64(100 * rem->cpustat[ind],
+ rem_usage);
+ } else {
+ if ((s64)cur_usage > 0) /* sanity check to avoid div/0 */
+ scale_pct = div64_u64(100 * cur->cpustat[ind],
+ cur_usage);
+ }
+
+ scaled_val = div_s64(scale_pct * (target_usage - cur_usage), 100);
+
+ cur->cpustat[ind] += scaled_val;
+ if ((s64)cur->cpustat[ind] < 0)
+ cur->cpustat[ind] = 0;
+
+ rem->cpustat[ind] -= scaled_val;
+ if ((s64)rem->cpustat[ind] < 0)
+ rem->cpustat[ind] = 0;
+}
+
+static void calc_vcpustat_delta_idle(struct kernel_cpustat *cur,
+ int ind, u64 cur_idle, u64 target_idle)
+{
+ /* distribute target_idle between IDLE and IOWAIT proportionally to
+ * what we initially had on this vcpu */
+ if ((s64)cur_idle > 0) {
+ u32 scale_pct = div64_u64(100 * cur->cpustat[ind], cur_idle);
+ cur->cpustat[ind] = div_u64(scale_pct * target_idle, 100);
+ } else {
+ cur->cpustat[ind] = ind == CPUTIME_IDLE ? target_idle : 0;
+ }
+}
+
+static void fixup_vcpustat_delta(struct kernel_cpustat *cur,
+ struct kernel_cpustat *rem,
+ u64 max_usage)
+{
+ u64 cur_usage, target_usage, rem_usage;
+ u64 cur_idle, target_idle;
+
+ cur_usage = kernel_cpustat_total_usage(cur);
+ rem_usage = kernel_cpustat_total_usage(rem);
+
+ target_usage = min(cur_usage + rem_usage,
+ max_usage);
+
+ if (cur_usage != target_usage) {
+ fixup_vcpustat_delta_usage(cur, rem, CPUTIME_USER,
+ cur_usage, target_usage, rem_usage);
+ fixup_vcpustat_delta_usage(cur, rem, CPUTIME_NICE,
+ cur_usage, target_usage, rem_usage);
+ fixup_vcpustat_delta_usage(cur, rem, CPUTIME_SYSTEM,
+ cur_usage, target_usage, rem_usage);
+ }
+
+ cur_idle = kernel_cpustat_total_idle(cur);
+ target_idle = max_usage - target_usage;
+
+ if (cur_idle != target_idle) {
+ calc_vcpustat_delta_idle(cur, CPUTIME_IDLE,
+ cur_idle, target_idle);
+ calc_vcpustat_delta_idle(cur, CPUTIME_IOWAIT,
+ cur_idle, target_idle);
+ }
+
+ cur->cpustat[CPUTIME_USED] = target_usage;
+
+ /* do not show steal time inside ve */
+ cur->cpustat[CPUTIME_STEAL] = 0;
+}
+
+static void cpu_cgroup_update_vcpustat(struct cgroup *cgrp)
+{
+ int i, j;
+ int nr_vcpus;
+ int vcpu_rate;
+ ktime_t now;
+ u64 abs_delta_ns, max_usage;
+ struct kernel_cpustat stat_delta, stat_rem;
+ struct task_group *tg = cgroup_tg(cgrp);
+ int first_pass = 1;
+
+ spin_lock(&tg->vcpustat_lock);
+
+ now = ktime_get();
+ nr_vcpus = tg->nr_cpus ?: num_online_cpus();
+ vcpu_rate = DIV_ROUND_UP(tg->cpu_rate, nr_vcpus);
+ if (!vcpu_rate || vcpu_rate > MAX_CPU_RATE)
+ vcpu_rate = MAX_CPU_RATE;
+
+ if (!ktime_to_ns(tg->vcpustat_last_update)) {
+ /* on the first read initialize vcpu i stat as a sum of stats
+ * over pcpus j such that j % nr_vcpus == i */
+ for (i = 0; i < nr_vcpus; i++) {
+ for (j = i; j < nr_cpu_ids; j += nr_vcpus) {
+ if (!cpu_possible(j))
+ continue;
+ kernel_cpustat_add(tg->vcpustat + i,
+ cpuacct_cpustat(cgrp, j),
+ tg->vcpustat + i);
+ }
+ }
+ goto out_update_last;
+ }
+
+ abs_delta_ns = ktime_to_ns(ktime_sub(now, tg->vcpustat_last_update));
+ max_usage = nsecs_to_cputime(abs_delta_ns);
+ max_usage = div_u64(max_usage * vcpu_rate, MAX_CPU_RATE);
+ /* don't allow to update stats too often to avoid calculation errors */
+ if (max_usage < 10)
+ goto out_unlock;
+
+ /* temporarily copy per cpu usage delta to tg->cpustat_last */
+ for_each_possible_cpu(i)
+ kernel_cpustat_sub(cpuacct_cpustat(cgrp, i),
+ tg->cpustat_last + i,
+ tg->cpustat_last + i);
+
+ /* proceed to calculating per vcpu delta */
+ kernel_cpustat_zero(&stat_rem);
+
+again:
+ for (i = 0; i < nr_vcpus; i++) {
+ int exceeds_max;
+
+ kernel_cpustat_zero(&stat_delta);
+ for (j = i; j < nr_cpu_ids; j += nr_vcpus) {
+ if (!cpu_possible(j))
+ continue;
+ kernel_cpustat_add(&stat_delta,
+ tg->cpustat_last + j, &stat_delta);
+ }
+
+ exceeds_max = kernel_cpustat_total_usage(&stat_delta) >=
+ max_usage;
+ /*
+ * On the first pass calculate delta for vcpus with usage >
+ * max_usage in order to accumulate excess in stat_rem.
+ *
+ * Once the remainder is accumulated, proceed to the rest of
+ * vcpus so that it will be distributed among them.
+ */
+ if (exceeds_max != first_pass)
+ continue;
+
+ fixup_vcpustat_delta(&stat_delta, &stat_rem, max_usage);
+ kernel_cpustat_add(tg->vcpustat + i, &stat_delta,
+ tg->vcpustat + i);
+ }
+
+ if (first_pass) {
+ first_pass = 0;
+ goto again;
+ }
+out_update_last:
+ for_each_possible_cpu(i)
+ tg->cpustat_last[i] = *cpuacct_cpustat(cgrp, i);
+ tg->vcpustat_last_update = now;
+out_unlock:
+ spin_unlock(&tg->vcpustat_lock);
+}
+
int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *p)
{
- int i, j;
- unsigned int nr_ve_vcpus = num_online_vcpus();
+ int i;
unsigned long jif;
u64 user, nice, system, idle, iowait, steal;
struct timespec boottime;
struct task_group *tg = cgroup_tg(cgrp);
+ bool virt = !ve_is_super(get_exec_env()) && tg != &root_task_group;
+ int nr_vcpus = tg->nr_cpus ?: num_online_cpus();
struct kernel_cpustat *kcpustat;
unsigned long tg_nr_running = 0;
unsigned long tg_nr_iowait = 0;
@@ -8737,20 +8914,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
getboottime(&boottime);
jif = boottime.tv_sec + tg->start_time.tv_sec;
- user = nice = system = idle = iowait = steal = 0;
-
for_each_possible_cpu(i) {
- kcpustat = cpuacct_cpustat(cgrp, i);
-
cpu_cgroup_update_stat(cgrp, i);
- user += kcpustat->cpustat[CPUTIME_USER];
- nice += kcpustat->cpustat[CPUTIME_NICE];
- system += kcpustat->cpustat[CPUTIME_SYSTEM];
- idle += kcpustat->cpustat[CPUTIME_IDLE];
- iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
- steal += kcpustat->cpustat[CPUTIME_STEAL];
-
/* root task group has autogrouping, so this doesn't hold */
#ifdef CONFIG_FAIR_GROUP_SCHED
tg_nr_running += tg->cfs_rq[i]->nr_running;
@@ -8763,37 +8929,52 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
#endif
}
+ if (virt)
+ cpu_cgroup_update_vcpustat(cgrp);
+
+ user = nice = system = idle = iowait = steal = 0;
+
+ for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+ if (!virt && !cpu_possible(i))
+ continue;
+ kcpustat = virt ? tg->vcpustat + i : cpuacct_cpustat(cgrp, i);
+ user += kcpustat->cpustat[CPUTIME_USER];
+ nice += kcpustat->cpustat[CPUTIME_NICE];
+ system += kcpustat->cpustat[CPUTIME_SYSTEM];
+ idle += kcpustat->cpustat[CPUTIME_IDLE];
+ iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
+ steal += kcpustat->cpustat[CPUTIME_STEAL];
+ }
+
seq_printf(p, "cpu %llu %llu %llu %llu %llu 0 0 %llu\n",
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
(unsigned long long)cputime64_to_clock_t(system),
- (unsigned long long)nsec_to_clock_t(idle),
- (unsigned long long)nsec_to_clock_t(iowait),
- (unsigned long long)nsec_to_clock_t(steal));
-
- for (i = 0; i < nr_ve_vcpus; i++) {
- user = nice = system = idle = iowait = steal = 0;
- for_each_online_cpu(j) {
- if (j % nr_ve_vcpus != i)
- continue;
- kcpustat = cpuacct_cpustat(cgrp, j);
-
- user += kcpustat->cpustat[CPUTIME_USER];
- nice += kcpustat->cpustat[CPUTIME_NICE];
- system += kcpustat->cpustat[CPUTIME_SYSTEM];
- idle += kcpustat->cpustat[CPUTIME_IDLE];
- iowait += kcpustat->cpustat[CPUTIME_IOWAIT];
- steal += kcpustat->cpustat[CPUTIME_STEAL];
- }
+ (unsigned long long)cputime64_to_clock_t(idle),
+ (unsigned long long)cputime64_to_clock_t(iowait),
+ virt ? 0ULL :
+ (unsigned long long)cputime64_to_clock_t(steal));
+
+ for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+ if (!virt && !cpu_online(i))
+ continue;
+ kcpustat = virt ? tg->vcpustat + i : cpuacct_cpustat(cgrp, i);
+ user = kcpustat->cpustat[CPUTIME_USER];
+ nice = kcpustat->cpustat[CPUTIME_NICE];
+ system = kcpustat->cpustat[CPUTIME_SYSTEM];
+ idle = kcpustat->cpustat[CPUTIME_IDLE];
+ iowait = kcpustat->cpustat[CPUTIME_IOWAIT];
+ steal = kcpustat->cpustat[CPUTIME_STEAL];
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu 0 0 %llu\n",
i,
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
(unsigned long long)cputime64_to_clock_t(system),
- (unsigned long long)nsec_to_clock_t(idle),
- (unsigned long long)nsec_to_clock_t(iowait),
- (unsigned long long)nsec_to_clock_t(steal));
+ (unsigned long long)cputime64_to_clock_t(idle),
+ (unsigned long long)cputime64_to_clock_t(iowait),
+ virt ? 0ULL :
+ (unsigned long long)cputime64_to_clock_t(steal));
}
seq_printf(p, "intr 0\nswap 0 0\n");
@@ -8844,18 +9025,18 @@ int cpu_cgroup_proc_loadavg(struct cgroup *cgrp, struct cftype *cft,
void cpu_cgroup_get_stat(struct cgroup *cgrp, struct kernel_cpustat *kstat)
{
- int i, j;
-
- memset(kstat, 0, sizeof(struct kernel_cpustat));
-
- for_each_possible_cpu(i) {
- struct kernel_cpustat *st = cpuacct_cpustat(cgrp, i);
+ struct task_group *tg = cgroup_tg(cgrp);
+ int nr_vcpus = tg->nr_cpus ?: num_online_cpus();
+ int i;
+ for_each_possible_cpu(i)
cpu_cgroup_update_stat(cgrp, i);
- for (j = 0; j < NR_STATS; j++)
- kstat->cpustat[j] += st->cpustat[j];
- }
+ cpu_cgroup_update_vcpustat(cgrp);
+
+ kernel_cpustat_zero(kstat);
+ for (i = 0; i < nr_vcpus; i++)
+ kernel_cpustat_add(tg->vcpustat + i, kstat, kstat);
}
int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f1ff1da8a29..ecac940ddebd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2868,8 +2868,7 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (tsk) {
account_scheduler_latency(tsk, delta >> 10, 1);
trace_sched_stat_sleep(tsk, delta);
- } else
- delta = SCALE_IDLE_TIME(delta, se);
+ }
se->statistics.sum_sleep_runtime += delta;
}
@@ -2904,10 +2903,8 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
delta >> 20);
}
account_scheduler_latency(tsk, delta >> 10, 0);
- } else {
- delta = SCALE_IDLE_TIME(delta, se);
+ } else
se->statistics.iowait_sum += delta;
- }
se->statistics.sum_sleep_runtime += delta;
}
@@ -3348,55 +3345,6 @@ static inline u64 sched_cfs_bandwidth_slice(void)
return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}
-static void restart_tg_idle_time_accounting(struct task_group *tg)
-{
-#ifdef CONFIG_SCHEDSTATS
- int cpu;
-
- if (tg == &root_task_group)
- return;
-
- /*
- * XXX: We call enqueue_sleeper/dequeue_sleeper without rq lock for
- * the sake of performance, because in the worst case this can only
- * lead to an idle/iowait period lost in stats.
- */
- for_each_online_cpu(cpu) {
- struct sched_entity *se = tg->se[cpu];
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
-
- if (!cfs_rq->load.weight) {
- enqueue_sleeper(cfs_rq_of(se), se);
- dequeue_sleeper(cfs_rq_of(se), se);
- }
- }
-#endif
-}
-
-void update_cfs_bandwidth_idle_scale(struct cfs_bandwidth *cfs_b)
-{
- u64 runtime = cfs_b->runtime;
- u64 quota = cfs_b->quota;
- u64 max_quota = ktime_to_ns(cfs_b->period) * num_online_cpus();
- struct task_group *tg =
- container_of(cfs_b, struct task_group, cfs_bandwidth);
-
- restart_tg_idle_time_accounting(tg);
-
- /*
- * idle_scale = quota_left / (period * nr_idle_cpus)
- * nr_idle_cpus = nr_cpus - nr_busy_cpus
- * nr_busy_cpus = (quota - quota_left) / period
- */
- if (quota == RUNTIME_INF || quota >= max_quota)
- cfs_b->idle_scale_inv = CFS_IDLE_SCALE;
- else if (runtime)
- cfs_b->idle_scale_inv = div64_u64(CFS_IDLE_SCALE *
- (max_quota - quota + runtime), runtime);
- else
- cfs_b->idle_scale_inv = 0;
-}
-
/*
* Replenish runtime according to assigned quota and update expiration time.
* We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -3408,8 +3356,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
u64 now;
- update_cfs_bandwidth_idle_scale(cfs_b);
-
if (cfs_b->quota == RUNTIME_INF)
return;
@@ -3984,7 +3930,6 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
- cfs_b->idle_scale_inv = CFS_IDLE_SCALE;
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -8094,7 +8039,6 @@ static void nr_iowait_dec_fair(struct task_struct *p)
se->statistics.block_start = 0;
se->statistics.sleep_start = rq->clock;
- delta = SCALE_IDLE_TIME(delta, se);
se->statistics.iowait_sum += delta;
se->statistics.sum_sleep_runtime += delta;
}
@@ -8126,7 +8070,6 @@ static void nr_iowait_inc_fair(struct task_struct *p)
se->statistics.sleep_start = 0;
se->statistics.block_start = rq->clock;
- delta = SCALE_IDLE_TIME(delta, se);
se->statistics.sum_sleep_runtime += delta;
}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e4f92a552d75..c4f513bc668b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -124,9 +124,6 @@ struct cfs_bandwidth {
struct hrtimer period_timer, slack_timer;
struct list_head throttled_cfs_rq;
-#define CFS_IDLE_SCALE 100
- u64 idle_scale_inv;
-
/* statistics */
int nr_periods, nr_throttled;
u64 throttled_time;
@@ -171,6 +168,11 @@ struct task_group {
unsigned long avenrun[3]; /* loadavg data */
struct timespec start_time;
+ struct kernel_cpustat *cpustat_last;
+ struct kernel_cpustat *vcpustat;
+ ktime_t vcpustat_last_update;
+ spinlock_t vcpustat_lock;
+
struct cfs_bandwidth cfs_bandwidth;
#ifdef CONFIG_CFS_CPULIMIT
@@ -1437,23 +1439,6 @@ static inline u64 default_cfs_period(void)
{
return 100000000ULL;
}
-
-static inline u64 SCALE_IDLE_TIME(u64 delta, struct sched_entity *se)
-{
- struct cfs_bandwidth *cfs_b = &se->my_q->tg->cfs_bandwidth;
- unsigned long idle_scale_inv = cfs_b->idle_scale_inv;
-
- if (!idle_scale_inv)
- delta = 0;
- else if (idle_scale_inv != CFS_IDLE_SCALE)
- delta = div64_u64(delta * CFS_IDLE_SCALE, idle_scale_inv);
-
- return delta;
-}
-
-extern void update_cfs_bandwidth_idle_scale(struct cfs_bandwidth *cfs_b);
-#else
-#define SCALE_IDLE_TIME(delta, se) (delta)
#endif
extern void start_cfs_idle_time_accounting(int cpu);
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 63c2ee0fd47d..02f57433e12d 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -124,7 +124,7 @@ static int ve_get_cpu_stat(envid_t veid, struct vz_cpu_stat __user *buf)
vstat->user_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_USER]);
vstat->nice_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_NICE]);
vstat->system_jif += (unsigned long)cputime64_to_clock_t(kstat.cpustat[CPUTIME_SYSTEM]);
- vstat->idle_clk += kstat.cpustat[CPUTIME_IDLE];
+ vstat->idle_clk += cputime_to_usecs(kstat.cpustat[CPUTIME_IDLE]) * NSEC_PER_USEC;
vstat->uptime_clk = ve_get_uptime(ve);
@@ -799,11 +799,12 @@ static int vestat_seq_show(struct seq_file *m, void *v)
return ret;
strv_time = 0;
- user_ve = kstat.cpustat[CPUTIME_USER];
- nice_ve = kstat.cpustat[CPUTIME_NICE];
- system_ve = kstat.cpustat[CPUTIME_SYSTEM];
- used = kstat.cpustat[CPUTIME_USED];
- idle_time = kstat.cpustat[CPUTIME_IDLE];
+ user_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_USER]);
+ nice_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_NICE]);
+ system_ve = cputime_to_jiffies(kstat.cpustat[CPUTIME_SYSTEM]);
+ used = cputime_to_usecs(kstat.cpustat[CPUTIME_USED]) * NSEC_PER_USEC;
+ idle_time = cputime_to_usecs(kstat.cpustat[CPUTIME_IDLE]) *
+ NSEC_PER_USEC;
uptime_cycles = ve_get_uptime(ve);
uptime = get_jiffies_64() - ve->start_jiffies;
--
2.1.4