[Devel] [PATCH rh8 5/8] ve/proc/stat: Introduce /proc/stat virtualized handler for Containers

Konstantin Khorenko khorenko at virtuozzo.com
Wed Oct 28 18:57:34 MSK 2020


vz8 rebase notes:
  * "swap 0 0" line has been dropped
  * extra empty line between "intr" and "ctxt" has been dropped

Known issues:
 - it's known that "procs_blocked" is shown incorrectly inside a CT
   TODO: fix it

Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

Commit messages of related patches in vz7:
==========================================
sched: use cpuacct->cpustat for showing cpu stats

In contrast to RH6, where tg->cpustat was used, cpu stats are now
accounted in the cpuacct cgroup. So zap tg->cpustat and use
cpuacct->cpustat for showing cpu.proc.stat instead. Fortunately, the
cpu and cpuacct cgroups are always mounted together (even by systemd
by default), so this will work.
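
For reference, the per-cpu accessor used to read cpuacct->cpustat is
trivial; it appears as context in the diff below:

struct kernel_cpustat *cpuacct_cpustat(struct cgroup_subsys_state *css, int cpu)
{
	/* css_ca() maps the cpuacct css to its struct cpuacct */
	return per_cpu_ptr(css_ca(css)->cpustat, cpu);
}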

Related to https://jira.sw.ru/browse/PSBM-33642

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

+++
Use ve init task's css instead of opening cgroup via vfs

Currently, whenever we need to get the cpu or devices cgroup
corresponding to a ve, we open it using cgroup_kernel_open(). This is
inflexible, because it relies on the fact that all container cgroups
are located at a specific location which can never change (at the top
level). Since we want to move container cgroups to machine.slice, we
need to rework this.

This patch does the trick. It makes each ve remember its init task at
container start and use the css corresponding to the init task
whenever we need to get a corresponding cgroup. Note that after this
patch is applied, we don't need to mount the cpu and devices cgroups
in the kernel.
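
A minimal sketch of the idea (an illustration only; ve->init_task and
the css_tg() helper are assumed names, not necessarily the exact ones
used by the patch):

static struct task_group *ve_cpu_tg(struct ve_struct *ve)
{
	struct cgroup_subsys_state *css;
	struct task_group *tg;

	rcu_read_lock();
	/* cpu controller css of the cgroup the CT's init task lives in */
	css = task_css(ve->init_task, cpu_cgrp_id);
	tg = css_tg(css);
	rcu_read_unlock();

	/* tg lifetime is tied to the ve's cgroups; kept simple here */
	return tg;
}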

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov <vdavydov at virtuozzo.com>

+++
ve/cpustat: don't try to update vcpustats for root_task_group

root_task_group doesn't have vcpu stats. Attempt to update them leads
to NULL-ptr deref:

        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810b440c>] cpu_cgroup_update_vcpustat+0x13c/0x620
        ...
        Call Trace:
         [<ffffffff810bee3b>] cpu_cgroup_get_stat+0x7b/0x180
         [<ffffffff810f1ef7>] ve_get_cpu_stat+0x27/0x70
         [<ffffffffa01836a1>] fill_cpu_stat+0x91/0x1e0 [vzmon]
         [<ffffffffa0183c6b>] vzcalls_ioctl+0x2bb/0x430 [vzmon]
         [<ffffffffa018d0d5>] vzctl_ioctl+0x45/0x60 [vzdev]
         [<ffffffff8120cfb5>] do_vfs_ioctl+0x255/0x4f0
         [<ffffffff8120d2a4>] SyS_ioctl+0x54/0xa0
         [<ffffffff81642ac9>] system_call_fastpath+0x16/0x1b

So, return -ENOENT if we asked for vcpu stats of root_task_group.
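
The fix boils down to a guard at the top of cpu_cgroup_get_stat(); a
sketch with an assumed signature:

int cpu_cgroup_get_stat(struct cgroup_subsys_state *cpu_css,
			struct kernel_cpustat *kstat)
{
	struct task_group *tg = css_tg(cpu_css);

	/* root_task_group has no vcpu stats, nothing to report */
	if (tg == &root_task_group)
		return -ENOENT;

	/* ... aggregate tg->vcpustat into *kstat as before ... */
	return 0;
}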

https://jira.sw.ru/browse/PSBM-48721

Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
Reviewed-by: Vladimir Davydov <vdavydov at virtuozzo.com>

+++
sched: Port cpustat related patches

This patch ports:

diff-sched-rework-_proc_stat-output
diff-sched-fix-output-of-vestat-idle
diff-sched-make-allowance-for-vcpu-rate-in-_proc_stat
diff-sched-hide-steal-time-from-inside-CT
diff-sched-cpu.proc.stat-always-count-nr_running-and-co-on-all-cpus

Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: rework /proc/stat output
Date: Thu, 29 May 2014 11:40:50 +0400

Initially we mapped the usage pct on physical cpu i to vcpu
(i % nr_vcpus). Obviously, if a CT is running on physical cpus that
are equal modulo nr_vcpus, we'll miss usage on one or more vcpus.
F.e., if there is a 2-vcpu CT with several cpu-eaters running on
physical cpus 0 and 2, we'll get vcpu 0 200%-busy and vcpu 1 idling.

To fix that, we changed the behavior so that vcpu i usage equals the
total cpu usage divided by nr_vcpus. That led to customers'
dissatisfaction, because such an algorithm reveals that the vcpus are
not real cpus.

So, now we're going to use the first algorithm, but if the usage of
one of the vcpus turns out to be greater than the absolute time delta,
we'll "move" the usage excess to other vcpus, so that one vcpu will
never consume more than one pcpu. F.e., in the situation described
above, we'll move 100% of vcpu 0's time to vcpu 1, so that both vcpus
will be 100%-busy.

To achieve that, we serialize access to /proc/stat and make readers
update the stats based on the pcpu usage delta, so that they can fix
up per-vcpu usage to be <= 100% and calculate idle time accordingly.
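
The fixup arithmetic can be illustrated with a small userspace program
(a toy model of the algorithm above, not kernel code):

#include <stdio.h>

#define NR_PCPUS 4
#define NR_VCPUS 2

int main(void)
{
	/* per-pcpu usage delta over a 1000-tick interval (2 cpu-eaters) */
	unsigned long long pcpu_used[NR_PCPUS] = { 1000, 0, 1000, 0 };
	unsigned long long vcpu_used[NR_VCPUS] = { 0 };
	unsigned long long delta = 1000, excess = 0;
	int i;

	/* naive mapping: pcpu i -> vcpu (i % NR_VCPUS) */
	for (i = 0; i < NR_PCPUS; i++)
		vcpu_used[i % NR_VCPUS] += pcpu_used[i];

	/* collect everything above 100% of a single vcpu */
	for (i = 0; i < NR_VCPUS; i++) {
		if (vcpu_used[i] > delta) {
			excess += vcpu_used[i] - delta;
			vcpu_used[i] = delta;
		}
	}

	/* hand the excess to vcpus that still have idle time */
	for (i = 0; i < NR_VCPUS && excess; i++) {
		unsigned long long room = delta - vcpu_used[i];
		unsigned long long move = excess < room ? excess : room;

		vcpu_used[i] += move;
		excess -= move;
	}

	/* prints both vcpus 100%-busy for the example above */
	for (i = 0; i < NR_VCPUS; i++)
		printf("vcpu %d: used %llu, idle %llu\n",
		       i, vcpu_used[i], delta - vcpu_used[i]);
	return 0;
}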

https://jira.sw.ru/browse/PSBM-26714

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================

Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: fix output of vestat:idle
Date: Tue, 22 Jul 2014 12:25:24 +0400

/proc/vz/vestat must report virtualized idle time, but since commit
diff-sched-rework-_proc_stat-output it shows the total time CTs have
been idling on all physical cpus. This is because in
cpu_cgroup_get_stat we use task_group->cpustat instead of vcpustat.
Fix it.

https://jira.sw.ru/browse/PSBM-28403
https://bugzilla.openvz.org/show_bug.cgi?id=3035

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================

Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: make allowance for vcpu rate in /proc/stat
Date: Wed, 13 Aug 2014 15:44:36 +0400

Currently, if cpulimit < cpus * 100 for a CT, we can't get 100% cpu
usage as reported by the CT's /proc/stat. This is because when
reporting cpu usage statistics we consider the maximal cpu time a CT
can get to be equal to cpus * 100, so that in the above-mentioned
setup there will always be idle time. It confuses our customers,
because before commit diff-sched-rework-_proc_stat-output it was
possible to get 100% usage inside a CT irrespective of its
cpulimit/cpus setup. So let's fix it by defining the maximal cpu usage
a CT can get to be equal to cpulimit.
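
A toy illustration of the difference (plain userspace C, not the exact
kernel formula):

#include <stdio.h>

int main(void)
{
	unsigned int nr_vcpus = 2;
	unsigned int cpulimit = 150;          /* percent, i.e. 1.5 cpus   */
	unsigned long long delta = 1000;      /* wall-clock ticks elapsed */
	unsigned long long used = 1500;       /* ticks the CT really ran  */

	/* before: the base was cpus * 100%, so idle time never vanished */
	unsigned long long base_old = nr_vcpus * delta;
	/* after: the base is what the cpulimit actually allows */
	unsigned long long base_new = delta * cpulimit / 100;

	printf("old: %llu%% busy\n", used * 100 / base_old);  /* 75%  */
	printf("new: %llu%% busy\n", used * 100 / base_new);  /* 100% */
	return 0;
}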

https://jira.sw.ru/browse/PSBM-28500

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

Acked-by: Kirill Tkhai <ktkhai at parallels.com>
=============================================================================

Author: Kirill Tkhai
Email: ktkhai at parallels.com
Subject: sched: Hide steal time from inside CT
Date: Tue, 6 May 2014 18:27:40 +0400

https://jira.sw.ru/browse/PSBM-26587

From the BUG report by Khorenko Konstantin:

        "FastVPS complains on incorrect idle time calculation
        (PSBM-23431) and about high _steal_ time reported inside a CT.
        Steal time is a time when a CT was ready to run on a physical CPU,
        but the CPU was busy with processes which belong to another CT.
        => in case we have 10 CTs which eat cpu time as much as possible,
        then steal time in each CT will be 90% and execution time 10% only.

        i suggest to make steal time always shown as 0 inside Containers in
        order not to confuse end users.  At the same time steal time for
        Containers should be visible from the HN (host), this is useful for
        development.

        No objections neither from Pasha Emelyanov nor from Maik".

So, we do this so as not to scare users.

Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>

Acked-by: Vladimir Davydov <vdavydov at parallels.com>
=============================================================================

Author: Vladimir Davydov
Email: vdavydov at parallels.com
Subject: sched: cpu.proc.stat: always count nr_running and co on all cpus
Date: Wed, 30 Jul 2014 13:32:20 +0400

Currently we count them only on cpus 0..nr_vcpus, which is obviously
wrong, because those numbers are kept in a per-pcpu structure.

https://jira.sw.ru/browse/PSBM-28277

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

=============================================================================

Related to https://jira.sw.ru/browse/PSBM-33642

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>

+++
ve/sched: take nr_cpus and cpu_rate from ve root task group

Patchset description:

ve: properly handle nr_cpus and cpu_rate for nested cgroups

https://jira.sw.ru/browse/PSBM-69678

Pavel Tikhomirov (3):
  cgroup: remove rcu_read_lock from cgroup_get_ve_root
  cgroup: make cgroup_get_ve_root visible in kernel/sched/core.c
  sched: take nr_cpus and cpu_rate from ve root task group

=============================================================
This patch description:
The cpu view in a container should depend only on the root cpu
cgroup's nr_cpus/rate configuration. So replace tg->xxx references
with tg_xxx(tg) helpers to get xxx from the root ve cgroup. We still
allow setting/reading rate and nr_cpus directly in nested cgroups, but
they are just converted to the corresponding cfs_period and cfs_quota
setup and do _not_ influence the in-container view of cpus and their
stats.
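
A sketch of the helper shape this refers to (ve_root_tg() and the
nr_cpus field are names assumed from this series):

static unsigned int tg_nr_cpus(struct task_group *tg)
{
	struct task_group *ve_root = ve_root_tg(tg); /* NULL on the host */

	if (ve_root && ve_root->nr_cpus)
		return ve_root->nr_cpus;

	/* host or unlimited CT: fall back to the real cpu count */
	return num_online_cpus();
}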

Also remove the excessive rcu_read_lock/unlock as we have no rcu
dereference in between; it looks like a leftover from task_group(),
which differs between VZ6 and VZ7.

https://jira.sw.ru/browse/PSBM-69678

Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin at virtuozzo.com>

+++
ve/sched: remove no ve root cgroup warning

We can get to ve_root_tg() from a host cgroup, so it is expected that
there is no ve root cgroup for it. Call stack on task wakeup:

wake_up_process -> try_to_wake_up -> select_task_rq_fair
-> select_runnable_cpu -> check_cpulimit_spread -> tg_cpu_rate
-> ve_root_tg

Fixes: e661261 ("ve/sched: take nr_cpus and cpu_rate from ve root task group")

Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

+++
sched: Fallback to 0-order allocations in sched_create_group()

On large machines the ->cpustat_last/->vcpustat arrays are large,
possibly causing a high-order allocation to fail:

       fio: page allocation failure: order:4, mode:0xc0d0

       Call Trace:
         dump_stack+0x19/0x1b
         warn_alloc_failed+0x110/0x180
          __alloc_pages_nodemask+0x7bf/0xc60
         alloc_pages_current+0x98/0x110
         kmalloc_order+0x18/0x40
         kmalloc_order_trace+0x26/0xa0
         __kmalloc+0x279/0x290
         sched_create_group+0xba/0x150
         sched_autogroup_create_attach+0x3f/0x1a0
         sys_setsid+0x73/0xc0
         system_call_fastpath+0x16/0x1b

Use kvzalloc() to fall back to vmalloc() and avoid failure if a
high-order page is not available.
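
A sketch of the change in sched_create_group() (field names taken from
the message above, error handling abbreviated):

tg->cpustat_last = kvzalloc(nr_cpu_ids * sizeof(struct kernel_cpustat),
			    GFP_KERNEL);
tg->vcpustat = kvzalloc(nr_cpu_ids * sizeof(struct kernel_cpustat),
			GFP_KERNEL);
if (!tg->cpustat_last || !tg->vcpustat)
	goto err;	/* the usual free_sched_group() unwind */

/* ... and the matching releases switch from kfree() to kvfree() */
kvfree(tg->cpustat_last);
kvfree(tg->vcpustat);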

https://jira.sw.ru/browse/PSBM-79891
Fixes: 85fd6b2ff490 ("sched: Port cpustat related patches")

Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
Acked-by: Kirill Tkhai <ktkhai at virtuozzo.com>

=============================================================
proc/cpu/cgroup: make boottime in CT reveal the real start time

When we call 'ps axfw -o pid,stime' in a container to show the start
time of the CT's processes, it looks like all of them are time
travelers from the future; this happens in case the CT was migrated
recently.

This happens because ps takes the start_time from /proc/<pid>/stat,
which should be relative to the boottime, so ps adds it to the
boottime taken from /proc/stat and prints the result. But before this
patch the boottime in /proc/stat in a CT was just the time of cgroup
creation (after migration), while the start_time in a CT is relative
to the real boottime of the CT (before the migration).

So make the boottime in /proc/stat in a CT be the real boottime of the
CT to fix this mess. Collateral damage is that we will always see the
host boottime in cpu.proc.stat files on the host, but I don't think
that cgroup creation times made any sense there anyway.
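
For reference, this is how ps reconstructs an absolute start time
(userspace illustration with made-up values):

#include <stdio.h>

int main(void)
{
	/* values are made up; field 22 of /proc/<pid>/stat is in ticks */
	unsigned long long btime     = 1600000000ULL;  /* sec, /proc/stat  */
	unsigned long long starttime = 360000ULL;      /* ticks since boot */
	unsigned long long hz        = 100ULL;         /* USER_HZ          */

	/* btime must be the CT's real boot time for this to be correct */
	printf("process started at %llu (unix time)\n",
	       btime + starttime / hz);
	return 0;
}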

https://jira.sw.ru/browse/PSBM-94263
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

=============================================================
sched: Account task_group::cpustat,taskstats,avenrun

Extracted from "Initial patch".

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

=============================================================
sched: Export per task_group statistics_work

loadavg, iowait etc.

Extracted from "Initial patch".

Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>

=============================================================
---
 kernel/sched/cpuacct.c | 159 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9f0ec721aec7..b1460db447e3 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -419,6 +419,51 @@ struct kernel_cpustat *cpuacct_cpustat(struct cgroup_subsys_state *css, int cpu)
 	return per_cpu_ptr(css_ca(css)->cpustat, cpu);
 }
 
+static void cpu_cgroup_update_stat(struct cgroup_subsys_state *cpu_css,
+				   struct cgroup_subsys_state *cpuacct_css,
+				   int i)
+{
+#if defined(CONFIG_SCHEDSTATS) && defined(CONFIG_FAIR_GROUP_SCHED)
+	struct task_group *tg = css_tg(cpu_css);
+	struct sched_entity *se = tg->se[i];
+	u64 *cpustat = cpuacct_cpustat(cpuacct_css, i)->cpustat;
+	u64 now = cpu_clock(i);
+	u64 delta, idle, iowait, steal, used;
+
+	/* root_task_group has not sched entities */
+	if (tg == &root_task_group)
+		return;
+
+	iowait = se->statistics.iowait_sum;
+	idle = se->statistics.sum_sleep_runtime;
+	steal = se->statistics.wait_sum;
+	used = se->sum_exec_runtime;
+
+	if (idle > iowait)
+		idle -= iowait;
+	else
+		idle = 0;
+
+	if (se->statistics.sleep_start) {
+		delta = now - se->statistics.sleep_start;
+		if ((s64)delta > 0)
+			idle += delta;
+	} else if (se->statistics.block_start) {
+		delta = now - se->statistics.block_start;
+		if ((s64)delta > 0)
+			iowait += delta;
+	} else if (se->statistics.wait_start) {
+		delta = now - se->statistics.wait_start;
+		if ((s64)delta > 0)
+			steal += delta;
+	}
+
+	cpustat[CPUTIME_IDLE]	= max(cpustat[CPUTIME_IDLE], idle);
+	cpustat[CPUTIME_IOWAIT]	= max(cpustat[CPUTIME_IOWAIT], iowait);
+	cpustat[CPUTIME_STEAL]	= steal;
+#endif
+}
+
 static void fixup_vcpustat_delta_usage(struct kernel_cpustat *cur,
 				       struct kernel_cpustat *rem, int ind,
 				       u64 cur_usage, u64 target_usage,
@@ -588,3 +633,117 @@ static void cpu_cgroup_update_vcpustat(struct cgroup_subsys_state *cpu_css,
 out_unlock:
 	spin_unlock(&tg->vcpustat_lock);
 }
+
+int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
+			 struct cgroup_subsys_state *cpuacct_css,
+			 struct seq_file *p)
+{
+	int i;
+	s64 boot_sec;
+	u64 user, nice, system, idle, iowait, steal;
+	struct timespec64 boottime;
+	struct task_group *tg = css_tg(cpu_css);
+	bool virt = !ve_is_super(get_exec_env()) && tg != &root_task_group;
+	int nr_vcpus = tg_nr_cpus(tg);
+	struct kernel_cpustat *kcpustat;
+	unsigned long tg_nr_running = 0;
+	unsigned long tg_nr_iowait = 0;
+
+	getboottime64(&boottime);
+
+	/*
+	 * In VE0 we always show host's boottime and in VEX we show real CT
+	 * start time, even across CT migrations, as we rely on userspace to
+	 * set real_start_timespec for us on resume.
+	 */
+	boot_sec = boottime.tv_sec +
+		   get_exec_env()->real_start_time / NSEC_PER_SEC;
+
+	for_each_possible_cpu(i) {
+		cpu_cgroup_update_stat(cpu_css, cpuacct_css, i);
+
+		/* root task group has autogrouping, so this doesn't hold */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		tg_nr_running += tg->cfs_rq[i]->h_nr_running;
+		tg_nr_iowait  += tg->cfs_rq[i]->nr_iowait;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+		tg_nr_running += tg->rt_rq[i]->rt_nr_running;
+#endif
+	}
+
+	if (virt)
+		cpu_cgroup_update_vcpustat(cpu_css, cpuacct_css);
+
+	user = nice = system = idle = iowait = steal = 0;
+
+	for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+		if (!virt && !cpu_possible(i))
+			continue;
+
+		kcpustat = virt ? tg->vcpustat + i :
+				  cpuacct_cpustat(cpuacct_css, i);
+
+		user	+= kcpustat->cpustat[CPUTIME_USER];
+		nice	+= kcpustat->cpustat[CPUTIME_NICE];
+		system	+= kcpustat->cpustat[CPUTIME_SYSTEM];
+		idle	+= kcpustat->cpustat[CPUTIME_IDLE];
+		iowait	+= kcpustat->cpustat[CPUTIME_IOWAIT];
+		steal	+= kcpustat->cpustat[CPUTIME_STEAL];
+	}
+	/* Don't scare CT users with high steal time */
+	if (!ve_is_super(get_exec_env()))
+		steal = 0;
+
+	seq_printf(p, "cpu  %llu %llu %llu %llu %llu 0 0 %llu\n",
+		   (unsigned long long)nsec_to_clock_t(user),
+		   (unsigned long long)nsec_to_clock_t(nice),
+		   (unsigned long long)nsec_to_clock_t(system),
+		   (unsigned long long)nsec_to_clock_t(idle),
+		   (unsigned long long)nsec_to_clock_t(iowait),
+		   virt ? 0ULL :
+		   (unsigned long long)nsec_to_clock_t(steal));
+
+	for (i = 0; i < (virt ? nr_vcpus : nr_cpu_ids); i++) {
+		if (!virt && !cpu_online(i))
+			continue;
+		kcpustat = virt ? tg->vcpustat + i :
+				  cpuacct_cpustat(cpuacct_css, i);
+
+		user	= kcpustat->cpustat[CPUTIME_USER];
+		nice	= kcpustat->cpustat[CPUTIME_NICE];
+		system	= kcpustat->cpustat[CPUTIME_SYSTEM];
+		idle	= kcpustat->cpustat[CPUTIME_IDLE];
+		iowait	= kcpustat->cpustat[CPUTIME_IOWAIT];
+		steal	= kcpustat->cpustat[CPUTIME_STEAL];
+		/* Don't scare CT users with high steal time */
+		if (!ve_is_super(get_exec_env()))
+			steal = 0;
+
+		seq_printf(p,
+			   "cpu%d %llu %llu %llu %llu %llu 0 0 %llu\n",
+			   i,
+			   (unsigned long long)nsec_to_clock_t(user),
+			   (unsigned long long)nsec_to_clock_t(nice),
+			   (unsigned long long)nsec_to_clock_t(system),
+			   (unsigned long long)nsec_to_clock_t(idle),
+			   (unsigned long long)nsec_to_clock_t(iowait),
+			   virt ? 0ULL :
+			   (unsigned long long)nsec_to_clock_t(steal));
+	}
+	seq_printf(p, "intr 0");
+
+	seq_printf(p,
+		   "\nctxt %llu\n"
+		   "btime %llu\n"
+		   "processes %lu\n"
+		   "procs_running %lu\n"
+		   "procs_blocked %lu\n",
+		   nr_context_switches(),
+		   (unsigned long long)boot_sec,
+		   total_forks,
+		   tg_nr_running,
+		   tg_nr_iowait);
+
+	return 0;
+}
-- 
2.28.0


