[Devel] [PATCH RHEL COMMIT] ve: Add interface for ve::clock_[monotonic|bootbased] adjustment
Konstantin Khorenko
khorenko at virtuozzo.com
Mon Oct 4 21:53:18 MSK 2021
The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit fb82551ca2f34beb9ccfadec8b2ba5a2fb41998c
Author: Cyrill Gorcunov <gorcunov at virtuozzo.com>
Date: Mon Oct 4 21:53:18 2021 +0300
ve: Add interface for ve::clock_[monotonic|bootbased] adjustment
These two members represent the monotonic and bootbased clocks for
the container's uptime. When a container is suspended (or is being
moved to another node) we treat the monotonic and bootbased
clocks as stopped, so on restore we need to account for the elapsed
delta and adjust these members accordingly.
Moreover, these timestamps are involved in posix-timers
setup, so once an application tries to set up monotonic clocks
after the restore (with an absolute time specification) we
adjust the values as well.
The application which migrates a container must fetch
the current settings from /sys/fs/cgroup/ve/$VE/ve.real_start_timespec
and /sys/fs/cgroup/ve/$VE/ve.start_timespec, then write them
back on the restore.
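
A minimal user-space sketch of the "fetch" half of that procedure, assuming
the file holds a "<seconds> <nanoseconds>" pair (the layout printed by
ve_ts_read() in the diff further down). The helper names parse_ve_clock()
and read_ve_clock() are hypothetical and are not part of this patch; the
path is supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Parse a "<seconds> <nanoseconds>" pair (assumed file layout).
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_ve_clock(const char *buf, struct timespec *ts)
{
	long long sec;
	long nsec;

	assert(buf != NULL && ts != NULL);

	if (sscanf(buf, "%lld %ld", &sec, &nsec) != 2)
		return -1;
	if (sec < 0 || nsec < 0 || nsec >= 1000000000L)
		return -1;

	ts->tv_sec = (time_t)sec;
	ts->tv_nsec = nsec;
	return 0;
}

/* Read the first line of a clock file and parse it. */
int read_ve_clock(const char *path, struct timespec *ts)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_ve_clock(buf, ts);
	fclose(f);
	return ret;
}
```

A migration tool would call read_ve_clock() once per clock file before the
dump and keep both values for the restore side.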
https://jira.sw.ru/browse/PSBM-41311
https://jira.sw.ru/browse/PSBM-41406
v2:
- use clock_[monotonic|bootbased] for cgroup entry names instead
Original-by: Andrew Vagin <avagin at openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov at virtuozzo.com>
Reviewed-by: Vladimir Davydov <vdavydov at virtuozzo.com>
(cherry picked from vz7 commit 43f4b0c752abd84aa1b346373d152941123d2446
("ve: Add interface for @start_timespec and @real_start_timespec
adjustmen"))
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
+++
ve/time: Limit values to write in ve::clock_[monotonic|bootbased]
What do we mean when we write a value XXX into, say, ve::ve.clock_bootbased?
We mean that "up to now the CT has already worked for XXX secs/usecs".
And we store the delta between Node "now" and XXX into ve->start_time_real.
If the CT has worked for less time than the current Node, ve->start_time_real
will contain a positive value and we'll subtract it from Node's "now" each
time we need to get the time since the CT start.
If the CT has worked longer than the current Node (say, the CT has been
migrated from another HN), the stored delta will be negative and thus we'll
"add" more time to Node's "now".
So then what do we want to limit?
1. Negative values written to ve::clock_[monotonic|bootbased].
Indeed, we can hardly imagine that the CT has been started but the
time since its start is negative.
2. A big positive value, so some time later when we read from
ve::clock_[monotonic|bootbased] we get an overflowed value.
Both these checks are performed by timespec_valid_strict().
mFixes: 25cab3041305 ("ve: Add interface for
ve::clock_[monotonic|bootbased] adjustment")
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>
Leave ve.clock_* read-only; ve time offsets should now be
configured via the time namespace.
(cherry picked from vz8 commit ad5d9cc5fd627579b56c12b4609523fcf4a7bde6)
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
====================
Patchset description:
ve/time: switch from our ve-time to native timenamespace
https://jira.sw.ru/browse/PSBM-134393
As time-namespaces are a new and mainstreamed version of ve-time, it's
time to switch to it.
Notes:
1) ve-time does not need configuration on start, though time namespace
needs configuration (offset == -now).
2) ve-time saved container start time but time namespaces save offset
between host start time and container start time
(offset == ve_start_time - now).
3) criu already knows how to handle time namespaces, though we need to
do a compatibility layer to convert our ve.clock_* to offsets in time
namespace for pre-vz9 to vz9 migration.
4) vdso time is already handled by time namespaces, though time
namespace only virtualizes vvar page, so it should not intersect with
our vdso virtualization for ve.os_release.
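
As a rough illustration of the compatibility conversion mentioned in note 3,
the sketch below assumes the relation stated in the ve_ts_read() comment in
the diff: the ve-relative time equals the time namespace offset plus the host
"now". The type and function names are hypothetical; this is not criu's
actual implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Bring the nanosecond part into [0, NSEC_PER_SEC). */
static struct ts64 ts64_normalize(int64_t sec, int64_t nsec)
{
	struct ts64 r;

	sec += nsec / NSEC_PER_SEC;
	nsec %= NSEC_PER_SEC;
	if (nsec < 0) {
		nsec += NSEC_PER_SEC;
		sec -= 1;
	}
	r.tv_sec = sec;
	r.tv_nsec = (long)nsec;
	return r;
}

/* offset = ve_clock - host_now; a freshly started CT gets -host_now. */
struct ts64 ve_clock_to_timens_offset(struct ts64 ve_clock, struct ts64 host_now)
{
	assert(ve_clock.tv_nsec >= 0 && ve_clock.tv_nsec < NSEC_PER_SEC);
	return ts64_normalize(ve_clock.tv_sec - host_now.tv_sec,
			      (int64_t)ve_clock.tv_nsec - host_now.tv_nsec);
}

/* ve_clock = offset + host_now, the value ve.clock_* reports. */
struct ts64 timens_offset_to_ve_clock(struct ts64 offset, struct ts64 host_now)
{
	assert(offset.tv_nsec >= 0 && offset.tv_nsec < NSEC_PER_SEC);
	return ts64_normalize(offset.tv_sec + host_now.tv_sec,
			      (int64_t)offset.tv_nsec + host_now.tv_nsec);
}
```

With a zero ve clock the first helper yields -now, which matches the
start-up configuration described in note 1.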
https://jira.sw.ru/browse/PSBM-134393
Cyrill Gorcunov (1):
ve: Add interface for ve::clock_[monotonic|bootbased] adjustment
Kirill Tkhai (2):
ve/time: Use ve_relative_clock in times() syscall and /proc/[pid]/stat
ve: Virtualize sysinfo
Pavel Tikhomirov (1):
ve/time: remove our per-ve times in favor of mainstream
time-namespaces
Valeriy Vdovin (1):
ve/proc: Added separate start time field to task_struct to show in
container
---
include/linux/time_namespace.h | 2 ++
kernel/time/namespace.c | 2 +-
kernel/ve/ve.c | 63 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 3146f1c056c9..588941c29236 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -28,6 +28,8 @@ struct time_namespace {
extern struct time_namespace init_time_ns;
+extern struct mutex offset_lock;
+
#ifdef CONFIG_TIME_NS
extern int vdso_join_timens(struct task_struct *task,
struct time_namespace *ns);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index aec832801c26..ec9a00f45dbe 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -196,7 +196,7 @@ static void timens_setup_vdso_data(struct vdso_data *vdata,
* Protects possibly multiple offsets writers racing each other
* and tasks entering the namespace.
*/
-static DEFINE_MUTEX(offset_lock);
+DEFINE_MUTEX(offset_lock);
static void timens_set_vvar_page(struct task_struct *task,
struct time_namespace *ns)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index b77429f10df2..21eec9973a9a 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -24,6 +24,7 @@
#include <linux/kthread.h>
#include <linux/nsproxy.h>
#include <linux/fs_struct.h>
+#include <linux/time_namespace.h>
#include <linux/genhd.h>
#include <uapi/linux/vzcalluser.h>
@@ -1046,6 +1047,56 @@ static ssize_t ve_os_release_write(struct kernfs_open_file *of, char *buf,
return ret ? ret : nbytes;
}
+enum {
+ VE_CF_CLOCK_MONOTONIC,
+ VE_CF_CLOCK_BOOTBASED,
+};
+
+static int ve_ts_read(struct seq_file *sf, void *v)
+{
+ struct ve_struct *ve = css_to_ve(seq_css(sf));
+ struct nsproxy *ve_ns;
+ struct time_namespace *time_ns;
+ struct timespec64 tp = ns_to_timespec64(0);
+ struct timespec64 *offset = NULL;
+
+ rcu_read_lock();
+ ve_ns = rcu_dereference(ve->ve_ns);
+ if (!ve_ns) {
+ rcu_read_unlock();
+ goto out;
+ }
+
+ time_ns = get_time_ns(ve_ns->time_ns);
+ rcu_read_unlock();
+
+ switch (seq_cft(sf)->private) {
+ case VE_CF_CLOCK_MONOTONIC:
+ ktime_get_ts64(&tp);
+ offset = &time_ns->offsets.monotonic;
+ break;
+ case VE_CF_CLOCK_BOOTBASED:
+ ktime_get_boottime_ts64(&tp);
+ offset = &time_ns->offsets.boottime;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ goto out_ns;
+ }
+
+ /*
+ * Note: ve.clock_* fields should report ve-relative time, but timens
+ * offsets instead report the offset between ns-relative time and host
+ * time, so we need to print offset+now to show ve-relative time.
+ */
+ tp = timespec64_add(tp, *offset);
+out_ns:
+ put_time_ns(time_ns);
+out:
+ seq_printf(sf, "%lld %ld", tp.tv_sec, tp.tv_nsec);
+ return 0;
+}
+
static int ve_mount_opts_read(struct seq_file *sf, void *v)
{
struct ve_struct *ve = css_to_ve(seq_css(sf));
@@ -1213,6 +1264,18 @@ static struct cftype ve_cftypes[] = {
.read_u64 = ve_reatures_read,
.write_u64 = ve_reatures_write,
},
+ {
+ .name = "clock_monotonic",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = ve_ts_read,
+ .private = VE_CF_CLOCK_MONOTONIC,
+ },
+ {
+ .name = "clock_bootbased",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = ve_ts_read,
+ .private = VE_CF_CLOCK_BOOTBASED,
+ },
{
.name = "netns_max_nr",
.flags = CFTYPE_NOT_ON_ROOT,