[Devel] [PATCH RHEL8 COMMIT] ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()

Konstantin Khorenko khorenko at virtuozzo.com
Fri Jul 30 18:06:54 MSK 2021


The commit is pushed to "branch-rh8-4.18.0-305.3.1.vz8.7.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-305.3.1.vz8.7.1
------>
commit b0b80903261d28d5fa0f096ee3930d23c9bb5be5
Author: Andrey Ryabinin <aryabinin at virtuozzo.com>
Date:   Fri Jul 30 18:06:54 2021 +0300

    ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()
    
    Make it possible to read a virtualized container's CLOCK_MONOTONIC time
    via __vdso_gettimeofday(). Record the container's start time in the per-ve
    vdso and subtract it from the host's time on clock read.
    
    https://jira.sw.ru/browse/PSBM-121668
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    Reviewed-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    khorenko@ notes:
    1) effectively we store in the vdso area the same ve->start_time value.
       If a CT has previously been running for, say, 5 ns, we store
       (now - 5) in ve->start_time, so a later monotonic_abs_to_ve()
       returns now - (ve->start_time) == now - (now - 5) == 5
       (see the sketch after these notes).
    
    2) the introduced timespec_sub_ns() function has the "inline" attribute - this
       is fine.
       The stock timespec_add_ns() has the "__always_inline" attribute, but the
       function is static, so there will be separate copies of it anyway
       even if it is used in other files.
    
    3) timespec_sub_ns() is introduced as an optimization:
       if we used timespec_add_ns(ns) + monotonic_time_to_ve(), there would be
       two rounds of __iter_div_u64_rem().
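
    A minimal, standalone sketch of the arithmetic from note 1) (the sample
    values and the user-space types are illustrative only; in the kernel this
    is ve->start_time and monotonic_abs_to_ve()):

            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    uint64_t now = 100;           /* host CLOCK_MONOTONIC, ns */
                    uint64_t already_ran = 5;     /* CT has already run 5 ns  */

                    /* stored on (re)start: now - 5 == 95 */
                    uint64_t ve_start_time = now - already_ran;

                    /* returned to the CT on a clock read: 100 - 95 == 5 */
                    printf("CT monotonic: %llu ns\n",
                           (unsigned long long)(now - ve_start_time));
                    return 0;
            }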
    
    ===============================================
    The original vz7 commit message (f7188f105626):
    
        ve/vdso: virtualized monotonic gettime through vdso
    
        We already have infrastructure for a virtualized vdso, however we use
        it only to change LINUX_VERSION_NAME in a container. Simply store the
        container's start time - ve->start_timespec - in a vdso variable -
        VDSO64_ve_start_timespec - and use it in __vdso_clock_gettime() to
        calculate the container's monotonic time.
    
        Make uts_arch_setup_additional_pages()/uts_prep_vdso_pages_locked() always
        set up the new vdso, since the previous policy of setting up the vdso only
        based on uts_ns->name.release wouldn't work for the virtualized
        __vdso_clock_gettime().
    
        https://jira.sw.ru/browse/PSBM-66451
    
        Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
        Reviewed-by: Dmitry Safonov <dsafonov at virtuozzo.com>
    
    +++
    x86_64, vclock_gettime: Use standard division instead of __iter_div_u64_rem()
    
    timespec_sub_ns() historically uses __iter_div_u64_rem() for division.
    Presumably it is supposed to be faster:
    
            /*
             * Iterative div/mod for use when dividend is not expected to be much
             * bigger than divisor.
             */
            u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
    
    However, in our case ve_start_time may make the dividend much bigger than the
    divisor. So let's use the standard "/" instead of the iterative one. With a zero
    ve_start_time I wasn't able to see a measurable difference, but with a big
    ve_start_time the difference is rather significant:
    
     # time ./clock_iter_div
    real    1m30.224s
    user    1m30.343s
    sys     0m0.008s
    
     # time taskset ./clock_div
    real    0m2.757s
    user    0m1.730s
    sys     0m0.066s
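
    (The clock_iter_div/clock_div sources are not shipped with this commit; a
    hypothetical reconstruction of such a micro-benchmark is simply a tight
    loop on the vdso CLOCK_MONOTONIC path:)

            #include <time.h>

            int main(void)
            {
                    struct timespec ts;
                    unsigned long i;

                    /* ~1e8 reads, all served by the vdso without a syscall */
                    for (i = 0; i < 100000000UL; i++)
                            clock_gettime(CLOCK_MONOTONIC, &ts);
                    return 0;
            }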
    
    The 32-bit vdso doesn't like 64-bit division and doesn't link;
    I think it needs __udivsi3(). So just fall back to __iter_div_u64_rem()
    on 32-bit.
    
    Fixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()")
    
    https://jira.sw.ru/browse/PSBM-121856
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    +++
    vdso, vclock_gettime: fix linking with old linkers
    
    On some old linkers the vdso fails to build because of a
    dynamic relocation of the 've_start_time' symbol:
            VDSO2C  arch/x86/entry/vdso/vdso-image-64.c
            Error: vdso image contains dynamic relocations
    
    I wasn't able to figure out why new linkers don't generate the relocation
    while old ones do, but I did find out that the visibility("hidden")
    attribute on 've_start_time' cures the problem.
    
    Fixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()")
    
    https://jira.sw.ru/browse/PSBM-121668
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    (cherry picked from commit 0c7977348b6acaae464f64b8830204e0a0ee043c)
    
    This patch is incompatible with time namespaces because it completely
    ignores their existence :)
    
    When we decide to move to time namespaces and set
    CONFIG_TIME_NS=y
    we will have to drop this particular patch and rework the userspace
    code to properly set up a time namespace for the container.
    
    Time namespaces natively handle all vdso time-related stuff.
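
    For reference, a rough user-space sketch (not part of this patch; assumes
    Linux >= 5.6 with CONFIG_TIME_NS=y and the /proc/<pid>/timens_offsets
    interface) of how a container manager could set a monotonic offset through
    a native time namespace instead:

            #define _GNU_SOURCE
            #include <sched.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>
            #include <unistd.h>
            #include <sys/wait.h>

            #ifndef CLONE_NEWTIME
            #define CLONE_NEWTIME 0x00000080
            #endif

            int main(void)
            {
                    FILE *f;

                    /* new time namespace for the children created next */
                    if (unshare(CLONE_NEWTIME))
                            exit(1);

                    /* offsets are writable only until the first task enters the
                     * namespace; format: "<clock> <seconds> <nanoseconds>" */
                    f = fopen("/proc/self/timens_offsets", "w");
                    if (!f || fprintf(f, "monotonic 86400 0\n") < 0)
                            exit(1);
                    fclose(f);

                    if (fork() == 0) {
                            /* the child runs in the new namespace and sees
                             * CLOCK_MONOTONIC shifted by one day */
                            struct timespec ts;

                            clock_gettime(CLOCK_MONOTONIC, &ts);
                            printf("child CLOCK_MONOTONIC: %lld s\n",
                                   (long long)ts.tv_sec);
                            exit(0);
                    }
                    wait(NULL);
                    return 0;
            }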
    
    https://jira.sw.ru/browse/PSBM-131977
    
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn at virtuozzo.com>
---
 arch/x86/entry/vdso/vdso2c.c |  1 +
 arch/x86/include/asm/vdso.h  |  1 +
 kernel/ve/ve.c               | 14 +++++++++++
 lib/vdso/gettimeofday.c      | 57 ++++++++++++++++++++++++++++++++++++++------
 4 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index c291f7a1fa73..cce63b8e15ec 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -102,6 +102,7 @@ struct vdso_sym required_syms[] = {
 	{"__kernel_rt_sigreturn", true},
 	{"int80_landing_pad", true},
 	{"linux_version_code", true},
+	{"ve_start_time", true},
 };
 
 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 87c5e0df1d9d..7c1fcfa8bc3a 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -32,6 +32,7 @@ struct vdso_image {
 	RH_KABI_EXTEND(unsigned long extable_len)
 	RH_KABI_EXTEND(const void *extable)
 	long sym_linux_version_code;
+	long sym_ve_start_time;
 };
 
 #ifdef CONFIG_X86_64
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index dd0faf95452e..017435152689 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -646,6 +646,17 @@ void ve_rm_from_release_list(struct cgroup *cgrp)
 	rcu_read_unlock();
 }
 
+static void ve_set_vdso_time(struct ve_struct *ve, u64 time)
+{
+	u64 *vdso_start_time;
+
+	vdso_start_time = ve->vdso_64->data + ve->vdso_64->sym_ve_start_time;
+	*vdso_start_time = time;
+
+	vdso_start_time = ve->vdso_32->data + ve->vdso_32->sym_ve_start_time;
+	*vdso_start_time = time;
+}
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -680,6 +691,8 @@ static int ve_start_container(struct ve_struct *ve)
 	if (ve->start_time == 0) {
 		ve->start_time = tsk->start_time;
 		ve->real_start_time = tsk->real_start_time;
+
+		ve_set_vdso_time(ve, ve->start_time);
 	}
 	/* The value is wrong, but it is never compared to process
 	 * start times */
@@ -1409,6 +1422,7 @@ static ssize_t ve_ts_write(struct kernfs_open_file *of, char *buf,
 		case VE_CF_CLOCK_MONOTONIC:
 			now = ktime_get_ns();
 			target = &ve->start_time;
+			ve_set_vdso_time(ve, now - delta_ns);
 			break;
 		case VE_CF_CLOCK_BOOTBASED:
 			now = ktime_get_boot_ns();
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index c8cfdc6bd2f5..7485508c9cc7 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -6,6 +6,8 @@
 #include <vdso/datapage.h>
 #include <vdso/helpers.h>
 
+u64 ve_start_time 	__attribute__((visibility("hidden")));
+
 #ifndef vdso_calc_delta
 /*
  * Default implementation which works for all sane clocksources. That
@@ -46,6 +48,35 @@ static inline bool vdso_cycles_ok(u64 cycles)
 }
 #endif
 
+static inline u64 divu64(u64 dividend, u32 divisor, u64 *remainder)
+{
+	/* 32-bit wants __udivsi3() and fails to link, so fallback to iter */
+#ifndef BUILD_VDSO32
+	u64 res;
+
+	res = dividend/divisor;
+	*remainder = dividend % divisor;
+	return res;
+#else
+	return __iter_div_u64_rem(dividend, divisor, remainder);
+#endif
+}
+
+static inline void timespec_sub_ns(struct __kernel_timespec *ts, u64 ns)
+{
+	if ((s64)ns <= 0) {
+		ts->tv_sec += divu64(-ns, NSEC_PER_SEC, &ns);
+		ts->tv_nsec = ns;
+	} else {
+		ts->tv_sec -= divu64(ns, NSEC_PER_SEC, &ns);
+		if (ns) {
+			ts->tv_sec--;
+			ns = NSEC_PER_SEC - ns;
+		}
+		ts->tv_nsec = ns;
+	}
+}
+
 #ifdef CONFIG_TIME_NS
 static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
 			  struct __kernel_timespec *ts)
@@ -149,12 +180,17 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk,
 		sec = vdso_ts->sec;
 	} while (unlikely(vdso_read_retry(vd, seq)));
 
-	/*
-	 * Do this outside the loop: a race inside the loop could result
-	 * in __iter_div_u64_rem() being extremely slow.
-	 */
-	ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-	ts->tv_nsec = ns;
+	if (clk != CLOCK_MONOTONIC) {
+		/*
+		 * Do this outside the loop: a race inside the loop could result
+		 * in __iter_div_u64_rem() being extremely slow.
+		 */
+		ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+		ts->tv_nsec = ns;
+	} else {
+		ts->tv_sec = sec;
+		timespec_sub_ns(ts, ve_start_time - ns);
+	}
 
 	return 0;
 }
@@ -201,6 +237,7 @@ static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk,
 {
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
 	u32 seq;
+	u64 ns;
 
 	do {
 		/*
@@ -216,9 +253,15 @@ static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk,
 		smp_rmb();
 
 		ts->tv_sec = vdso_ts->sec;
-		ts->tv_nsec = vdso_ts->nsec;
+		if (clk != CLOCK_MONOTONIC_COARSE)
+			ts->tv_nsec = vdso_ts->nsec;
+		else
+			ns = vdso_ts->nsec;
 	} while (unlikely(vdso_read_retry(vd, seq)));
 
+	if (clk == CLOCK_MONOTONIC_COARSE)
+		timespec_sub_ns(ts, ve_start_time - ns);
+
 	return 0;
 }
 

