[Devel] [PATCH RHEL8 COMMIT] ve/posix-timers: reference ve monotonic clock from ve start (v2)

Konstantin Khorenko khorenko at virtuozzo.com
Wed Oct 14 15:52:17 MSK 2020


The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
------>
commit 5f2799f1c2d5997984d15ffc901314eedc6b1295
Author: Kirill Tkhai <ktkhai at parallels.com>
Date:   Tue Aug 5 13:22:35 2014 +0400

    ve/posix-timers: reference ve monotonic clock from ve start (v2)
    
    So that CLOCK_MONOTONIC will be monotonic even if ve is migrated to
    another hw node.
    
    Note, translating ve <-> abs time in clock_settime and timer_settime is
    not necessary because (1) clock_settime won't set monotonic clock and
    (2) timer_gettime always returns relative time.
    
    https://jira.sw.ru/browse/PSBM-13860
    
    diff-posix_timers-reference-ct-monotonic-clock-from-ct-start
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Acked-by: Pavel Emelyanov <xemul at parallels.com>
    Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
    
    +++
    ve/posix-timers: reference ve monotonic clock from start in clock_nanosleep
    
    This is an addition to diff-posix_timers-reference-ct-monotonic-clock-from-ct-start
    
    Otherwise, apps that use sys_clock_nanosleep() to suspend their
    execution can hang after ve migration.
    
    diff-posix-timers-reference-ve-monotonic-clock-from-ve-start-in-clock_nanosleep
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Acked-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
    Acked-by: Pavel Emelyanov <xemul at parallels.com>
    Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
    
    +++
    timers: Port diff-ve-timers-convert-ve-monotonic-to-abs-time-when-setting-timerfd-2
    
    Need this for docker, as sometimes systemd-tmpfiles-clean.timer inside
    a PCS7 CT is spamming dbus with requests to start corresponding service.
    And at the same time Docker tries to create cgroup for container and
    attach it to hierarchies like memory and blkio.
    
    That is because systemd timer was triggered with non-virtualized timerfd
    using plain host clock but check that timer is successfull uses
    virtualized clock_gettime and don't pass before proper(in-container)
    timer activation. And timers charges again and again starts service
    got in busy loop.
    
    https://jira.sw.ru/browse/PSBM-34017
    
    v2: move the stubs to ve.h
    
    Port the following RH6 commit:
    
      Author: Vladimir Davydov
      Email: vdavydov at parallels.com
      Subject: fs: convert ve monotonic to abs time when setting timerfd
      Date: Fri, 15 Feb 2013 11:57:09 +0400
    
      * [timers] corrected TFD_TIMER_ABSTIME timer handling,
        the issue led to high cpu usage inside a Fedora 18 CT
        by 'init' process (PSBM-18284)
    
      Monotonic time inside container, as it can be obtained using various
      system calls such as clock_gettime, is reported since start of the container,
      not since start of the whole system. This was made in order to avoid time
      issues while a container is migrated between different physical hosts, but this
      also introduced a lot of problems in time- related system calls because
      absolute monotonic time, which is in fact relative to container, passed to those
      system calls must be converted to system-wide monotonic time, which is used by
      kernel hrtimers.
    
      One of those buggy system calls is timerfd_settime which accepts as an
      argument absolute time if flag TFD_TIMER_ABSTIME is specified.
    
      The patch fixes it by converting container monotonic time to system-
      wide monotonic time using the monotonic_ve_to_abs() function, which was
      introduced earlier and is now exported for that reason.
    
      https://jira.sw.ru/browse/PSBM-18284
    
      Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    Signed-off-by: Kirill Tkhai <ktkhai at odin.com>
    Reviewed-by: Vladimir Davydov <vdavydov at parallels.com>
    
    (cherry picked from vz7 commit 869542c24c41c0578b47d2ef83cfa63427e0e5e1)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    +++
    timers should not get negative argument
    
    This patch fixes 25-sec delay on login into systemd based containers.
    
    Userspace application can set timer for past
    and expect that the timer will be expired immediately.
    
    This can do not work as expected inside migrated containers.
    Translated argument provided to timer can become negative,
    and according timer will sleep a very long time.
    
    https://jira.sw.ru/browse/PSBM-48475
    
    CC: Vladimir Davydov <vdavydov at virtuozzo.com>
    CC: Konstantin Khorenko <khorenko at virtuozzo.com>
    Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
    Acked-by: Cyrill Gorcunov <gorcunov at openvz.org>
    
    (cherry picked from vz7 commit a71fa19facb00472e47760255ab2e6fa16885732)
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
 fs/timerfd.c               |  8 +++++--
 include/linux/ve.h         |  8 +++++++
 kernel/time/posix-timers.c | 54 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index cdad49da3ff7..59ed38c29941 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
 #include <linux/syscalls.h>
 #include <linux/compat.h>
 #include <linux/rcupdate.h>
+#include <linux/ve.h>
 
 struct timerfd_ctx {
 	union {
@@ -432,8 +433,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
 	return ufd;
 }
 
-static int do_timerfd_settime(int ufd, int flags, 
-		const struct itimerspec64 *new,
+static int do_timerfd_settime(int ufd, int flags,
+		struct itimerspec64 *new,
 		struct itimerspec64 *old)
 {
 	struct fd f;
@@ -493,6 +494,9 @@ static int do_timerfd_settime(int ufd, int flags,
 	/*
 	 * Re-program the timer to the new value ...
 	 */
+	if ((flags & TFD_TIMER_ABSTIME) &&
+	    (new->it_value.tv_sec || new->it_value.tv_nsec))
+		monotonic_ve_to_abs(ctx->clockid, &new->it_value);
 	ret = timerfd_setup(ctx, flags, new);
 
 	spin_unlock_irq(&ctx->wqh.lock);
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 0db98e8e08c1..ec7dc522ac1f 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -129,6 +129,9 @@ static inline struct ve_struct *css_to_ve(struct cgroup_subsys_state *css)
 
 extern struct cgroup_subsys_state *ve_get_init_css(struct ve_struct *ve, int subsys_id);
 
+extern void monotonic_abs_to_ve(clockid_t which_clock, struct timespec64 *tp);
+extern void monotonic_ve_to_abs(clockid_t which_clock, struct timespec64 *tp);
+
 #define ve_feature_set(ve, f)			\
 	!!((ve)->features & VE_FEATURE_##f)
 
@@ -166,6 +169,11 @@ static inline struct user_namespace *ve_init_user_ns(void)
 static inline int vz_security_family_check(struct net *net, int family) { return 0; }
 static inline int vz_security_protocol_check(struct net *net, int protocol) { return 0; }
 
+static inline void monotonic_abs_to_ve(clockid_t which_clock,
+				       struct timespec64 *tp) { }
+static inline void monotonic_ve_to_abs(clockid_t which_clock,
+				       struct timepsec64 *tp) { }
+
 #endif	/* CONFIG_VE */
 
 #endif /* _LINUX_VE_H */
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 3d54de0a28bc..ddc848490263 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -51,6 +51,7 @@
 #include <linux/hashtable.h>
 #include <linux/compat.h>
 #include <linux/nospec.h>
+#include <linux/ve.h>
 
 #include "timekeeping.h"
 #include "posix-timers.h"
@@ -76,6 +77,46 @@ static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
 static const struct k_clock clock_realtime, clock_monotonic;
 
+#define clock_is_monotonic(which_clock) \
+	((which_clock) == CLOCK_MONOTONIC || \
+	 (which_clock) == CLOCK_MONOTONIC_RAW || \
+	 (which_clock) == CLOCK_MONOTONIC_COARSE)
+
+#ifdef CONFIG_VE
+static struct timespec64 zero_time;
+
+void monotonic_abs_to_ve(clockid_t which_clock, struct timespec64 *tp)
+{
+	struct timespec64 offset;
+
+	if (!clock_is_monotonic(which_clock))
+		return;
+
+	offset = ns_to_timespec64(get_exec_env()->start_time);
+	set_normalized_timespec64(tp,
+				  tp->tv_sec - offset.tv_sec,
+				  tp->tv_nsec - offset.tv_nsec);
+}
+
+void monotonic_ve_to_abs(clockid_t which_clock, struct timespec64 *tp)
+{
+	struct timespec64 offset;
+
+	if (!clock_is_monotonic(which_clock))
+		return;
+
+	offset = ns_to_timespec64(get_exec_env()->start_time);
+	set_normalized_timespec64(tp,
+				  tp->tv_sec + offset.tv_sec,
+				  tp->tv_nsec + offset.tv_nsec);
+
+	if (timespec64_compare(tp, &zero_time) <= 0) {
+		tp->tv_sec =  0;
+		tp->tv_nsec = 1;
+	}
+}
+#endif
+
 /*
  * we assume that the new SIGEV_THREAD_ID shares no bits with the other
  * SIGEV values.  Here we put out an error if this assumption fails.
@@ -895,6 +936,9 @@ static int do_timer_settime(timer_t timer_id, int flags,
 	if (!timr)
 		return -EINVAL;
 
+	if ((flags & TIMER_ABSTIME) &&
+	    (new_spec64->it_value.tv_sec || new_spec64->it_value.tv_nsec))
+		monotonic_ve_to_abs(timr->it_clock, &new_spec64->it_value);
 	kc = timr->kclock;
 	if (WARN_ON_ONCE(!kc || !kc->timer_set))
 		error = -EINVAL;
@@ -1072,6 +1116,7 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
 
 	error = kc->clock_get(which_clock, &kernel_tp);
 
+	monotonic_abs_to_ve(which_clock, &kernel_tp);
 	if (!error && put_timespec64(&kernel_tp, tp))
 		error = -EFAULT;
 
@@ -1148,6 +1193,7 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock,
 
 	err = kc->clock_get(which_clock, &ts);
 
+	monotonic_abs_to_ve(which_clock, &ts);
 	if (!err && compat_put_timespec64(&ts, tp))
 		err = -EFAULT;
 
@@ -1233,8 +1279,10 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags,
 
 	if (!timespec64_valid(&t))
 		return -EINVAL;
-	if (flags & TIMER_ABSTIME)
+	if (flags & TIMER_ABSTIME) {
+		monotonic_ve_to_abs(which_clock, &t);
 		rmtp = NULL;
+	}
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
 
@@ -1260,8 +1308,10 @@ COMPAT_SYSCALL_DEFINE4(clock_nanosleep, clockid_t, which_clock, int, flags,
 
 	if (!timespec64_valid(&t))
 		return -EINVAL;
-	if (flags & TIMER_ABSTIME)
+	if (flags & TIMER_ABSTIME) {
+		monotonic_ve_to_abs(which_clock, &t);
 		rmtp = NULL;
+	}
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
 


More information about the Devel mailing list