[Devel] [PATCH RHEL COMMIT] ve/posix-timers: reference ve monotonic clock from ve start (v2)

Konstantin Khorenko khorenko at virtuozzo.com
Mon Oct 4 16:34:36 MSK 2021


reverted

https://jira.sw.ru/browse/PSBM-134393

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 01.10.2021 19:38, Konstantin Khorenko wrote:
> The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
> after ark-5.14
> ------>
> commit ea99ce6ef1e0d2ec464b015ab3c76a58a4a907f1
> Author: Kirill Tkhai <ktkhai at parallels.com>
> Date:   Fri Oct 1 19:38:40 2021 +0300
> 
>      ve/posix-timers: reference ve monotonic clock from ve start (v2)
>      
>      So that CLOCK_MONOTONIC will be monotonic even if ve is migrated to
>      another hw node.
>      
>      Note, translating ve <-> abs time in clock_settime and timer_settime is
>      not necessary because (1) clock_settime won't set monotonic clock and
>      (2) timer_gettime always returns relative time.
>      
>      https://jira.sw.ru/browse/PSBM-13860
>      
>      diff-posix_timers-reference-ct-monotonic-clock-from-ct-start
>      
>      Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
>      
>      Acked-by: Pavel Emelyanov <xemul at parallels.com>
>      Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
>      
>      +++
>      ve/posix-timers: reference ve monotonic clock from start in clock_nanosleep
>      
>      This is an addition to diff-posix_timers-reference-ct-monotonic-clock-from-ct-start
>      
>      Otherwise, apps that use sys_clock_nanosleep() to suspend their
>      execution can hang after ve migration.
>      
>      diff-posix-timers-reference-ve-monotonic-clock-from-ve-start-in-clock_nanosleep
>      
>      Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
>      
>      Acked-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
>      Acked-by: Pavel Emelyanov <xemul at parallels.com>
>      Signed-off-by: Kirill Tkhai <ktkhai at parallels.com>
>      
>      +++
>      timers: Port diff-ve-timers-convert-ve-monotonic-to-abs-time-when-setting-timerfd-2
>      
>      Need this for docker, as sometimes systemd-tmpfiles-clean.timer inside
>      a PCS7 CT is spamming dbus with requests to start corresponding service.
>      And at the same time Docker tries to create cgroup for container and
>      attach it to hierarchies like memory and blkio.
>      
>      That is because systemd timer was triggered with non-virtualized timerfd
>      using plain host clock but check that timer is successfull uses
>      virtualized clock_gettime and don't pass before proper(in-container)
>      timer activation. And timers charges again and again starts service
>      got in busy loop.
>      
>      https://jira.sw.ru/browse/PSBM-34017
>      
>      v2: move the stubs to ve.h
>      
>      Port the following RH6 commit:
>      
>        Author: Vladimir Davydov
>        Email: vdavydov at parallels.com
>        Subject: fs: convert ve monotonic to abs time when setting timerfd
>        Date: Fri, 15 Feb 2013 11:57:09 +0400
>      
>        * [timers] corrected TFD_TIMER_ABSTIME timer handling,
>          the issue led to high cpu usage inside a Fedora 18 CT
>          by 'init' process (PSBM-18284)
>      
>        Monotonic time inside container, as it can be obtained using various
>        system calls such as clock_gettime, is reported since start of the container,
>        not since start of the whole system. This was made in order to avoid time
>        issues while a container is migrated between different physical hosts, but this
>        also introduced a lot of problems in time- related system calls because
>        absolute monotonic time, which is in fact relative to container, passed to those
>        system calls must be converted to system-wide monotonic time, which is used by
>        kernel hrtimers.
>      
>        One of those buggy system calls is timerfd_settime which accepts as an
>        argument absolute time if flag TFD_TIMER_ABSTIME is specified.
>      
>        The patch fixes it by converting container monotonic time to system-
>        wide monotonic time using the monotonic_ve_to_abs() function, which was
>        introduced earlier and is now exported for that reason.
>      
>        https://jira.sw.ru/browse/PSBM-18284
>      
>        Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
>      
>      Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
>      
>      Signed-off-by: Kirill Tkhai <ktkhai at odin.com>
>      
>      Reviewed-by: Vladimir Davydov <vdavydov at parallels.com>
>      
>      (cherry picked from vz7 commit 869542c24c41c0578b47d2ef83cfa63427e0e5e1)
>      Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
>      
>      +++
>      timers should not get negative argument
>      
>      This patch fixes 25-sec delay on login into systemd based containers.
>      
>      Userspace application can set timer for past
>      and expect that the timer will be expired immediately.
>      
>      This can do not work as expected inside migrated containers.
>      Translated argument provided to timer can become negative,
>      and according timer will sleep a very long time.
>      
>      https://jira.sw.ru/browse/PSBM-48475
>      
>      CC: Vladimir Davydov <vdavydov at virtuozzo.com>
>      CC: Konstantin Khorenko <khorenko at virtuozzo.com>
>      Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
>      
>      Acked-by: Cyrill Gorcunov <gorcunov at openvz.org>
>      
>      (cherry picked from vz7 commit a71fa19facb00472e47760255ab2e6fa16885732)
>      Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
>      
>      (cherry-picked from vz8 commit 4dad672974ab ("ve/posix-timers: reference
>      ve monotonic clock from ve start (v2)"))
>      
>      Signed-off-by: Nikita Yushchenko <nikita.yushchenko at virtuozzo.com>
> ---
>   fs/timerfd.c               |  8 +++++--
>   include/linux/ve.h         |  8 +++++++
>   kernel/time/posix-timers.c | 54 ++++++++++++++++++++++++++++++++++++++++++++--
>   3 files changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/timerfd.c b/fs/timerfd.c
> index c5509d2448e3..4f782f3690d0 100644
> --- a/fs/timerfd.c
> +++ b/fs/timerfd.c
> @@ -27,6 +27,7 @@
>   #include <linux/compat.h>
>   #include <linux/rcupdate.h>
>   #include <linux/time_namespace.h>
> +#include <linux/ve.h>
>   
>   struct timerfd_ctx {
>   	union {
> @@ -435,8 +436,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
>   	return ufd;
>   }
>   
> -static int do_timerfd_settime(int ufd, int flags,
> -		const struct itimerspec64 *new,
> +static int do_timerfd_settime(int ufd, int flags,
> +		struct itimerspec64 *new,
>   		struct itimerspec64 *old)
>   {
>   	struct fd f;
> @@ -500,6 +501,9 @@ static int do_timerfd_settime(int ufd, int flags,
>   	/*
>   	 * Re-program the timer to the new value ...
>   	 */
> +	if ((flags & TFD_TIMER_ABSTIME) &&
> +	    (new->it_value.tv_sec || new->it_value.tv_nsec))
> +		monotonic_ve_to_abs(ctx->clockid, &new->it_value);
>   	ret = timerfd_setup(ctx, flags, new);
>   
>   	spin_unlock_irq(&ctx->wqh.lock);
> diff --git a/include/linux/ve.h b/include/linux/ve.h
> index 552fa577e2f9..19a590bc86d4 100644
> --- a/include/linux/ve.h
> +++ b/include/linux/ve.h
> @@ -97,6 +97,9 @@ static inline struct ve_struct *css_to_ve(struct cgroup_subsys_state *css)
>   
>   extern struct cgroup_subsys_state *ve_get_init_css(struct ve_struct *ve, int subsys_id);
>   
> +extern void monotonic_abs_to_ve(clockid_t which_clock, struct timespec64 *tp);
> +extern void monotonic_ve_to_abs(clockid_t which_clock, struct timespec64 *tp);
> +
>   #define ve_feature_set(ve, f)			\
>   	!!((ve)->features & VE_FEATURE_##f)
>   
> @@ -140,6 +143,11 @@ static inline struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
>   static inline int vz_security_family_check(struct net *net, int family, int type) { return 0; }
>   static inline int vz_security_protocol_check(struct net *net, int protocol) { return 0; }
>   
> +static inline void monotonic_abs_to_ve(clockid_t which_clock,
> +				       struct timespec64 *tp) { }
> +static inline void monotonic_ve_to_abs(clockid_t which_clock,
> +				       struct timepsec64 *tp) { }
> +
>   #endif	/* CONFIG_VE */
>   
>   #endif /* _LINUX_VE_H */
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index 7363f81dc31a..3f91756f5421 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -31,6 +31,7 @@
>   #include <linux/compat.h>
>   #include <linux/nospec.h>
>   #include <linux/time_namespace.h>
> +#include <linux/ve.h>
>   
>   #include "timekeeping.h"
>   #include "posix-timers.h"
> @@ -56,6 +57,46 @@ static const struct k_clock * const posix_clocks[];
>   static const struct k_clock *clockid_to_kclock(const clockid_t id);
>   static const struct k_clock clock_realtime, clock_monotonic;
>   
> +#define clock_is_monotonic(which_clock) \
> +	((which_clock) == CLOCK_MONOTONIC || \
> +	 (which_clock) == CLOCK_MONOTONIC_RAW || \
> +	 (which_clock) == CLOCK_MONOTONIC_COARSE)
> +
> +#ifdef CONFIG_VE
> +static struct timespec64 zero_time;
> +
> +void monotonic_abs_to_ve(clockid_t which_clock, struct timespec64 *tp)
> +{
> +	struct timespec64 offset;
> +
> +	if (!clock_is_monotonic(which_clock))
> +		return;
> +
> +	offset = ns_to_timespec64(get_exec_env()->start_time);
> +	set_normalized_timespec64(tp,
> +				  tp->tv_sec - offset.tv_sec,
> +				  tp->tv_nsec - offset.tv_nsec);
> +}
> +
> +void monotonic_ve_to_abs(clockid_t which_clock, struct timespec64 *tp)
> +{
> +	struct timespec64 offset;
> +
> +	if (!clock_is_monotonic(which_clock))
> +		return;
> +
> +	offset = ns_to_timespec64(get_exec_env()->start_time);
> +	set_normalized_timespec64(tp,
> +				  tp->tv_sec + offset.tv_sec,
> +				  tp->tv_nsec + offset.tv_nsec);
> +
> +	if (timespec64_compare(tp, &zero_time) <= 0) {
> +		tp->tv_sec =  0;
> +		tp->tv_nsec = 1;
> +	}
> +}
> +#endif
> +
>   /*
>    * we assume that the new SIGEV_THREAD_ID shares no bits with the other
>    * SIGEV values.  Here we put out an error if this assumption fails.
> @@ -916,6 +957,9 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
>   	if (!timr)
>   		return -EINVAL;
>   
> +	if ((flags & TIMER_ABSTIME) &&
> +	    (new_spec64->it_value.tv_sec || new_spec64->it_value.tv_nsec))
> +		monotonic_ve_to_abs(timr->it_clock, &new_spec64->it_value);
>   	kc = timr->kclock;
>   	if (WARN_ON_ONCE(!kc || !kc->timer_set))
>   		error = -EINVAL;
> @@ -1091,6 +1135,7 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
>   
>   	error = kc->clock_get_timespec(which_clock, &kernel_tp);
>   
> +	monotonic_abs_to_ve(which_clock, &kernel_tp);
>   	if (!error && put_timespec64(&kernel_tp, tp))
>   		error = -EFAULT;
>   
> @@ -1173,6 +1218,7 @@ SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock,
>   
>   	err = kc->clock_get_timespec(which_clock, &ts);
>   
> +	monotonic_abs_to_ve(which_clock, &ts);
>   	if (!err && put_old_timespec32(&ts, tp))
>   		err = -EFAULT;
>   
> @@ -1259,8 +1305,10 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags,
>   
>   	if (!timespec64_valid(&t))
>   		return -EINVAL;
> -	if (flags & TIMER_ABSTIME)
> +	if (flags & TIMER_ABSTIME) {
> +		monotonic_ve_to_abs(which_clock, &t);
>   		rmtp = NULL;
> +	}
>   	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
>   	current->restart_block.nanosleep.rmtp = rmtp;
>   
> @@ -1286,8 +1334,10 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags,
>   
>   	if (!timespec64_valid(&t))
>   		return -EINVAL;
> -	if (flags & TIMER_ABSTIME)
> +	if (flags & TIMER_ABSTIME) {
> +		monotonic_ve_to_abs(which_clock, &t);
>   		rmtp = NULL;
> +	}
>   	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
>   	current->restart_block.nanosleep.compat_rmtp = rmtp;
>   
> .
> 


More information about the Devel mailing list