[Devel] [PATCH vz7] sched/fair: cancel per-cfs_rq active_timer before task_group teardown

Fri May 15 11:35:02 MSK 2026

Reviewed-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

On 4/22/26 22:32, Konstantin Khorenko wrote:
> The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is armed by
> dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement
> when a task goes to sleep.  Its callback sched_cfs_active_timer()
> dereferences cfs_rq->tg.
> 
> When a task group is destroyed, unregister_fair_sched_group() tears
> down the per-CPU cfs_rq structures but never cancels the active_timer.
> The caller, sched_offline_group(), then proceeds to list_del_rcu() the

Technically it is the caller of sched_offline_group(), right? "Offline"
normally does not free anything. The actual free is done by either
cpu_cgroup_css_free() or autogroup_destroy().

I don't see any mechanism that prevents the "free" part from happening
while the timer is still armed (I don't think a cg/css refcount is taken).
So the motivation looks correct.

> task_group and schedules free_sched_group_rcu() via call_rcu(), which
> eventually kfree()s the task_group.  If the timer fires during or
> after that sequence, the callback performs atomic_dec() through a
> dangling cfs_rq->tg pointer.  All three relevant objects - cfs_rq,
> sched_entity, and task_group - live in the kmalloc-1k slab cache, so
> once the slot is reused the atomic_dec() lands on an arbitrary kernel
> address and silently corrupts memory.
> 
> Fix this by cancelling the active_timer in unregister_fair_sched_group()
> before the teardown proceeds.  The cancellation:
> 
>   - goes before the on_list early-return: active_timer state is
>     independent of leaf-list membership, and the timer is always
>     initialized in init_cfs_rq_runtime(), so hrtimer_cancel() is safe
>     unconditionally;
> 
>   - stays outside the rq lock: the callback itself takes that lock,
>     so calling hrtimer_cancel() under it would deadlock.
> 
> hrtimer_cancel() blocks until a running callback has fully returned
> from its fn() (base->running is cleared only after fn() completes),
> so after this call neither a pending nor an in-flight callback can
> still be racing with the teardown.  Since the only path to kfree(tg)
> goes through unregister_fair_sched_group() first, this closes the UAF
> on its own; no additional serialization of the atomic_dec() against
> the teardown is needed.
> 
> Fixes: f3fec68860fb ("ve/sched: port vcpu hotslice")
> https://virtuozzo.atlassian.net/browse/PSBM-161930
> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
> ---
>  kernel/sched/fair.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 552d288d81648..b5a7e9f72e6d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8544,6 +8544,22 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu)
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long flags;
>  
> +#ifdef CONFIG_CFS_CPULIMIT
> +	/*
> +	 * Cancel the per-cfs_rq active_timer before the tg/cfs_rq memory
> +	 * can be freed.  The callback dereferences cfs_rq->tg, so failing
> +	 * to cancel would leave a use-after-free window once the tg is
> +	 * freed via the RCU callback that follows this teardown.
> +	 * hrtimer_cancel() blocks until a running callback has fully
> +	 * returned, which is sufficient on its own: the only path to
> +	 * kfree(tg) goes through this function first.
> +	 * Must be done outside the rq lock - the callback acquires it.
> +	 * Active_timer is always initialized in init_cfs_rq_runtime(), so
> +	 * hrtimer_cancel() is safe regardless of the on_list state below.
> +	 */
> +	hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
> +#endif
> +
>  	/*
>  	* Only empty task groups can be destroyed; so we can speculatively
>  	* check on_list without danger of it being re-added.
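
For readers following the thread, the ordering argument can be sketched
as a toy userspace model. This is only an illustration under stated
assumptions, not kernel code: every toy_* name is hypothetical, the
"free" is modelled as a flag rather than a real kfree(), and the timer is
a plain boolean, so it covers a pending timer only and says nothing about
a callback that is already running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT the kernel's struct task_group / struct cfs_rq. */
struct toy_tg {
	int  nr_cpus_active;   /* models tg->nr_cpus_active */
	bool freed;            /* set instead of actually freeing memory */
};

struct toy_cfs_rq {
	struct toy_tg *tg;
	bool timer_armed;      /* models a pending active_timer */
};

/* Models the timer callback: returns true if it touched a "freed" tg. */
static bool toy_timer_fire(struct toy_cfs_rq *cfs_rq)
{
	assert(cfs_rq->tg != NULL);
	if (!cfs_rq->timer_armed)
		return false;          /* cancelled or never armed: nothing runs */
	cfs_rq->timer_armed = false;
	if (cfs_rq->tg->freed)
		return true;           /* this would be the UAF in the kernel */
	cfs_rq->tg->nr_cpus_active--;
	return false;
}

/* Models cancelling a pending timer: the callback will not run later. */
static void toy_timer_cancel(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->timer_armed = false;
}

/* Models teardown; cancel_first selects the ordering the patch proposes. */
static void toy_teardown(struct toy_cfs_rq *cfs_rq, bool cancel_first)
{
	if (cancel_first)
		toy_timer_cancel(cfs_rq);
	cfs_rq->tg->freed = true;  /* stands in for the eventual kfree(tg) */
}

/* One scenario: timer armed, group torn down, then the timer expires. */
static bool toy_scenario_hits_uaf(bool cancel_first)
{
	struct toy_tg tg = { .nr_cpus_active = 1, .freed = false };
	struct toy_cfs_rq cfs_rq = { .tg = &tg, .timer_armed = true };

	toy_teardown(&cfs_rq, cancel_first);
	return toy_timer_fire(&cfs_rq);
}
```

In this model toy_scenario_hits_uaf(false) reports the stale access and
toy_scenario_hits_uaf(true) does not, which is the whole point of
cancelling before the teardown proceeds.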

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.