[Devel] [PATCH vz9 1/2] sched/fair: fix use-after-free in CFS CPULIMIT active_timer on task group teardown

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Wed Mar 18 12:49:31 MSK 2026



On 3/13/26 20:35, Konstantin Khorenko wrote:
> The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is started by
> dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement.
> Its callback sched_cfs_active_timer() dereferences cfs_rq->tg.
> 
> When a task group is destroyed, unregister_fair_sched_group() tears
> down the per-CPU cfs_rq structures but never cancels the active_timer.
> If the timer fires after the cfs_rq and task_group memory have been
> freed and reallocated (all three — cfs_rq, sched_entity, and
> task_group — live in the kmalloc-1k slab cache), the callback performs
> atomic_dec() on an arbitrary kernel address, corrupting memory.
> 
> This was observed as a hard lockup + NULL-pointer Oops in
> enqueue_task_fair() during task wakeup: the se->parent pointer at
> offset 128 in a sched_entity was corrupted because it shares the same
> slab offset as cfs_rq->skip (also offset 128) — a classic cross-type
> UAF in a shared slab cache.
> 
> Fix this in two ways:
> 
> 1. Cancel the active_timer in unregister_fair_sched_group() before the
>    cfs_rq is freed. The cancel must happen outside the rq lock because
>    the timer callback sched_cfs_active_timer() acquires it.
> 
> 2. Move the atomic_dec(&cfs_rq->tg->nr_cpus_active) inside the rq lock
>    in sched_cfs_active_timer(). In the original code the callback
>    releases the rq lock before executing atomic_dec, creating a window
>    where the teardown path can run between the unlock and the
>    atomic_dec:
> 
>      CPU B (timer callback)              CPU A (teardown)
>      ──────────────────────              ────────────────
>      sched_cfs_active_timer()
>        raw_spin_rq_lock()
>        cfs_rq->active = ...
>        raw_spin_rq_unlock()
>        ← lock released, atomic_dec
>          not yet executed
>                                          unregister_fair_sched_group()
>                                            raw_spin_rq_lock()
>                                            list_del_leaf_cfs_rq()

This explanation and stack seem incorrect: the first part of the fix already adds hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer); before list_del_leaf_cfs_rq(), and hrtimer_cancel() waits for the timer callback sched_cfs_active_timer() to finish before teardown continues and frees the tg.

So the second part of the fix is redundant.

>                                            raw_spin_rq_unlock()
>                                          sched_free_group()
>                                            kfree(tg)
>        atomic_dec(&cfs_rq->tg->...)  ← UAF! tg already freed
> 
>    With atomic_dec inside the rq lock, teardown's raw_spin_rq_lock()
>    in unregister_fair_sched_group() cannot proceed until the callback
>    has completed all accesses to cfs_rq->tg, eliminating the window.
> 
>    Note: hrtimer_cancel() (fix #1) also independently closes this
>    race — the hrtimer core keeps base->running == timer until fn()
>    returns, so hrtimer_cancel() waits for the callback to fully
>    complete, including atomic_dec.  Moving atomic_dec under the lock
>    makes the serialization explicit via the lock protocol without
>    relying on hrtimer internals.
> 
> https://virtuozzo.atlassian.net/browse/VSTOR-126785
> 
> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
> 
> Feature: sched: ability to limit number of CPUs available to a CT
> ---
>  kernel/sched/fair.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cc5dceb9c815f..9b0fe4c8a272f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -597,9 +597,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
>  
>  	raw_spin_rq_lock_irqsave(rq, flags);
>  	cfs_rq->active = !list_empty(&cfs_rq->tasks);
> -	raw_spin_rq_unlock_irqrestore(rq, flags);
> -
>  	atomic_dec(&cfs_rq->tg->nr_cpus_active);
> +	raw_spin_rq_unlock_irqrestore(rq, flags);
>  
>  	return HRTIMER_NORESTART;
>  }
> @@ -13020,6 +13019,16 @@ void unregister_fair_sched_group(struct task_group *tg)
>  	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
>  
>  	for_each_possible_cpu(cpu) {
> +#ifdef CONFIG_CFS_CPULIMIT
> +		/*
> +		 * Cancel the per-cfs_rq active timer before freeing.
> +		 * The callback dereferences cfs_rq->tg, so failing to
> +		 * cancel leads to use-after-free once the tg is freed.
> +		 * Must be done outside the rq lock since the callback
> +		 * acquires it.
> +		 */
> +		hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
> +#endif
>  		if (tg->se[cpu])
>  			remove_entity_load_avg(tg->se[cpu]);
>  

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.


