[Devel] [PATCH vz9 1/2] sched/fair: fix use-after-free in CFS CPULIMIT active_timer on task group teardown
Pavel Tikhomirov
ptikhomirov at virtuozzo.com
Wed Mar 18 12:49:31 MSK 2026
On 3/13/26 20:35, Konstantin Khorenko wrote:
> The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is started by
> dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement.
> Its callback sched_cfs_active_timer() dereferences cfs_rq->tg.
>
> When a task group is destroyed, unregister_fair_sched_group() tears
> down the per-CPU cfs_rq structures but never cancels the active_timer.
> If the timer fires after the cfs_rq and task_group memory have been
> freed and reallocated (all three — cfs_rq, sched_entity, and
> task_group — live in the kmalloc-1k slab cache), the callback performs
> atomic_dec() on an arbitrary kernel address, corrupting memory.
>
> This was observed as a hard lockup + NULL-pointer Oops in
> enqueue_task_fair() during task wakeup: the se->parent pointer at
> offset 128 in a sched_entity was corrupted because it shares the same
> slab offset as cfs_rq->skip (also offset 128) — a classic cross-type
> UAF in a shared slab cache.
>
> Fix this in two ways:
>
> 1. Cancel the active_timer in unregister_fair_sched_group() before the
> cfs_rq is freed. The cancel must happen outside the rq lock because
> the timer callback sched_cfs_active_timer() acquires it.
>
> 2. Move the atomic_dec(&cfs_rq->tg->nr_cpus_active) inside the rq lock
> in sched_cfs_active_timer(). In the original code the callback
> releases the rq lock before executing atomic_dec, creating a window
> where the teardown path can run between the unlock and the
> atomic_dec:
>
> CPU B (timer callback) CPU A (teardown)
> ────────────────────── ────────────────
> sched_cfs_active_timer()
> raw_spin_rq_lock()
> cfs_rq->active = ...
> raw_spin_rq_unlock()
> ← lock released, atomic_dec
> not yet executed
> unregister_fair_sched_group()
> raw_spin_rq_lock()
> list_del_leaf_cfs_rq()
This explanation and the race diagram seem incorrect: the first part of the fix already adds hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer) before list_del_leaf_cfs_rq(), and hrtimer_cancel() waits for an in-flight sched_cfs_active_timer() callback to finish before teardown can continue and free the tg.
So the second part of the fix is redundant.
> raw_spin_rq_unlock()
> sched_free_group()
> kfree(tg)
> atomic_dec(&cfs_rq->tg->...) ← UAF! tg already freed
>
> With atomic_dec inside the rq lock, teardown's raw_spin_rq_lock()
> in unregister_fair_sched_group() cannot proceed until the callback
> has completed all accesses to cfs_rq->tg, eliminating the window.
>
> Note: hrtimer_cancel() (fix #1) also independently closes this
> race — the hrtimer core keeps base->running == timer until fn()
> returns, so hrtimer_cancel() waits for the callback to fully
> complete, including atomic_dec. Moving atomic_dec under the lock
> makes the serialization explicit via the lock protocol without
> relying on hrtimer internals.
>
> https://virtuozzo.atlassian.net/browse/VSTOR-126785
>
> Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
>
> Feature: sched: ability to limit number of CPUs available to a CT
> ---
> kernel/sched/fair.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cc5dceb9c815f..9b0fe4c8a272f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -597,9 +597,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
>
> raw_spin_rq_lock_irqsave(rq, flags);
> cfs_rq->active = !list_empty(&cfs_rq->tasks);
> - raw_spin_rq_unlock_irqrestore(rq, flags);
> -
> atomic_dec(&cfs_rq->tg->nr_cpus_active);
> + raw_spin_rq_unlock_irqrestore(rq, flags);
>
> return HRTIMER_NORESTART;
> }
> @@ -13020,6 +13019,16 @@ void unregister_fair_sched_group(struct task_group *tg)
> destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> for_each_possible_cpu(cpu) {
> +#ifdef CONFIG_CFS_CPULIMIT
> + /*
> + * Cancel the per-cfs_rq active timer before freeing.
> + * The callback dereferences cfs_rq->tg, so failing to
> + * cancel leads to use-after-free once the tg is freed.
> + * Must be done outside the rq lock since the callback
> + * acquires it.
> + */
> + hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
> +#endif
> if (tg->se[cpu])
> remove_entity_load_avg(tg->se[cpu]);
>
--
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.