[Devel] [PATCH vz9 v2] sched/fair: cancel per-cfs_rq active_timer before task_group teardown

Wed Apr 22 23:50:13 MSK 2026

The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is armed by
dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement
when a task goes to sleep.  Its callback sched_cfs_active_timer()
dereferences cfs_rq->tg.

When a task group is destroyed, unregister_fair_sched_group() tears
down the per-CPU cfs_rq structures but never cancels the active_timer.
The teardown chain is:

    sched_destroy_group()
      call_rcu()
        sched_unregister_group_rcu()
          sched_unregister_group()
            unregister_fair_sched_group()
            call_rcu()
              sched_free_group_rcu()
                sched_free_group()
                  kmem_cache_free(task_group_cache, tg)

If the timer fires during or after that sequence, the callback
performs atomic_dec() through a dangling cfs_rq->tg pointer.  All
three relevant objects - cfs_rq, sched_entity, and task_group - live
in the kmalloc-1k slab cache, so once the slot is reused the
atomic_dec() lands on an arbitrary kernel address and silently
corrupts memory.

This was observed as a hard lockup and a NULL-pointer oops in
enqueue_task_fair() during task wakeup.  The se->parent pointer, at
offset 128 in a sched_entity, was corrupted because cfs_rq->skip sits
at the same offset (also 128) within the slab slot: a classic
cross-type UAF in a shared slab cache.

Fix this by cancelling the active_timer in unregister_fair_sched_group()
before the teardown proceeds.  The cancellation:

  - goes before the on_list early-continue: active_timer state is
    independent of leaf-list membership, and the timer is always
    initialized in init_cfs_rq_runtime(), so hrtimer_cancel() is safe
    unconditionally;

  - stays outside the rq lock: the callback itself takes that lock,
    so calling hrtimer_cancel() under it would deadlock.

hrtimer_cancel() blocks until a running callback has fully returned
from its fn() (base->running is cleared only after fn() completes),
so after this call neither a pending nor an in-flight callback can
still be racing with the teardown.  Since the only path to
kmem_cache_free(task_group_cache) goes through
unregister_fair_sched_group() first, this closes the UAF on its own;
no additional serialization of the atomic_dec() against the teardown
is needed.
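
For illustration only, the cancel-before-free ordering can be modelled
in userspace C.  This is a minimal sketch, not kernel code: every
model_* name is hypothetical, the "timer" is a plain flag, and the
model is single-threaded, so it covers only the pending-timer case.
Waiting for an in-flight callback, which the real hrtimer_cancel()
also does, is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_tg {
	int nr_cpus_active;
};

struct model_timer {
	bool armed;		/* true while an expiry is pending */
	struct model_tg *tg;	/* what the callback dereferences */
};

static int callbacks_run;

/* Stand-in for the timer expiry: decrements a counter through ->tg. */
static void model_timer_fire(struct model_timer *t)
{
	if (!t->armed)
		return;		/* cancelled or never armed: no deref */
	t->armed = false;
	t->tg->nr_cpus_active--;
	callbacks_run++;
}

/* Stand-in for a cancel: returns 1 if the timer was pending, else 0. */
static int model_timer_cancel(struct model_timer *t)
{
	int was_armed = t->armed;

	t->armed = false;
	return was_armed;
}

/* Teardown in the order the patch establishes: cancel, then free. */
static void model_teardown(struct model_timer *t)
{
	model_timer_cancel(t);
	free(t->tg);
	t->tg = NULL;
}

/* Returns how many callbacks ran after teardown. */
static int model_run(void)
{
	struct model_timer t = { .armed = false, .tg = NULL };

	t.tg = malloc(sizeof(*t.tg));
	assert(t.tg != NULL);
	t.tg->nr_cpus_active = 1;
	t.armed = true;		/* decrement has been deferred */

	callbacks_run = 0;
	model_teardown(&t);
	model_timer_fire(&t);	/* late expiry: must be a no-op */
	return callbacks_run;
}
```

In this model, dropping the model_timer_cancel() call from
model_teardown() would let the late expiry dereference the freed tg,
which is the window the patch closes.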

Fixes: 831465734a10 ("sched: Port CONFIG_CFS_CPULIMIT feature")
https://virtuozzo.atlassian.net/browse/VSTOR-126785

Feature: sched: ability to limit number of CPUs available to a CT
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
 kernel/sched/fair.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab2a890cccd4e..202d9b0e08fb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13023,6 +13023,23 @@ void unregister_fair_sched_group(struct task_group *tg)
 	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
 	for_each_possible_cpu(cpu) {
+#ifdef CONFIG_CFS_CPULIMIT
+		/*
+		 * Cancel the per-cfs_rq active_timer before the tg/cfs_rq
+		 * memory can be freed.  The callback dereferences cfs_rq->tg,
+		 * so failing to cancel would leave a use-after-free window
+		 * once the tg is freed via the RCU callback that follows
+		 * this teardown.  hrtimer_cancel() blocks until a running
+		 * callback has fully returned, which is sufficient on its
+		 * own: the only path to kmem_cache_free(task_group_cache)
+		 * goes through this function first.
+		 * Must be done outside the rq lock - the callback acquires
+		 * it.  The active_timer is always initialized in
+		 * init_cfs_rq_runtime(), so hrtimer_cancel() is safe
+		 * regardless of the on_list state below.
+		 */
+		hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
+#endif
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
 
-- 
2.43.0