[Devel] [PATCH vz7] sched/fair: cancel per-cfs_rq active_timer before task_group teardown
Konstantin Khorenko
khorenko at virtuozzo.com
Wed Apr 22 23:32:08 MSK 2026
The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is armed by
dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement
when a task goes to sleep. Its callback sched_cfs_active_timer()
dereferences cfs_rq->tg.

When a task group is destroyed, unregister_fair_sched_group() tears
down the per-CPU cfs_rq structures but never cancels the active_timer.
The caller, sched_offline_group(), then proceeds to list_del_rcu() the
task_group and schedules free_sched_group_rcu() via call_rcu(), which
eventually kfree()s the task_group. If the timer fires during or
after that sequence, the callback performs atomic_dec() through a
dangling cfs_rq->tg pointer. All three relevant objects - cfs_rq,
sched_entity, and task_group - live in the kmalloc-1k slab cache, so
once the slot is reused the atomic_dec() lands on an arbitrary kernel
address and silently corrupts memory.
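
For illustration only (not part of the patch): a minimal userspace C
model of the ordering problem described above. All model_* names and
the "freed" flag are hypothetical stand-ins; the real code uses an
hrtimer, atomic_dec() and kfree(), none of which appear here. The model
marks the group as freed instead of releasing memory, so the stale
access can be counted rather than actually performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_tg {
	int  nr_cpus_active;	/* stands in for tg->nr_cpus_active */
	bool freed;		/* set instead of really freeing the object */
};

struct model_cfs_rq {
	struct model_tg *tg;	/* back-pointer the callback dereferences */
	bool timer_armed;	/* stands in for a queued active_timer */
};

static int stale_accesses;	/* callbacks that ran against a freed tg */

/* Models the timer callback: the deferred decrement through cfs_rq->tg. */
static void model_timer_fn(struct model_cfs_rq *cfs_rq)
{
	assert(cfs_rq->tg != NULL);

	if (cfs_rq->tg->freed)
		stale_accesses++;	/* in the kernel: the UAF write */
	else
		cfs_rq->tg->nr_cpus_active--;
}

/* Models timer expiry: the callback runs only if the timer is still armed. */
static void model_timer_expire(struct model_cfs_rq *cfs_rq)
{
	if (!cfs_rq->timer_armed)
		return;
	cfs_rq->timer_armed = false;
	model_timer_fn(cfs_rq);
}

/*
 * Models the teardown with the timer firing late.
 * Returns how many callbacks touched the group after it was "freed".
 */
static int model_teardown(bool cancel_first)
{
	struct model_tg tg = { .nr_cpus_active = 1, .freed = false };
	struct model_cfs_rq cfs_rq = { .tg = &tg, .timer_armed = true };

	stale_accesses = 0;

	if (cancel_first)
		cfs_rq.timer_armed = false;	/* models cancelling the timer */

	tg.freed = true;			/* models the later free of tg */
	model_timer_expire(&cfs_rq);		/* timer fires after teardown */

	return stale_accesses;
}
```

In this model, model_teardown(false) reports one stale access and
model_teardown(true) reports none, which is the whole effect the
cancellation is meant to have.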

Fix this by cancelling the active_timer in unregister_fair_sched_group()
before the teardown proceeds. The cancellation:
 - goes before the on_list early-return: active_timer state is
   independent of leaf-list membership, and the timer is always
   initialized in init_cfs_rq_runtime(), so hrtimer_cancel() is safe
   unconditionally;
 - stays outside the rq lock: the callback itself takes that lock,
   so calling hrtimer_cancel() under it would deadlock.

hrtimer_cancel() blocks until a running callback has fully returned
from its fn() (base->running is cleared only after fn() completes),
so after this call neither a pending nor an in-flight callback can
still be racing with the teardown. Since the only path to kfree(tg)
goes through unregister_fair_sched_group() first, this closes the UAF
on its own; no additional serialization of the atomic_dec() against
the teardown is needed.
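
For illustration only (not part of the patch): a toy C state model of
the two properties argued above, namely that cancelling waits for an
in-flight callback and that doing so while holding the lock the
callback needs cannot make progress. Every model_* name is
hypothetical, and the model only encodes the behaviour as this message
describes it; it does not call or reimplement the hrtimer API.

```c
#include <assert.h>
#include <stdbool.h>

enum model_cancel_result {
	MODEL_CANCEL_OK,
	MODEL_CANCEL_WOULD_DEADLOCK,
};

/* Hypothetical timer state; not the kernel's struct hrtimer. */
struct model_timer {
	bool queued;		/* pending, callback not started yet */
	bool callback_running;	/* callback entered and needs the rq lock */
};

/*
 * Models a cancel that dequeues a pending timer and waits for a running
 * callback to return. If the caller holds the lock the callback needs,
 * the callback can never finish, so the wait can never end.
 */
static enum model_cancel_result
model_cancel(struct model_timer *t, bool caller_holds_rq_lock)
{
	t->queued = false;	/* a pending timer is simply dequeued */

	if (t->callback_running) {
		if (caller_holds_rq_lock)
			return MODEL_CANCEL_WOULD_DEADLOCK;
		/* Callback takes the lock, finishes and returns. */
		t->callback_running = false;
	}

	assert(!t->queued);
	return MODEL_CANCEL_OK;
}

/* True once nothing can run the callback any more, i.e. freeing is safe. */
static bool model_timer_quiescent(const struct model_timer *t)
{
	return !t->queued && !t->callback_running;
}
```

A caller in this model would invoke model_cancel(&t, false) and free
the owning object only once model_timer_quiescent(&t) holds, mirroring
the cancel-then-teardown order of the fix.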
Fixes: f3fec68860fb ("ve/sched: port vcpu hotslice")
https://virtuozzo.atlassian.net/browse/PSBM-161930
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 552d288d81648..b5a7e9f72e6d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8544,6 +8544,22 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
+#ifdef CONFIG_CFS_CPULIMIT
+	/*
+	 * Cancel the per-cfs_rq active_timer before the tg/cfs_rq memory
+	 * can be freed. The callback dereferences cfs_rq->tg, so failing
+	 * to cancel would leave a use-after-free window once the tg is
+	 * freed via the RCU callback that follows this teardown.
+	 * hrtimer_cancel() blocks until a running callback has fully
+	 * returned, which is sufficient on its own: the only path to
+	 * kfree(tg) goes through this function first.
+	 * Must be done outside the rq lock - the callback acquires it.
+	 * Active_timer is always initialized in init_cfs_rq_runtime(), so
+	 * hrtimer_cancel() is safe regardless of the on_list state below.
+	 */
+	hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
+#endif
+
 	/*
 	 * Only empty task groups can be destroyed; so we can speculatively
 	 * check on_list without danger of it being re-added.
--
2.43.0