[Devel] [PATCH vz10 v2 2/2] sched: Support nr_cpus in cgroup2 as well

Wed Mar 18 17:48:42 MSK 2026

On 3/17/26 09:33, Dmitry Sepp wrote:
> Make the control available for the cgroup2 hierarchy as well.
> 
> https://virtuozzo.atlassian.net/browse/VSTOR-124385
> 
> Signed-off-by: Dmitry Sepp <dmitry.sepp at virtuozzo.com>
> ---
>   kernel/sched/core.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f66ee9d07387..3b13fd3a3f7a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10431,6 +10431,13 @@ static struct cftype cpu_files[] = {
>   		.seq_show = cpu_uclamp_max_show,
>   		.write = cpu_uclamp_max_write,
>   	},
> +#endif
> +#ifdef CONFIG_CFS_CPULIMIT
> +	{
> +		.name = "nr_cpus",

Maybe add
           .flags = CFTYPE_NOT_ON_ROOT,

like most of the other entries here?

> +		.read_u64 = nr_cpus_read_u64,
> +		.write_u64 = nr_cpus_write_u64,
> +	},
>   #endif
>   	{
>   		.name = "proc.stat",

Also while we are here, can you please fix another related issue?

   Bug: Missing `cpus_read_lock()` in `tg_set_cpu_limit()`
   tg_set_cpu_limit() calls __tg_set_cfs_bandwidth(), which iterates over for_each_online_cpu(i)
   and takes per-CPU rq locks. However, tg_set_cpu_limit() does not hold cpus_read_lock():

    kernel/sched/core.c lines 10025-10031

       mutex_lock(&cfs_constraints_mutex);
       ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
       if (!ret) {
           tg->cpu_rate = cpu_rate;
           tg->nr_cpus = nr_cpus;
       }
       mutex_unlock(&cfs_constraints_mutex);

   Compare with tg_set_cfs_bandwidth(), which does it correctly:

    kernel/sched/core.c lines 9734-9743

   {
       int ret;
       guard(cpus_read_lock)();
       guard(mutex)(&cfs_constraints_mutex);
       ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
       tg_update_cpu_limit(tg);
       return ret;
   }

   The requirement to hold cpus_read_lock() was introduced by upstream commit 0e59bdaea75f
   ("sched/fair: Disable runtime_enabled on dying rq"), which changed the iteration in
   __tg_set_cfs_bandwidth() from for_each_possible_cpu() to for_each_online_cpu() and added
   get_online_cpus()/put_online_cpus() around the call. This was done to prevent a race between
   setting cfs_rq->runtime_enabled and unthrottle_offline_cfs_rqs().

   If a CPU goes offline while __tg_set_cfs_bandwidth() is executing inside tg_set_cpu_limit(),
   the function may re-enable runtime_enabled on a dying CPU's cfs_rq after
   unthrottle_offline_cfs_rqs() has already cleared it, leaving tasks stranded on a dead CPU with
   no way to migrate.

   The bug was inherited from the original commit 4514c5835d32f ("sched: Port CONFIG_CFS_CPULIMIT
   feature"), where tg_set_cpu_limit() was ported from vz7 (kernel 3.10) without accounting for
   the changed locking requirements. In the vz7 kernel, __tg_set_cfs_bandwidth() used
   for_each_possible_cpu(), so cpus_read_lock() was not needed.

==================================================

+ another issue with cpu.max vs nr_cpus behavior:

   Semantics: `cpu.nr_cpus` becomes passive after writing to `cpu.max`
   After the first patch, writing to cpu.max no longer resets nr_cpus (which is good), but it
   does not re-apply it either.
   The code path when writing cpu.max:
   cpu_max_write() → tg_set_cfs_bandwidth() → __tg_set_cfs_bandwidth() (sets quota/period
   directly) → tg_update_cpu_limit() (recalculates cpu_rate from quota/period, does not touch
   nr_cpus)
   This leads to a confusing scenario:

   echo 2 > cpu.nr_cpus          # limit = 2 CPUs (via CFS bandwidth)
   echo "max 100000" > cpu.max   # remove the limit
   cat cpu.nr_cpus                # reads 2 ← but there is no actual limit!

   nr_cpus is stored but has no effect until someone writes to cpu.nr_cpus again. In cgroup v2,
   where both files are visible side by side, this can mislead the user into thinking a CPU limit
   is in place when it is not.
   Possible ways to address this:
   • Make tg_update_cpu_limit() take nr_cpus into account (re-apply it when cpu.max is written)
   • Reset nr_cpus = 0 when cpu.max is written (as it was before the first patch, though that
     behavior was intentionally removed)
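
   To make the difference between the two options concrete, here is a toy userspace model. It is
   not kernel code and has not been checked against the tree. All toy_* names are made up for
   illustration, and it assumes that nr_cpus simply caps the quota at nr_cpus full periods, which
   may differ from how tg_set_cpu_limit() really combines cpu_rate and nr_cpus:

```c
#include <assert.h>
#include <stdint.h>

/* Stands in for "max" in cpu.max (hypothetical, not a kernel constant). */
#define TOY_QUOTA_MAX UINT64_MAX

/* Toy stand-in for the relevant task_group fields (assumption). */
struct toy_tg {
	uint64_t quota_us;    /* effective quota, TOY_QUOTA_MAX = unlimited */
	uint64_t period_us;
	unsigned int nr_cpus; /* 0 = not set */
};

/* Quota implied by nr_cpus alone: nr_cpus full CPUs per period. */
static uint64_t toy_nr_cpus_quota(const struct toy_tg *tg, uint64_t period_us)
{
	if (!tg->nr_cpus)
		return TOY_QUOTA_MAX;
	return (uint64_t)tg->nr_cpus * period_us;
}

/* Option 1: a cpu.max write keeps nr_cpus and re-applies it as a cap. */
static void toy_write_cpu_max_reapply(struct toy_tg *tg, uint64_t quota_us,
				      uint64_t period_us)
{
	uint64_t cap;

	assert(period_us > 0);
	cap = toy_nr_cpus_quota(tg, period_us);
	tg->period_us = period_us;
	tg->quota_us = quota_us < cap ? quota_us : cap;
}

/* Option 2: a cpu.max write takes effect as written and clears nr_cpus. */
static void toy_write_cpu_max_reset(struct toy_tg *tg, uint64_t quota_us,
				    uint64_t period_us)
{
	assert(period_us > 0);
	tg->period_us = period_us;
	tg->quota_us = quota_us;
	tg->nr_cpus = 0;
}
```

   In this model, with nr_cpus = 2 already set, writing "max 100000" under option 1 leaves an
   effective quota of 200000 and nr_cpus still reads 2. Under option 2 the limit is really gone
   and nr_cpus reads 0. Either way, what the user reads from cpu.nr_cpus matches what is enforced.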