<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Yes, that race is possible in theory. I'll patch this spot. Thanks for the observation.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;<br>
<b>Sent:</b> Tuesday, March 24, 2020 2:16 PM<br>
<b>To:</b> Valeriy Vdovin &lt;Valeriy.Vdovin@virtuozzo.com&gt;; devel@openvz.org &lt;devel@openvz.org&gt;; Konstantin Khorenko &lt;khorenko@virtuozzo.com&gt;; Pavel Tikhomirov &lt;ptikhomirov@virtuozzo.com&gt;<br>
<b>Subject:</b> Re: [PATCH 2/2 v3] ve/cgroup: Added pointers to owning ve to root cgroups</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">On 24.03.2020 13:46, Valeriy Vdovin wrote:<br>
> Follow-up patch to the per-cgroup release_agent property. release_agent<br>
> notifications are spawned from a special kthread running under ve0, but<br>
> newly spawned tasks should run under their own ve context. An easy way to<br>
> pass this information to the spawning thread is to add a 've_owner' field<br>
> to each root cgroup. At notification time any cgroup can be walked up to<br>
> its root and ve_owner taken from there. ve_owner is initialized at container<br>
> start, when we know all cgroup ve roots for free. ve0 roots do not need<br>
> ve_owner set, because we can always get a pointer to ve0.<br>
> The only place where we can detach ve_owner knowledge from root ve cgroups<br>
> is cgroup_exit for ve->init_task, because that is the last time<br>
> ve->init_task is related to its ve root cgroups (see cgroup_exit comments).<br>
> <br>
> <a href="https://jira.sw.ru/browse/PSBM-83887">https://jira.sw.ru/browse/PSBM-83887</a><br>
> Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com><br>
> ---<br>
> include/linux/cgroup.h | 3 +++<br>
> kernel/cgroup.c | 45 +++++++++++++++++++++++++++++++++++++++++++++<br>
> 2 files changed, 48 insertions(+)<br>
> <br>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h<br>
> index 5e289fc..34fc017 100644<br>
> --- a/include/linux/cgroup.h<br>
> +++ b/include/linux/cgroup.h<br>
> @@ -291,6 +291,9 @@ struct cgroup {<br>
> struct simple_xattrs xattrs;<br>
> u64 subgroups_limit;<br>
> <br>
> + /* ve_owner, responsible for running release agent. */<br>
> + struct ve_struct *ve_owner;<br>
> +<br>
> /*<br>
> * Per-cgroup path to release agent binary for release<br>
> * notifications.<br>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c<br>
> index d1d4605..2500095 100644<br>
> --- a/kernel/cgroup.c<br>
> +++ b/kernel/cgroup.c<br>
> @@ -4352,6 +4352,7 @@ int cgroup_mark_ve_root(struct ve_struct *ve)<br>
> mutex_lock(&cgroup_mutex);<br>
> for_each_active_root(root) {<br>
> cgrp = task_cgroup_from_root(ve->init_task, root);<br>
> + cgrp->ve_owner = ve;<br>
> set_bit(CGRP_VE_ROOT, &cgrp->flags);<br>
> err = cgroup_add_file_on_mark_ve(cgrp);<br>
> if (err)<br>
> @@ -4363,6 +4364,20 @@ int cgroup_mark_ve_root(struct ve_struct *ve)<br>
> return err;<br>
> }<br>
> <br>
> +static void cgroup_unbind_roots_from_ve(struct ve_struct *ve)<br>
> +{<br>
> + struct cgroup *cgrp;<br>
> + struct cgroupfs_root *root;<br>
> +<br>
> + mutex_lock(&cgroup_mutex);<br>
> + for_each_active_root(root) {<br>
> + cgrp = task_cgroup_from_root(ve->init_task, root);<br>
> + BUG_ON(!cgrp->ve_owner);<br>
> + cgrp->ve_owner = NULL;<br>
> + }<br>
> + mutex_unlock(&cgroup_mutex);<br>
> +}<br>
> +<br>
> struct cgroup *cgroup_get_ve_root(struct cgroup *cgrp)<br>
> {<br>
> struct cgroup *ve_root = NULL;<br>
> @@ -5401,6 +5416,14 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)<br>
> int i;<br>
> <br>
> /*<br>
> +	 * Detect whether the task in question is ve->init_task;<br>
> +	 * if so, unbind all cgroups that want to be released under<br>
> +	 * this ve.<br>
> + */<br>
> + if (!ve_is_super(tsk->task_ve) && tsk == tsk->task_ve->init_task)<br>
> + cgroup_unbind_roots_from_ve(tsk->task_ve);<br>
> +<br>
> + /*<br>
> * Unlink from the css_set task list if necessary.<br>
> * Optimistically check cg_list before taking<br>
> * css_set_lock<br>
> @@ -5493,6 +5516,7 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> raw_spin_lock(&release_list_lock);<br>
> while (!list_empty(&release_list)) {<br>
> char *argv[3], *envp[3];<br>
> + struct ve_struct *ve = NULL;<br>
> int i, err;<br>
> const char *release_agent;<br>
> char *pathbuf = NULL, *agentbuf = NULL;<br>
> @@ -5507,8 +5531,24 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> goto continue_free;<br>
> if (__cgroup_path(cgrp, pathbuf, PAGE_SIZE, true) < 0)<br>
> goto continue_free;<br>
> +<br>
> + /*<br>
> +	 * Deduce under which VE we are going to spawn the notifier<br>
> +	 * binary. A non-root VE has its VE-local root cgroup,<br>
> +	 * marked with the VE_ROOT flag. It has a non-NULL ve_owner,<br>
> +	 * set during cgroup_mark_ve_root.<br>
> +	 * A VE0 root cgroup does not need the ve_owner field, because<br>
> +	 * its ve is ve0; it can be recognized by having no parent.<br>
> + */<br>
> root_cgrp = cgroup_get_local_root(cgrp);<br>
> + if (root_cgrp->parent == NULL)<br>
> + ve = &ve0;<br>
> + else<br>
> + ve = root_cgrp->ve_owner;<br>
> + if (!ve)<br>
> + goto continue_free;<br>
> rcu_read_lock();<br>
> +<br>
> release_agent = cgroup_release_agent_path(root_cgrp);<br>
> if (release_agent)<br>
> agentbuf = kstrdup(release_agent, GFP_KERNEL);<br>
> @@ -5531,7 +5571,12 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> * since the exec could involve hitting disk and hence<br>
> * be a slow process */<br>
> mutex_unlock(&cgroup_mutex);<br>
> +#ifdef CONFIG_VE<br>
> + err = call_usermodehelper_fns_ve(ve, argv[0], argv,<br>
> + envp, UMH_WAIT_EXEC, NULL, NULL, NULL);<br>
> +#else<br>
> err = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);<br>
> +#endif<br>
> if (err < 0)<br>
> pr_warn_ratelimited("cgroup release_agent "<br>
> "%s %s failed: %d\n",<br>
<br>
Is there a race like the one below?<br>
<br>
[cpu 0] [cpu1]<br>
cgroup_release_agent()<br>
mutex_lock(&cgroup_mutex)<br>
ve = root_cgrp->ve_owner<br>
mutex_unlock(&cgroup_mutex)<br>
<br>
cgroup_unbind_roots_from_ve(ve)<br>
cgroup->ve_owner = NULL;<br>
css_put(ve) <--- final put<br>
<br>
<br>
call_usermodehelper_fns_ve(ve)<br>
get_ve(ve) <--- gets zero counter<br>
</div>
</span></font></div>
</body>
</html>