<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<font size="2"><span style="font-size:11pt">>> + spin_lock(&ve->referring_cgroups_lock);<br>
>> + list_add_tail(&ve->referring_cgroups, &cgrp->ve_reference);<br>
>> + spin_unlock(&ve->referring_cgroups_lock);<br>
><br>
>How does this work?<br>
><br>
>ve is the entry here, and you link it to the ve_reference list.<br>
>If this function is called for another cgroup from the same ve,<br>
>ve will be linked into another list. How can it be linked to two lists<br>
>at the same time?</span></font></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<font size="2"><span style="font-size:11pt"><br>
</span></font></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<font size="2"><span style="font-size:11pt">Yes, the argument ordering in list_add_tail is wrong: the cgroup is being added to the ve's list, so the call should be list_add_tail(&cgrp->ve_reference, &ve->referring_cgroups). My bug, thanks.</span></font><br>
</div>
<div id="appendonsend"></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Kirill Tkhai <ktkhai@virtuozzo.com><br>
<b>Sent:</b> Friday, March 13, 2020 2:08 PM<br>
<b>To:</b> Valeriy Vdovin <Valeriy.Vdovin@virtuozzo.com>; devel@openvz.org <devel@openvz.org>; Konstantin Khorenko <khorenko@virtuozzo.com>; Pavel Tikhomirov <ptikhomirov@virtuozzo.com><br>
<b>Subject:</b> Re: [Devel] [PATCH 2/2 v0] ve/cgroup: Added cross links between cgroups and owning ve</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="PlainText">On 13.03.2020 13:09, Valeriy Vdovin wrote:<br>
> Follow-up patch to the per-cgroup release_agent property. Currently<br>
> there is a problem with release_agent notification semantics: all<br>
> notification processes spawn under ve0 (the host). Because of that,<br>
> any process inside a container expecting a notification will never<br>
> get it, since the notification never reaches the container's<br>
> namespaces. To address this, the process should be launched from the<br>
> namespaces of the ve owning the cgroup. So we need to somehow know<br>
> the right 've' when spawning a notification process. This time it's<br>
> not possible to use current->task_ve, because notifications are<br>
> spawned from a dedicated kthread which knows which cgroup needs to be<br>
> released, but knows nothing about which ve 'owns' that cgroup.<br>
> Naturally, this can be solved by adding a 've_owner' field to<br>
> 'struct cgroup'.<br>
> <br>
> That alone is not enough, because a cgroup's lifespan isn't bound to<br>
> the ve's: the ve might be destroyed earlier than the cgroup, and the<br>
> opposite is also true. At any time a cgroup's link to its ve might<br>
> become invalid without proper cleanup.<br>
> <br>
> To manage this we add a list of cgroups to the ve structure. All<br>
> cgroups that need a reference to their owning ve are added to this<br>
> list. At destruction the ve clears all references to itself from the<br>
> cgroups on this list.<br>
> <br>
> <a href="https://jira.sw.ru/browse/PSBM-83887">https://jira.sw.ru/browse/PSBM-83887</a><br>
> Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com><br>
> ---<br>
> include/linux/cgroup.h | 11 ++++++++<br>
> include/linux/ve.h | 19 +++++++++++++<br>
> kernel/cgroup.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++<br>
> kernel/ve/ve.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++<br>
> 4 files changed, 178 insertions(+)<br>
> <br>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h<br>
> index cad5b4f..833f249 100644<br>
> --- a/include/linux/cgroup.h<br>
> +++ b/include/linux/cgroup.h<br>
> @@ -287,6 +287,17 @@ struct cgroup {<br>
> u64 subgroups_limit;<br>
> <br>
> /*<br>
> + * Cgroups that have a non-empty release_agent_path get on<br>
> + * this list. At ve destruction, the ve walks this list<br>
> + * to unlink itself from each cgroup.<br>
> + */<br>
> + struct list_head ve_reference;<br>
> +<br>
> + /* ve_owner, responsible for running release agent. */<br>
> + struct ve_struct *ve_owner;<br>
> + struct mutex ve_owner_mutex;<br>
> +<br>
> + /*<br>
> * Per-cgroup path to release agent binary for release<br>
> * notifications.<br>
> */<br>
> diff --git a/include/linux/ve.h b/include/linux/ve.h<br>
> index 9d60838..49f772c 100644<br>
> --- a/include/linux/ve.h<br>
> +++ b/include/linux/ve.h<br>
> @@ -91,6 +91,17 @@ struct ve_struct {<br>
> <br>
> unsigned long down_at;<br>
> struct list_head cleanup_list;<br>
> + /*<br>
> + * cgroups that need to be unreferenced from this ve<br>
> + * at this ve's destruction<br>
> + */<br>
> + struct list_head referring_cgroups;<br>
> +<br>
> + /*<br>
> + * operations on referring_cgroups are performed under this<br>
> + * lock<br>
> + */<br>
> + spinlock_t referring_cgroups_lock;<br>
> unsigned long meminfo_val;<br>
> int _randomize_va_space;<br>
> <br>
> @@ -268,6 +279,14 @@ static inline struct cgroup *cgroup_get_ve_root(struct cgroup *cgrp)<br>
> struct seq_file;<br>
> struct kernel_cpustat;<br>
> <br>
> +/*<br>
> + * A cgroup needs to know its owning ve for some operations, but a<br>
> + * cgroup's lifetime is independent of the ve's: in theory a ve can be<br>
> + * destroyed earlier than some of its cgroups.<br>
> + */<br>
> +void ve_add_referring_cgroup(struct ve_struct *ve, struct cgroup *cgrp);<br>
> +void ve_remove_referring_cgroups(struct ve_struct *ve);<br>
> +<br>
> #if defined(CONFIG_VE) && defined(CONFIG_CGROUP_SCHED)<br>
> int ve_show_cpu_stat(struct ve_struct *ve, struct seq_file *p);<br>
> int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p);<br>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c<br>
> index 0b64d88..3ecdb5b 100644<br>
> --- a/kernel/cgroup.c<br>
> +++ b/kernel/cgroup.c<br>
> @@ -1442,10 +1442,12 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)<br>
> INIT_LIST_HEAD(&cgrp->allcg_node);<br>
> INIT_LIST_HEAD(&cgrp->release_list);<br>
> INIT_LIST_HEAD(&cgrp->pidlists);<br>
> + INIT_LIST_HEAD(&cgrp->ve_reference);<br>
> mutex_init(&cgrp->pidlist_mutex);<br>
> INIT_LIST_HEAD(&cgrp->event_list);<br>
> spin_lock_init(&cgrp->event_list_lock);<br>
> simple_xattrs_init(&cgrp->xattrs);<br>
> + mutex_init(&cgrp->ve_owner_mutex);<br>
> }<br>
> <br>
> static void init_cgroup_root(struct cgroupfs_root *root)<br>
> @@ -2325,6 +2327,14 @@ static int cgroup_release_agent_write(struct cgroup *cgrp, struct cftype *cft,<br>
> if (!cgroup_lock_live_group(cgrp))<br>
> return -ENODEV;<br>
> <br>
> +#ifdef CONFIG_VE<br>
> + /*<br>
> + * This is the only place where we can bind ve information to<br>
> + * a cgroup, assuming that the writer does it inside the<br>
> + * right VE.<br>
> + */<br>
> + ve_add_referring_cgroup(get_exec_env(), cgrp);<br>
> +#endif<br>
> ret = cgroup_set_release_agent_locked(cgrp, buffer);<br>
> mutex_unlock(&cgroup_mutex);<br>
> return ret;<br>
> @@ -4534,6 +4544,36 @@ static void css_ref_killed_fn(struct percpu_ref *ref)<br>
> cgroup_css_killed(css->cgroup);<br>
> }<br>
> <br>
> +/*<br>
> + * cgroup_unreference_ve removes the cgroup's reference to its ve.<br>
> + * ve_owner_mutex ensures the ve pointer stays valid while we check<br>
> + * whether the cgroup is still on the ve's list. If not, the ve is<br>
> + * already removing the cgroup itself, so we don't need to do it.<br>
> + */<br>
> +void cgroup_unreference_ve(struct cgroup *cgrp)<br>
> +{<br>
> + struct list_head *pos;<br>
> + /*<br>
> + * We can only safely query ve_owner under its lock, because<br>
> + * the ve could zero it out during destruction.<br>
> + */<br>
> + mutex_lock(&cgrp->ve_owner_mutex);<br>
> + if (cgrp->ve_owner) {<br>
> + struct ve_struct *ve = cgrp->ve_owner;<br>
> + spin_lock(&ve->referring_cgroups_lock);<br>
> + list_for_each(pos, &ve->referring_cgroups) {<br>
> + if (pos == &cgrp->ve_reference) {<br>
> + list_del(&cgrp->ve_reference);<br>
> + break;<br>
> + }<br>
> + }<br>
> + spin_unlock(&ve->referring_cgroups_lock);<br>
> + cgrp->ve_owner = NULL;<br>
> + }<br>
> + mutex_unlock(&cgrp->ve_owner_mutex);<br>
> +}<br>
> +<br>
> +<br>
> /**<br>
> * cgroup_destroy_locked - the first stage of cgroup destruction<br>
> * @cgrp: cgroup to be destroyed<br>
> @@ -4645,6 +4685,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)<br>
> }<br>
> spin_unlock(&cgrp->event_list_lock);<br>
> <br>
> + cgroup_unreference_ve(cgrp);<br>
> +<br>
> kfree(cgrp->release_agent_path);<br>
> <br>
> return 0;<br>
> @@ -5455,6 +5497,7 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> raw_spin_lock(&release_list_lock);<br>
> while (!list_empty(&release_list)) {<br>
> char *argv[3], *envp[3];<br>
> + struct ve_struct *ve;<br>
> int i, err;<br>
> char *pathbuf = NULL, *agentbuf = NULL;<br>
> struct cgroup *root_cgrp;<br>
> @@ -5468,7 +5511,33 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> goto continue_free;<br>
> if (__cgroup_path(cgrp, pathbuf, PAGE_SIZE, true) < 0)<br>
> goto continue_free;<br>
> +<br>
> + /*<br>
> + * root_cgrp is the relative root for cgrp: for host<br>
> + * cgroups root_cgrp is root->top_cgroup; for container<br>
> + * cgroups it is the ancestor of cgrp up the parent chain<br>
> + * marked as VE_ROOT.<br>
> + */<br>
> root_cgrp = cgroup_get_local_root(cgrp);<br>
> +<br>
> + /*<br>
> + * Under the lock we can safely check that ve_owner is not<br>
> + * NULL. While the cgroup holds a non-NULL pointer to the VE<br>
> + * we can safely take a reference, and while that reference<br>
> + * is held we can safely access the VE's memory. If the VE<br>
> + * undergoes destruction at this point, at worst<br>
> + * call_usermodehelper_fns_ve will gracefully fail, which<br>
> + * nobody cares about during destruction.<br>
> + */<br>
> + ve = NULL;<br>
> + mutex_lock(&root_cgrp->ve_owner_mutex);<br>
> + if (root_cgrp->ve_owner) {<br>
> + ve = root_cgrp->ve_owner;<br>
> + get_ve(ve);<br>
> + }<br>
> + mutex_unlock(&root_cgrp->ve_owner_mutex);<br>
> + if (!ve)<br>
> + goto continue_free;<br>
> if (root_cgrp->release_agent_path)<br>
> agentbuf = kstrdup(root_cgrp->release_agent_path,<br>
> GFP_KERNEL);<br>
> @@ -5490,7 +5559,13 @@ static void cgroup_release_agent(struct work_struct *work)<br>
> * since the exec could involve hitting disk and hence<br>
> * be a slow process */<br>
> mutex_unlock(&cgroup_mutex);<br>
> +#ifdef CONFIG_VE<br>
> + err = call_usermodehelper_fns_ve(ve, argv[0], argv,<br>
> + envp, UMH_WAIT_EXEC, NULL, NULL, NULL);<br>
> + put_ve(ve);<br>
> +#else<br>
> err = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);<br>
> +#endif<br>
> if (err < 0)<br>
> pr_warn_ratelimited("cgroup release_agent "<br>
> "%s %s failed: %d\n",<br>
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c<br>
> index a64b4a7..4ead5cc 100644<br>
> --- a/kernel/ve/ve.c<br>
> +++ b/kernel/ve/ve.c<br>
> @@ -575,6 +575,13 @@ void ve_stop_ns(struct pid_namespace *pid_ns)<br>
> ve->is_running = 0;<br>
> <br>
> /*<br>
> + * After is_running is cleared, we can safely clean up the<br>
> + * referring cgroups list. Further cgroups will fail the<br>
> + * is_running check and not add themselves.<br>
> + */<br>
> + ve_remove_referring_cgroups(ve);<br>
> +<br>
> + /*<br>
> * Neither it can be in pseudosuper state<br>
> * anymore, setup it again if needed.<br>
> */<br>
> @@ -704,6 +711,7 @@ do_init:<br>
> INIT_LIST_HEAD(&ve->devices);<br>
> INIT_LIST_HEAD(&ve->ve_list);<br>
> INIT_LIST_HEAD(&ve->devmnt_list);<br>
> + INIT_LIST_HEAD(&ve->referring_cgroups);<br>
> mutex_init(&ve->devmnt_mutex);<br>
> <br>
> #ifdef CONFIG_AIO<br>
> @@ -1713,3 +1721,68 @@ int ve_get_cpu_stat(struct ve_struct *ve, struct kernel_cpustat *kstat)<br>
> }<br>
> EXPORT_SYMBOL(ve_get_cpu_stat);<br>
> #endif /* CONFIG_CGROUP_SCHED */<br>
> +<br>
> +/*<br>
> + * Add cgroup to list of cgroups that hold a pointer to this<br>
> + * ve. List and ve is protected by lock.<br>
> + */<br>
> +void ve_add_referring_cgroup(struct ve_struct *ve, struct cgroup *cgrp)<br>
> +{<br>
> + /*<br>
> + * The only case where we would add a cgroup while the ve<br>
> + * isn't running is during or after ve_stop_ns, and by then<br>
> + * it is already too late to add.<br>
> + */<br>
> + if (!ve->is_running)<br>
> + return;<br>
> +<br>
> + /*<br>
> + * ve_owner is protected by lock, so that cgroup could safely<br>
> + * call get_ve if ve_owner is not NULL<br>
> + */<br>
> + mutex_lock(&cgrp->ve_owner_mutex);<br>
> + cgrp->ve_owner = ve;<br>
> + mutex_unlock(&cgrp->ve_owner_mutex);<br>
> +<br>
> + spin_lock(&ve->referring_cgroups_lock);<br>
> + list_add_tail(&ve->referring_cgroups, &cgrp->ve_reference);<br>
> + spin_unlock(&ve->referring_cgroups_lock);<br>
<br>
How does this work?<br>
<br>
ve is the entry here, and you link it to the ve_reference list.<br>
If this function is called for another cgroup from the same ve,<br>
ve will be linked into another list. How can it be linked to two lists<br>
at the same time?<br>
<br>
> +}<br>
> +<br>
> +/*<br>
> + * ve_remove_referring_cgroups is called at VE destruction to<br>
> + * eliminate all cgroup references to this ve.<br>
> + */<br>
> +void ve_remove_referring_cgroups(struct ve_struct *ve)<br>
> +{<br>
> + LIST_HEAD(list);<br>
> + struct list_head *pos, *next;<br>
> + BUG_ON(ve->is_running);<br>
> +<br>
> + /*<br>
> + * Put ve list on stack to resolve cross locking of<br>
> + * ve_owner_mutex and referring_cgroups_lock.<br>
> + */<br>
> + spin_lock(&ve->referring_cgroups_lock);<br>
> + list_splice_init(&ve->referring_cgroups, &list);<br>
> + spin_unlock(&ve->referring_cgroups_lock);<br>
> +<br>
> + list_for_each_safe(pos, next, &list) {<br>
> + struct cgroup *cgrp;<br>
> + cgrp = list_entry(pos, struct cgroup, ve_reference);<br>
> + list_del(&cgrp->ve_reference);<br>
> + /*<br>
> + * This lock protects use of the ve pointer on the cgroup side.<br>
> + * 1. To remove itself from the ve's list, a cgroup first<br>
> + * needs a valid ve_owner pointer in order to find itself<br>
> + * in that list; without the lock it could read a pointer<br>
> + * that looks valid and then work with a freed ve<br>
> + * structure.<br>
> + * 2. A cgroup may want to call get_ve at some point; the<br>
> + * same logic as in 1. applies here.<br>
> + */<br>
> + mutex_lock(&cgrp->ve_owner_mutex);<br>
> + cgrp->ve_owner = NULL;<br>
> + mutex_unlock(&cgrp->ve_owner_mutex);<br>
> + }<br>
> +}<br>
> <br>
<br>
</div>
</span></font></div>
</body>
</html>