[Devel] [PATCH RHEL7 COMMIT] ve: fix task_ve use-after-free from reading /proc/pid/stat
Konstantin Khorenko
khorenko at virtuozzo.com
Fri Jul 12 13:08:17 MSK 2024
The commit is pushed to "branch-rh7-3.10.0-1160.119.1.vz7.224.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.119.1.vz7.224.1
------>
commit e870cc6d153bed0b4df08e4ca0438b274c4bbb53
Author: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Date: Tue Jul 9 16:48:21 2024 +0800
ve: fix task_ve use-after-free from reading /proc/pid/stat
A task can exit and become a zombie, and outlive its ve. Because exited
task is moved into init_css_set, so it does not hold any reference back
to ve.
More specifically that can happen in two cases:
1) If task is not inside pid namespace of ve and zap_pid_ns_processes
will not reap/collect the task;
2) If task is an init of root pid namespace of ve, nothing guarantees
that it can't become a zombie;
And if someone reads e.g.: /proc/pid/stat for such a task it may lead to
a crash on accessing freed ->task_ve memory.
https://virtuozzo.atlassian.net/browse/PSBM-157285
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
kernel/ve/ve.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 70e588b7df4c..ff81fce4adcd 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -839,6 +839,14 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
if (!ve->ve_ns || ve->ve_ns->pid_ns != pid_ns)
return;
+ /*
+ * To prevent use-after-free (e.g. reading /proc/pid/stat) in case
+ * zombie container init outlives the ve_struct.
+ *
+ * Note: synchronize_rcu in ve_drop_context makes it rcu safe.
+ */
+ rcu_assign_pointer(current->task_ve, &ve0);
+
cgroup_unmark_ve_roots(ve);
ve_workqueue_stop(ve);
@@ -1115,6 +1123,32 @@ static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
on_each_cpu(ve_update_cpuid_faulting, NULL, 1);
}
+/*
+ * When a container task exits it is moved into init_css_set, see cgroup_exit.
+ * So the task effectively puts its reference to ve_struct of the container,
+ * and now nothing prevents ve_struct from being freed while the task is still
+ * there.
+ *
+ * We need to make sure that this task's ->task_ve is not pointing to the
+ * container ve before putting the reference, else the use of ->task_ve from
+ * the task can lead to use-after-free (e.g. reading /proc/pid/stat).
+ *
+ * One exception to that rule is the init process of the container, it needs
+ * ->task_ve in ve_stop_ns / ve_exit_ns, and holds a reference to ve_struct.
+ */
+static void ve_exit(struct cgroup *cg, struct cgroup *oldcg, struct task_struct *task)
+{
+ struct ve_struct *ve = task->task_ve;
+
+ /*
+ * Pairs with synchronize_rcu in ve_grab_context and ve_drop_context
+ */
+ rcu_read_lock();
+ if (!ve->ve_ns || ve->init_task != task)
+ rcu_assign_pointer(task->task_ve, &ve0);
+ rcu_read_unlock();
+}
+
static int ve_state_read(struct cgroup *cg, struct cftype *cft,
struct seq_file *m)
{
@@ -1898,6 +1932,7 @@ struct cgroup_subsys ve_subsys = {
.css_free = ve_destroy,
.can_attach = ve_can_attach,
.attach = ve_attach,
+ .exit = ve_exit,
.base_cftypes = ve_cftypes,
};
EXPORT_SYMBOL(ve_subsys);
More information about the Devel
mailing list