[Devel] Re: [PATCH] pidns: Fix wait for zombies to be reaped in zap_pid_ns_processes

Tue Jul 13 14:42:35 PDT 2010

Hi Eric,

On 12/07/10 11:09 -0700, Eric W. Biederman wrote:
> 
> Fix zap_pid_ns_processes so that it successfully waits for all of
> the tasks in the pid namespace to be reaped, even if called for a
> non-leader task of the init process.  This guarantees that no task
> can escpae the pid namespace, and that it is safe for proc_flush_task
> to put the proc_mnt of the dead pid_namespace when pid 1 of the
> pid namespace calls proc_flush_task.
> 
> Before zap_pid_ns_processes was fixed to wait for all zombies
> in the pid namespace to be reaped the easiest way to generate
> a zombie that would escape the pid namespace would be to attach
> a debugger to a process inside a pidnamespace from outside the
> pid namespace and then exit the pid namespace.
> 
> In the process of trying to fix this bug I have looked at a lot
> of different options and a lot of different directions we can
> go.  There are several limiting factors.
> 
> - We need to guarantee that the init process of a pid namespace
>   is not reaped before every other task in the pid namespace is
>   reaped.  Wait succeeding on the init process of a pid namespace
>   gives the guarantee that all processes in the pid namespace
>   are dead and gone.  Or more succinctly it is not possible to
>   escape from a pid namespace.
> 
>   The previous behaviour where some zombies could escape the pid
>   namespace violates the assumption made by some reapers of a pid
>   namespace init that all of the pid namespace cleanup has completed
>   by the time that init is reaped.
> 
> - proc_flush_task needs to be called after each task is reaped.
>   Tasks are volatile and applications like top and ps frequently
>   touch every thread group directory in /proc which triggers dcache
>   entries to be created.  If we don't remove those dcache entries
>   when tasks are reaped we can get a large build up of useless
>   inodes and dentries.  Shrink_slab is designed to flush out useful
>   cache entries not useless ones so while in the big picture it doesn't
>   hurt if we leak a few if we leak a lot of dentries we put unnatural
>   pressure on the kernels memory managment.
> 
>   I sat down and attempted to measure the cost of calling
>   proc_flush_task with lat_tcp (from lmbench) and I get the same
>   range of latency readings wether or not proc_flush_task is
>   called.  Which tells me the runtime cost of the existing
>   proc_flush_task is in the noise.
> 
>   By running find /proc/ > /dev/null with proc_flush_task
>   disabled and then examining the counts in the slabcache
>   I managed to see us growing about 84 proc_inodes per
>   iteration, which is pretty horrific.  With proc_flush_task
>   enabled I don't see steady growth in any of the slab caches.
> 
> - Mounts of the /proc need a referenece to the pid namespace
>   that doesn't go away until /proc is unmounted.  Without
>   holding a reference to the pid namespace that lasts until
>   a /proc is unmounted it just isn't safe to lookup and display
>   pids in a particular pid_namespace.
> 
> - The pid_namespace needs to be able to access proc_mnt until
>   the at least the last call of proc_flush_task, for a
>   pid_namespace.
> 
>   Currently there is a the circular reference between proc_mnt
>   and the pid_namespace that we break very carefully through
>   an interaction of zap_pid_ns_processes, and proc_flush_task.
>   That clever interaction going wrong is what caused oopses
>   that led us to realize we have a problem.
> 
>   Even if we fix the pid_namespace and the proc_mnt to have
>   conjoined lifetimes and the oopses are fixed we still have
>   the problem that zombie processes can escape the pid namespace.
>   Which appears to be a problem for people using pid_namespaces
>   as inescapable process containers.
> 
> - fork/exec/waitpid is a common kernel path and as such we need
>   to keep the runtime costs down.  Which means as much as possible
>   we want to keep from adding code (especially expensive code)
>   into the fork/exec/waitpid code path.
> 
>   Changing zap_pid_ns_processes to fix the problem instead of
>   changing the code elsewhere is one of the few solutions I have
>   seen that does not increase the cost of the lat_proc test from
>   lmbench.

This patch looks like it is working (only a small RCU issue shown below). I
couldn't try it yet though.

I must admit that I am using a similar back-off solution in Kerrighed (not to
solve the issue of proc_flush_task(), but for one of the reasons that you stated
above: we want to be sure that all tasks of the namespace have been reaped), but
I considered it too ugly to propose it for Linux ;)

That said, this is probably the least intrusive solution we have seen yet.

> 
> Reported-by: Louis Rilling <Louis.Rilling at kerlabs.com>
> Signed-off-by: Eric W. Biederman <ebiederm at xmission.com>
> ---
>  kernel/pid_namespace.c |   50 ++++++++++++++++++++++++++++++++++++-----------
>  1 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a5aff94..aaf2ab0 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -139,16 +139,20 @@ void free_pid_ns(struct kref *kref)
>  
>  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  {
> +	struct task_struct *me = current;
>  	int nr;
>  	int rc;
>  	struct task_struct *task;
>  
>  	/*
> -	 * The last thread in the cgroup-init thread group is terminating.
> -	 * Find remaining pid_ts in the namespace, signal and wait for them
> -	 * to exit.
> +	 * The last task in the pid namespace-init threa group is terminating.

s/threa/thread/

> +	 * Find remaining pids in the namespace, signal and wait for them
> +	 * to to be reaped.
>  	 *
> -	 * Note:  This signals each threads in the namespace - even those that
> +	 * By waiting for all of the tasks to be reaped before init is reaped
> +	 * we provide the invariant that no task can escape the pid namespace.
> +	 *
> +	 * Note:  This signals each task in the namespace - even those that
>  	 * 	  belong to the same thread group, To avoid this, we would have
>  	 * 	  to walk the entire tasklist looking a processes in this
>  	 * 	  namespace, but that could be unnecessarily expensive if the
> @@ -157,28 +161,50 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 *
>  	 */
>  	read_lock(&tasklist_lock);
> -	nr = next_pidmap(pid_ns, 1);
> -	while (nr > 0) {
> -		rcu_read_lock();
> +	for (nr = next_pidmap(pid_ns, 0); nr > 0; nr = next_pidmap(pid_ns, nr)) {
>  
>  		/*
>  		 * Any nested-container's init processes won't ignore the
>  		 * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
>  		 */
> -		task = pid_task(find_vpid(nr), PIDTYPE_PID);
> -		if (task)
> +		rcu_read_lock();
> +		task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +		if (task && !same_thread_group(task, me))
>  			send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
> -
>  		rcu_read_unlock();
> -
> -		nr = next_pidmap(pid_ns, nr);
>  	}
>  	read_unlock(&tasklist_lock);
>  
> +	/* Nicely reap all of the remaining children in the namespac */

s/namespac/namespace/

>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
> +       
> +
> +repeat:
> +	/* Brute force wait for any remaining tasks to pass unhash_process
> +	 * in release_task.  Once a task has passed unhash_process there
> +	 * is no pid_namespace state left and they can be safely ignored.
> +	 */
> +	for (nr = next_pidmap(pid_ns, 1); nr > 0; nr = next_pidmap(pid_ns, nr)) {
> +
> +		/* Are there any tasks alive in this pid namespace */
> +		rcu_read_lock();
> +		task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +		rcu_read_unlock();
> +		if (task && !same_thread_group(task, me)) {

rcu_read_unlock() should not be called before here, or task may be freed before
calling same_thread_group().

> +			clear_thread_flag(TIF_SIGPENDING);
> +			schedule_timeout_interruptible(HZ/10);
> +			goto repeat;
> +		}
> +	}
> +	/* At this point there are at most two tasks in the pid namespace.
> +	 * These tasks are our current task, and if we aren't pid 1 the zombie
> +	 * of pid 1. In either case pid 1 will be the final task reaped in this
> +	 * pid namespace, as non-leader threads are self reaping and leaders
> +	 * cannot be reaped until all of their siblings have been reaped.
> +	 */
>  
>  	acct_exit_ns(pid_ns);
>  	return;
> -- 
> 1.6.5.2.143.g8cc62

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://lists.openvz.org/pipermail/devel/attachments/20100713/ff4ab5f0/attachment-0001.sig>
-------------- next part --------------
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers