[Devel] [PATCH rh7 01/14] Port diff-ve-ns-allow-create-new-pid-ipc-and-utc-namespaces

Tue Jun 23 09:29:37 PDT 2015

Author: Pavel Tikhomirov
Email: ptikhomirov at parallels.com
Subject: ve: allow creating new pid, ipc and uts namespaces.
Date: Thu, 25 Dec 2014 17:57:21 +0300

We already allow this for a user with CAP_SYS_ADMIN; allow it for
CAP_VE_SYS_ADMIN as well.
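
The effect on the permission check can be sketched in user space. This is
only a model, not the kernel code: struct caps, check_old() and check_new()
are hypothetical names, and the two booleans stand in for the results of
ns_capable(user_ns, CAP_SYS_ADMIN) and ns_capable(user_ns, CAP_VE_SYS_ADMIN).
The old variant is modelled without the force_admin bypass.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values as defined in the Linux uapi header <linux/sched.h>. */
#define CLONE_NEWUTS 0x04000000UL
#define CLONE_NEWIPC 0x08000000UL
#define CLONE_NEWPID 0x20000000UL
#define CLONE_NEWNET 0x40000000UL

/* Hypothetical stand-in for the two ns_capable() results. */
struct caps {
	bool sys_admin;    /* models ns_capable(user_ns, CAP_SYS_ADMIN) */
	bool ve_sys_admin; /* models ns_capable(user_ns, CAP_VE_SYS_ADMIN) */
};

/* Old check: a VE admin is refused for UTS, IPC and NET namespaces. */
int check_old(struct caps c, unsigned long flags)
{
	if (!c.sys_admin && !c.ve_sys_admin)
		return -EPERM;
	if (!c.sys_admin &&
	    (flags & (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET)))
		return -EPERM;
	return 0;
}

/* New check: either capability is enough, whatever the flags are. */
int check_new(struct caps c, unsigned long flags)
{
	(void)flags; /* the flags no longer influence the decision */
	if (!c.sys_admin && !c.ve_sys_admin)
		return -EPERM;
	return 0;
}
```

With only ve_sys_admin set, check_old() refuses CLONE_NEWUTS while
check_new() accepts it; with neither capability both return -EPERM.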

This works if IPC, PID and UTS namespaces can be nested. Explanation:

@@ UTS namespace @@

If we clone with the CLONE_NEWUTS flag, a new uts_namespace structure is
allocated and put in the nsproxy of the new task. Every API that accesses
those names goes through this new copy of struct uts_namespace:
+ uname (through uts_ns->name, via the utsname() function, using the
uts-namespace structures), + newuname, + getdomainname, + gethostname,
+ setdomainname, + sethostname.

We can allow nested uts-namespaces because they do not intersect with
each other: one cannot access the uname of another uts-namespace. They
are not really nested; they are more like independent namespaces.
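
A minimal user-space sketch of that independence, assuming a much
simplified model: struct model_uts_ns, struct model_task and the model_*()
helpers are hypothetical and only mimic the idea that a task reaches its
names through its own namespace pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_CLONE_NEWUTS 0x04000000UL /* same value as CLONE_NEWUTS */

/* Hypothetical, reduced stand-ins for uts_namespace and task/nsproxy. */
struct model_uts_ns { char nodename[65]; };
struct model_task { struct model_uts_ns *uts_ns; };

/*
 * Without the flag the child shares the parent's namespace; with it the
 * child gets a private copy. Returns 0 on success, -1 if malloc fails.
 */
int model_clone(const struct model_task *parent, struct model_task *child,
		unsigned long flags)
{
	if (!(flags & MODEL_CLONE_NEWUTS)) {
		child->uts_ns = parent->uts_ns;
		return 0;
	}
	child->uts_ns = malloc(sizeof(*child->uts_ns));
	if (!child->uts_ns)
		return -1;
	*child->uts_ns = *parent->uts_ns; /* start from the parent's names */
	return 0;
}

/* Every accessor goes through the task's own namespace pointer. */
void model_sethostname(struct model_task *t, const char *name)
{
	size_t max = sizeof(t->uts_ns->nodename) - 1;

	strncpy(t->uts_ns->nodename, name, max);
	t->uts_ns->nodename[max] = '\0';
}

const char *model_gethostname(const struct model_task *t)
{
	return t->uts_ns->nodename;
}
```

Renaming the host in a task cloned with MODEL_CLONE_NEWUTS leaves the
parent's name untouched, while a task cloned without the flag keeps
sharing the parent's name.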

@@ IPC namespace @@
@ System V IPC @
When we clone a process with CLONE_NEWIPC, a struct ipc_namespace is
created and put in the nsproxy of the new task. Separate idr structures
are also created for the ids of that task's IPC semaphores, message
queues and shared memory. The syscalls are aware of them: + ipcget,
+ msgsnd, + msgctl.

One can access only the objects of one's own ipc-namespace, so this part
nests well.

The following /proc interfaces are distinct in each IPC namespace:

The System V IPC interfaces in /proc/sys/kernel, namely: msgmax, msgmnb,
msgmni, sem, shmall, shmmax, shmmni, and shm_rmid_forced,
because proc_ipc_dointvec is namespace aware and one can

The System V IPC interfaces in /proc/sysvipc, because sysvipc_proc_seqops
is aware of the ipc-namespace.

@ POSIX message queues @
A new POSIX message queue filesystem is allocated and registered; its
syscalls are ipc-namespace aware too: + mq_open, + mq_unlink.

The following /proc interfaces are distinct in each IPC namespace:
The POSIX message queue interfaces in /proc/sys/fs/mqueue,
because proc_mq_dointvec is namespace aware.

IPC namespaces are ready for nesting: structures, syscalls, proc.

@@ PID namespace @@
When we clone a process with CLONE_NEWPID, a struct pid_namespace is
allocated (along with its pidmap) and put in the task's nsproxy.
Syscalls access pids through the nsproxy of the task:
+ getpid
+ wait4

For a new process task2, a new pid is allocated in every pid-namespace
of the hierarchy. task1 from pid-ns1 can see task2 from pid-ns2 if
ns1 == ns2 or ns1 is an ancestor of ns2; task1 then sees task2's pid at
the level of ns1.
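
The visibility rule can be sketched in user space as follows. This is a
simplified model, not the kernel code: the model_* types and functions are
hypothetical, the depth limit MODEL_MAX_LEVEL is arbitrary, and pid numbers
are handed out by a plain counter rather than a pidmap. The lookup follows
the same idea as pid_nr_ns().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 8 /* arbitrary depth limit for this sketch */

struct model_pid_ns {
	struct model_pid_ns *parent; /* NULL for the root namespace */
	int level;                   /* 0 for the root namespace */
	int last_pid;                /* simple counter instead of a pidmap */
};

/* One (number, namespace) pair per level, like struct upid. */
struct model_upid {
	int nr;
	struct model_pid_ns *ns;
};

struct model_pid {
	int level;
	struct model_upid numbers[MODEL_MAX_LEVEL];
};

/*
 * Allocate one number in ns and in each of its ancestors.
 * Returns 0 on success, -1 if the namespace is nested too deeply.
 */
int model_alloc_pid(struct model_pid_ns *ns, struct model_pid *pid)
{
	struct model_pid_ns *cur = ns;
	int i;

	if (ns->level >= MODEL_MAX_LEVEL)
		return -1;
	pid->level = ns->level;
	for (i = ns->level; i >= 0; i--) {
		pid->numbers[i].nr = ++cur->last_pid;
		pid->numbers[i].ns = cur;
		cur = cur->parent;
	}
	return 0;
}

/* The pid as seen from ns, or 0 if the task is not visible from there. */
int model_pid_nr_ns(const struct model_pid *pid,
		    const struct model_pid_ns *ns)
{
	if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

A task created in a child namespace gets a number in that namespace and
another one in the parent; a sibling namespace at the same level gets 0
back, meaning the task is not visible from there.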

If procfs is mounted from a new pid-namespace, it will list only the pids
that are in this namespace, according to the level of this namespace.
proc_pid_readdir { ns = filp->f_dentry->d_sb->s_fs_info;} // == ns
proc_get_sb -> sget -> proc_set_super { sb->s_fs_info = get_pid_ns(ns);}
[this happens when proc is mounted from a pid-namespace:
vfs_kern_mount -> proc_get_sb]

Can we accidentally kill 'init'? - No(+): the SIGNAL_UNKILLABLE flag is
set for the child_reaper of a namespace. All signals except SIGKILL
and SIGSTOP will be ignored by it.
In send_signal, from_ancestor_ns is determined; then, according to it,
in sig_task_ignored the signal is ignored if it is not from an ancestor
namespace, the task has SIGNAL_UNKILLABLE and its handler is the default.
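
The condition described above can be written down as a small predicate.
This is a sketch of that one condition only, with hypothetical names
(struct model_sig_task, model_sig_task_ignored); the real sig_task_ignored
performs further checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

enum model_handler { MODEL_SIG_DFL, MODEL_SIG_HANDLED };

/* Hypothetical, reduced view of the receiving task. */
struct model_sig_task {
	bool unkillable;            /* models SIGNAL_UNKILLABLE being set */
	enum model_handler handler; /* disposition of the signal */
};

/*
 * True if the signal is dropped: the receiver is unkillable, has the
 * default handler, and the sender is not in an ancestor namespace.
 */
bool model_sig_task_ignored(const struct model_sig_task *t,
			    bool from_ancestor_ns)
{
	return t->unkillable &&
	       t->handler == MODEL_SIG_DFL &&
	       !from_ancestor_ns;
}
```

For a namespace init with the default handler, a signal sent from inside
the namespace is dropped, while the same signal sent from an ancestor
namespace is not dropped by this check.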

So pid namespace is fully nested: structures, syscalls, proc, signals.

Signed-off-by: Pavel Tikhomirov <ptikhomirov at parallels.com>
Acked-by: Pavel Emelyanov <xemul at parallels.com>
=============================================================================

While we are here, zap the force_admin argument of copy_namespaces,
because it does not make sense anymore, plus drop get_task_namespaces,
which is unused.

Related to https://jira.sw.ru/browse/PSBM-33650

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 include/linux/nsproxy.h |  2 +-
 kernel/fork.c           |  2 +-
 kernel/nsproxy.c        | 31 +++++--------------------------
 kernel/ve/vecalls.c     |  2 +-
 4 files changed, 8 insertions(+), 29 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 9d529abd9055..493c701e3e9c 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -62,7 +62,7 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 	return rcu_dereference(tsk->nsproxy);
 }
 
-int copy_namespaces(unsigned long flags, struct task_struct *tsk, int force_admin);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
 void free_nsproxy(struct nsproxy *ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 911dcc384638..5e03c7d6e9e3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1401,7 +1401,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_mm(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_signal;
-	retval = copy_namespaces(clone_flags, p, 0);
+	retval = copy_namespaces(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
 	retval = copy_io(clone_flags, p);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 79983abdf563..81402a84257b 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -41,14 +41,6 @@ struct nsproxy init_nsproxy = {
 #endif
 };
 
-void get_task_namespaces(struct task_struct *tsk)
-{
-	struct nsproxy *ns = tsk->nsproxy;
-	if (ns) {
-		get_nsproxy(ns);
-	}
-}
-
 static inline struct nsproxy *create_nsproxy(void)
 {
 	struct nsproxy *nsproxy;
@@ -128,8 +120,7 @@ out_ns:
  * called from clone.  This now handles copy for nsproxy and all
  * namespaces therein.
  */
-int copy_namespaces(unsigned long flags, struct task_struct *tsk,
-		int force_admin)
+int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 {
 	struct nsproxy *old_ns = tsk->nsproxy;
 	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
@@ -145,18 +136,10 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk,
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
-	if (!force_admin) {
-		if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-		    !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) {
-			err = -EPERM;
-			goto out;
-		}
-
-		if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-		    (flags & (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET))) {
-			err = -EPERM;
-			goto out;
-		}
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
+	    !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out;
 	}
 
 	/*
@@ -219,10 +202,6 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 		!ns_capable(user_ns, CAP_VE_SYS_ADMIN))
 		return -EPERM;
 
-	if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-	    (unshare_flags & (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET)))
-		return -EPERM;
-
 	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
 					 new_fs ? new_fs : current->fs);
 	if (IS_ERR(*new_nsp)) {
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index e2c9021b63a2..e262c5ed56f6 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -195,7 +195,7 @@ static inline int init_ve_namespaces(void)
 
 	err = copy_namespaces(CLONE_NEWUTS | CLONE_NEWIPC |
 			      CLONE_NEWPID | CLONE_NEWNET,
-			      current, 1);
+			      current);
 	if (err < 0)
 		return err;
 
-- 
2.1.4