[Devel] [PATCH RHEL7 COMMIT] ve/ns: Port diff-ve-ns-allow-create-new-pid-ipc-and-utc-namespaces

Konstantin Khorenko khorenko at virtuozzo.com
Wed Jun 24 03:41:25 PDT 2015


The commit is pushed to "branch-rh7-3.10.0-123.1.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.17
------>
commit fde27c9b69a854d34c72988528601b6973b1312f
Author: Vladimir Davydov <vdavydov at parallels.com>
Date:   Wed Jun 24 14:41:25 2015 +0400

    ve/ns: Port diff-ve-ns-allow-create-new-pid-ipc-and-utc-namespaces
    
    Author: Pavel Tikhomirov
    Email: ptikhomirov at parallels.com
    Subject: ve: allow create new pid, ipc and utc namespaces.
    Date: Thu, 25 Dec 2014 17:57:21 +0300
    
    We already allow this for users with CAP_SYS_ADMIN; allow it for
    CAP_VE_SYS_ADMIN as well.
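
The effect of the relaxed check can be sketched as a small user-space model. This is only an illustration of the permission logic visible in the diff of copy_namespaces, not kernel code: the MODEL_* constants (whose values mirror the Linux CLONE_NEW* flags) and the model_* functions are hypothetical names, and the two booleans stand in for the results of ns_capable().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names; the values mirror the Linux CLONE_NEW* flags. */
#define MODEL_CLONE_NEWNS   0x00020000UL
#define MODEL_CLONE_NEWUTS  0x04000000UL
#define MODEL_CLONE_NEWIPC  0x08000000UL
#define MODEL_CLONE_NEWPID  0x20000000UL
#define MODEL_CLONE_NEWNET  0x40000000UL
#define MODEL_NS_FLAGS (MODEL_CLONE_NEWNS | MODEL_CLONE_NEWUTS | \
			MODEL_CLONE_NEWIPC | MODEL_CLONE_NEWPID | \
			MODEL_CLONE_NEWNET)

/* Model of the check before this commit (with the force_admin argument). */
int model_check_before(unsigned long flags, bool sys_admin,
		       bool ve_sys_admin, bool force_admin)
{
	if (!(flags & MODEL_NS_FLAGS))
		return 0;		/* no new namespace requested */
	if (force_admin)
		return 0;		/* caller asked to skip the checks */
	if (!sys_admin && !ve_sys_admin)
		return -EPERM;
	/* CAP_VE_SYS_ADMIN alone was not enough for UTS, IPC or NET. */
	if (!sys_admin && (flags & (MODEL_CLONE_NEWUTS |
				    MODEL_CLONE_NEWIPC | MODEL_CLONE_NEWNET)))
		return -EPERM;
	return 0;
}

/* Model of the check after this commit: either capability suffices. */
int model_check_after(unsigned long flags, bool sys_admin, bool ve_sys_admin)
{
	if (!(flags & MODEL_NS_FLAGS))
		return 0;
	if (!sys_admin && !ve_sys_admin)
		return -EPERM;
	return 0;
}

/* Usage sketch: a task holding only CAP_VE_SYS_ADMIN asks for a new UTS ns. */
void model_demo_ve_admin(void)
{
	assert(model_check_before(MODEL_CLONE_NEWUTS, false, true, false) == -EPERM);
	assert(model_check_after(MODEL_CLONE_NEWUTS, false, true) == 0);
}
```

In this model a task with neither capability is still refused, and a request without any CLONE_NEW* flag never reaches the capability check.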
    
    This works if IPC, PID and UTS namespaces can be nested. Explanation:
    
    @@ UTS namespace @@
    
    If we clone with the CLONE_NEWUTS flag, a new uts_namespace structure is
    allocated and put in the nsproxy of the new task. All APIs accessing
    those names go through this new copy of the uts_namespace struct:
    + uname (through uts_ns->name, the utsname() function, using uts-namespace
    structures), + newuname, + getdomainname, + gethostname, +
    setdomainname, + sethostname.
    
    We can allow nested uts-namespaces because they do not intersect with
    each other: one cannot access the uname of another uts-namespace. They
    won't really be nested, more like independent.
    
    @@ IPC namespace @@
    @ System V IPC @
    When we clone a process with CLONE_NEWIPC, a struct ipc_namespace is
    created and put in the nsproxy of the new task. Separate idr structures
    for the ids of IPC semaphores, message queues and shared memory
    are also created for that task. The syscalls are aware of them: + ipcget,
    + msgsnd, + msgctl.
    
    One can access only the objects of its own ipc-namespace, so this part
    nests well.
    
    The following /proc interfaces are distinct in each IPC namespace:
    
    The System V IPC interfaces in /proc/sys/kernel, namely: msgmax, msgmnb,
    msgmni, sem, shmall, shmmax, shmmni, and shm_rmid_forced,
    because proc_ipc_dointvec is namespace aware and one can
    
    The System V IPC interfaces in /proc/sysvipc, because sysvipc_proc_seqops
    are aware of ipc-namespace
    
    @ POSIX message queues @
    A new POSIX message queue filesystem is allocated and registered; its
    syscalls are ipc-namespace aware too: + mq_open, + mq_unlink.
    
    The following /proc interfaces are distinct in each IPC namespace:
    The POSIX message queue interfaces in /proc/sys/fs/mqueue
    because proc_mq_dointvec is namespace aware
    
    IPC namespaces are ready for nesting: structures, syscalls, proc.
    
    @@ PID namespace @@
    When we clone a process with CLONE_NEWPID, a struct pid_namespace is
    allocated and created (same for its pidmap) and put in the task's
    nsproxy. Syscalls access pids through the nsproxy of the task:
    + getpid
    + wait4
    
    For a new process task2, a new pid is allocated in every pid-namespace
    of the hierarchy. task1 from pid-ns1 can see task2 from pid-ns2 if
    ns1==ns2 or ns1 is an ancestor of ns2; task1 then sees task2's pid at
    the level of ns1.
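
The visibility rule can be illustrated with a simplified user-space model of per-level pid numbers. This is a sketch under stated assumptions, not the kernel's data structures: every model_* type, function and object below is hypothetical, the pid values are made up for the example, and the lookup only imitates the idea that a pid carries one number per namespace level.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4	/* arbitrary nesting limit for the model */

struct model_pid_ns {
	int level;				/* 0 for the initial namespace */
	const struct model_pid_ns *parent;
};

struct model_upid {
	int nr;					/* pid value at this level */
	const struct model_pid_ns *ns;		/* namespace owning that value */
};

struct model_pid {
	int level;				/* level of the task's own namespace */
	struct model_upid numbers[MODEL_MAX_LEVEL];	/* entries 0..level */
};

/* Return the pid as seen from ns, or 0 if the task is not visible there. */
int model_pid_nr_ns(const struct model_pid *pid, const struct model_pid_ns *ns)
{
	if (pid && ns && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}

/* Example hierarchy: root -> ct -> nested, plus an unrelated sibling of ct. */
const struct model_pid_ns model_ns_root    = { 0, NULL };
const struct model_pid_ns model_ns_ct      = { 1, &model_ns_root };
const struct model_pid_ns model_ns_nested  = { 2, &model_ns_ct };
const struct model_pid_ns model_ns_sibling = { 1, &model_ns_root };

/* task1 lives in ct, task2 lives in nested (acting as its init, pid 1). */
const struct model_pid model_task1 = {
	1, { { 4000, &model_ns_root }, { 56, &model_ns_ct } }
};
const struct model_pid model_task2 = {
	2, { { 4021, &model_ns_root }, { 57, &model_ns_ct },
	     { 1, &model_ns_nested } }
};

/* Usage sketch: an ancestor sees task2 at its own level, a sibling does not. */
void model_demo_visibility(void)
{
	assert(model_pid_nr_ns(&model_task2, &model_ns_ct) == 57);
	assert(model_pid_nr_ns(&model_task2, &model_ns_sibling) == 0);
}
```

A task in the nested namespace cannot look upward in this model: asking for model_task1 from model_ns_nested yields 0, because task1 has no number at that level.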
    
    If procfs is mounted from a new pid-namespace, it lists only the pids
    that are in this namespace, according to the level of this namespace.
    proc_pid_readdir { ns = filp->f_dentry->d_sb->s_fs_info;} // == ns
    proc_get_sb -> sget -> proc_set_super { sb->s_fs_info = get_pid_ns(ns);}
    [it is when mount proc from pid-namespace vfs_kern_mount -> proc_get_sb]
    
    Can we accidentally kill 'init'? - No(+): the SIGNAL_UNKILLABLE flag is
    set for the child_reaper of the namespace. All signals except SIGKILL
    and SIGSTOP will be ignored by it.
    In send_signal, from_ancestor_ns is determined; then, based on it,
    sig_task_ignored ignores the signal if it is not from an ancestor
    namespace, the task has SIGNAL_UNKILLABLE and its handler is the default.
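
The decision described for sig_task_ignored can be sketched as a pure function. This is a deliberately reduced user-space model of the rule stated above, not the kernel function: the model_* names are hypothetical, and the model leaves out signals whose default action is already "ignore".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the disposition of the signal being sent. */
enum model_handler { MODEL_SIG_DFL, MODEL_SIG_IGN, MODEL_SIG_CUSTOM };

struct model_task {
	bool unkillable;		/* models SIGNAL_UNKILLABLE (namespace init) */
	enum model_handler handler;	/* disposition for this signal */
};

/*
 * Return true if the signal is dropped.
 * from_ancestor_ns: the sender lives in an ancestor pid namespace.
 */
bool model_sig_task_ignored(const struct model_task *t, bool from_ancestor_ns)
{
	/* Namespace init with a default handler: only ancestors get through. */
	if (t->unkillable && t->handler == MODEL_SIG_DFL && !from_ancestor_ns)
		return true;
	/* Otherwise the signal is dropped only if explicitly ignored. */
	return t->handler == MODEL_SIG_IGN;
}

/* Example tasks: a namespace init and an ordinary process. */
const struct model_task model_ns_init  = { true,  MODEL_SIG_DFL };
const struct model_task model_ordinary = { false, MODEL_SIG_DFL };

/* Usage sketch: init shrugs off a signal from inside its own namespace. */
void model_demo_signal(void)
{
	assert(model_sig_task_ignored(&model_ns_init, false));
	assert(!model_sig_task_ignored(&model_ns_init, true));
}
```

In the model, an ordinary task with a default handler is never shielded, which matches the point that the protection is tied to the child_reaper of the namespace.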
    
    So pid namespace is fully nested: structures, syscalls, proc, signals.
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at parallels.com>
    
    Acked-by: Pavel Emelyanov <xemul at parallels.com>
    =============================================================================
    
    While we are here, zap the force_admin argument of copy_namespaces,
    because it does not make sense anymore, plus drop get_task_namespaces,
    which is unused.
    
    Related to https://jira.sw.ru/browse/PSBM-33650
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
---
 include/linux/nsproxy.h |  2 +-
 kernel/fork.c           |  2 +-
 kernel/nsproxy.c        | 31 +++++--------------------------
 kernel/ve/vecalls.c     |  2 +-
 4 files changed, 8 insertions(+), 29 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 9d529ab..493c701 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -62,7 +62,7 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 	return rcu_dereference(tsk->nsproxy);
 }
 
-int copy_namespaces(unsigned long flags, struct task_struct *tsk, int force_admin);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
 void free_nsproxy(struct nsproxy *ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 911dcc3..5e03c7d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1401,7 +1401,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_mm(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_signal;
-	retval = copy_namespaces(clone_flags, p, 0);
+	retval = copy_namespaces(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
 	retval = copy_io(clone_flags, p);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 79983ab..81402a8 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -41,14 +41,6 @@ struct nsproxy init_nsproxy = {
 #endif
 };
 
-void get_task_namespaces(struct task_struct *tsk)
-{
-	struct nsproxy *ns = tsk->nsproxy;
-	if (ns) {
-		get_nsproxy(ns);
-	}
-}
-
 static inline struct nsproxy *create_nsproxy(void)
 {
 	struct nsproxy *nsproxy;
@@ -128,8 +120,7 @@ out_ns:
  * called from clone.  This now handles copy for nsproxy and all
  * namespaces therein.
  */
-int copy_namespaces(unsigned long flags, struct task_struct *tsk,
-		int force_admin)
+int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 {
 	struct nsproxy *old_ns = tsk->nsproxy;
 	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
@@ -145,18 +136,10 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk,
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
-	if (!force_admin) {
-		if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-		    !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) {
-			err = -EPERM;
-			goto out;
-		}
-
-		if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-		    (flags & (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET))) {
-			err = -EPERM;
-			goto out;
-		}
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
+	    !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out;
 	}
 
 	/*
@@ -219,10 +202,6 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 		!ns_capable(user_ns, CAP_VE_SYS_ADMIN))
 		return -EPERM;
 
-	if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-	    (unshare_flags & (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET)))
-		return -EPERM;
-
 	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
 					 new_fs ? new_fs : current->fs);
 	if (IS_ERR(*new_nsp)) {
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index e2c9021..e262c5e 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -195,7 +195,7 @@ static inline int init_ve_namespaces(void)
 
 	err = copy_namespaces(CLONE_NEWUTS | CLONE_NEWIPC |
 			      CLONE_NEWPID | CLONE_NEWNET,
-			      current, 1);
+			      current);
 	if (err < 0)
 		return err;
 


