[Devel] Re: [PATCH 1/2] namespaces: introduce sys_hijack (v10)
Serge E. Hallyn
serue at us.ibm.com
Tue Nov 27 08:11:32 PST 2007
Quoting Crispin Cowan (crispin at crispincowan.com):
> Just the name "sys_hijack" makes me concerned.
>
> This post describes a bunch of "what", but doesn't tell us about "why"
> we would want this. What is it for?
Please see my response to Casey's email.
> And I second Casey's concern about careful management of the privilege
> required to "hijack" a process.
Absolutely. We're definitely still in RFC territory.
Note that there are currently several proposed (but not upstream) ways to
accomplish entering a namespace:
1. bind_ns() is a new pair of syscalls proposed by Cedric. An
nsproxy is given an integer id, which can then be used to enter
that nsproxy - basically a straight current->nsproxy = target_nsproxy;
2. I had previously posted a patchset on top of the nsproxy
cgroup which allowed entering a nsproxy through the ns cgroup
interface.
There are objections to both those patchsets because simply switching a
task's nsproxy using a syscall or file write in the middle of running a
binary is quite unsafe. Eric Biederman had suggested using ptrace or
something like it to accomplish the goal.
Just using ptrace, however, is not safe either. You are inheriting *all*
of the target's context, so it shouldn't be difficult for a nefarious
container/vserver admin to trick the host admin into running something
which gives the container/vserver admin full access to the host.
That's where the hijack idea came from. Yes, I called it hijack to make
sure alarm bells went off :) because it's definitely still worrisome. But
at this point I believe it is the safest solution suggested so far.
-serge
> Crispin
>
> Mark Nelson wrote:
> > Here's the latest version of sys_hijack.
> > Apologies for its lateness.
> >
> > Thanks!
> >
> > Mark.
> >
> > Subject: [PATCH 1/2] namespaces: introduce sys_hijack (v10)
> >
> > Move most of do_fork() into a new do_fork_task() which acts on
> > a new argument, task, rather than on current. do_fork() becomes
> > a call to do_fork_task(current, ...).
> >
> > Introduce sys_hijack (for i386 and s390 only so far). It is like
> > clone, but in place of a stack pointer (which is assumed null) it
> > accepts a pid. The process identified by that pid is the one
> > which is actually cloned. Some state - including the file
> > table, the signals and sighand (and hence the tty), and the
> > ->parent pointer - is taken from the calling process.
> >
> > A process to be hijacked may be identified by process id, in the
> > case of HIJACK_PID. Alternatively, in the case of HIJACK_CG an
> > open fd for a cgroup 'tasks' file may be specified. The first
> > available task in that cgroup will then be hijacked.
> >
> > HIJACK_NS is implemented as a third hijack method. The main
> > purpose is to allow entering an empty cgroup without having
> > to keep a task alive in the target cgroup. When HIJACK_NS
> > is used, only the cgroup membership and the nsproxy cached in
> > the target cgroup are copied. Security, user, and rootfs info
> > is not retained in the cgroup and so cannot be copied to the
> > child task.
> >
> > In order to hijack a process, the calling process must be
> > allowed to ptrace the target.
> >
> > Sending SIGSTOP to the hijacked task can trick its parent shell
> > (if the task is a shell foreground task) into thinking it should
> > retake its tty.
> >
> > So we avoid sending SIGSTOP, and instead hold the task_lock on
> > the hijacked task throughout the do_fork_task() operation.
> > This is really dangerous. I've fixed cgroup_fork() to not
> > task_lock(task) in the hijack case, but there may well be other
> > code called during fork which can under "some circumstances"
> > task_lock(task).
> >
> > Still, this is working for me.
> >
> > The effect is a sort of namespace enter. The following program
> > uses sys_hijack to 'enter' all namespaces of the specified task.
> > For instance in one terminal, do
> >
> > mount -t cgroup -ons cgroup /cgroup
> > hostname
> > qemu
> > ns_exec -u /bin/sh
> > hostname serge
> > echo $$
> > 1073
> > cat /proc/$$/cgroup
> > ns:/node_1073
> >
> > In another terminal then do
> >
> > hostname
> > qemu
> > cat /proc/$$/cgroup
> > ns:/
> > hijack pid 1073
> > hostname
> > serge
> > cat /proc/$$/cgroup
> > ns:/node_1073
> > hijack cgroup /cgroup/node_1073/tasks
> >
> > Changelog:
> > Aug 23: send a stop signal to the hijacked process
> > (like ptrace does).
> > Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
> > Don't take task_lock under rcu_read_lock
> > Send hijacked process to cgroup_fork() as
> > the first argument.
> > Removed some unneeded task_locks.
> > Oct 16: Fix bug introduced into alloc_pid.
> > Oct 16: Add 'int which' argument to sys_hijack to
> > allow later expansion to use cgroup in place
> > of pid to specify what to hijack.
> > Oct 24: Implement hijack by open cgroup file.
> > Nov 02: Switch copying of task info: do full copy
> > from current, then copy relevant pieces from
> > hijacked task.
> > Nov 06: Verbatim task_struct copy now comes from current,
> > after which copy_hijackable_taskinfo() copies
> > relevant context pieces from the hijack source.
> > Nov 07: Move arch-independent hijack code to kernel/fork.c
> > Nov 07: powerpc and x86_64 support (Mark Nelson)
> > Nov 07: Don't allow hijacking members of same session.
> > Nov 07: introduce cgroup_may_hijack, and may_hijack hook to
> > cgroup subsystems. The ns subsystem uses this to
> > enforce the rule that one may only hijack descendant
> > namespaces.
> > Nov 07: s390 support
> > Nov 08: don't send SIGSTOP to hijack source task
> > Nov 10: cache reference to nsproxy in ns cgroup for use in
> > hijacking an empty cgroup.
> > Nov 10: allow partial hijack of empty cgroup
> > Nov 13: don't double-get cgroup for hijack_ns
> > find_css_set() actually returns the set with a
> > reference already held, so cgroup_fork_fromcgroup()
> > by doing a get_css_set() was getting a second
> > reference. Therefore after exiting the hijack
> > task we could not rmdir the cgroup.
> > Nov 22: temporarily remove x86_64 and powerpc support
> > Nov 27: rebased on 2.6.24-rc3
> >
> > ==============================================================
> > hijack.c
> > ==============================================================
> > /*
> > * Your options are:
> > * hijack pid 1078
> > * hijack cgroup /cgroup/node_1078/tasks
> > * hijack ns /cgroup/node_1078/tasks
> > */
> >
> > #define _BSD_SOURCE
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > #include <sys/types.h>
> > #include <sys/wait.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> >
> > #if __i386__
> > # define __NR_hijack 325
> > #elif __s390x__
> > # define __NR_hijack 319
> > #else
> > # error "Architecture not supported"
> > #endif
> >
> > #ifndef CLONE_NEWUTS
> > #define CLONE_NEWUTS 0x04000000
> > #endif
> >
> > void usage(char *me)
> > {
> > 	printf("Usage: %s pid <pid>\n", me);
> > 	printf("     | %s cgroup <cgroup_tasks_file>\n", me);
> > 	printf("     | %s ns <cgroup_tasks_file>\n", me);
> > 	exit(1);
> > }
> >
> > int exec_shell(void)
> > {
> > 	execl("/bin/sh", "/bin/sh", NULL);
> > 	return -1;	/* reached only if execl() fails */
> > }
> >
> > #define HIJACK_PID	1
> > #define HIJACK_CG	2
> > #define HIJACK_NS	3
> >
> > int main(int argc, char *argv[])
> > {
> > 	int id;
> > 	int ret;
> > 	int status;
> > 	int which_hijack;
> >
> > 	if (argc < 3 || !strcmp(argv[1], "-h"))
> > 		usage(argv[0]);
> > 	if (strcmp(argv[1], "cgroup") == 0)
> > 		which_hijack = HIJACK_CG;
> > 	else if (strcmp(argv[1], "ns") == 0)
> > 		which_hijack = HIJACK_NS;
> > 	else
> > 		which_hijack = HIJACK_PID;
> >
> > 	switch (which_hijack) {
> > 	case HIJACK_PID:
> > 		id = atoi(argv[2]);
> > 		printf("hijacking pid %d\n", id);
> > 		break;
> > 	case HIJACK_CG:
> > 	case HIJACK_NS:
> > 		id = open(argv[2], O_RDONLY);
> > 		if (id == -1) {
> > 			perror("cgroup open");
> > 			return 1;
> > 		}
> > 		break;
> > 	}
> >
> > 	ret = syscall(__NR_hijack, SIGCHLD, which_hijack, (unsigned long)id);
> >
> > 	if (which_hijack != HIJACK_PID)
> > 		close(id);
> > 	if (ret == 0) {
> > 		return exec_shell();
> > 	} else if (ret < 0) {
> > 		perror("sys_hijack");
> > 	} else {
> > 		printf("waiting on cloned process %d\n", ret);
> > 		while (waitpid(-1, &status, __WALL) != -1)
> > 			;
> > 		printf("cloned process exited with %d (waitpid ret %d)\n",
> > 			status, ret);
> > 	}
> >
> > 	return ret;
> > }
> > ==============================================================
> >
> > Signed-off-by: Serge Hallyn <serue at us.ibm.com>
> > Signed-off-by: Mark Nelson <markn at au1.ibm.com>
> > ---
> > Documentation/cgroups.txt | 9 +
> > arch/s390/kernel/process.c | 21 +++
> > arch/x86/kernel/process_32.c | 24 ++++
> > arch/x86/kernel/syscall_table_32.S | 1
> > include/asm-x86/unistd_32.h | 3
> > include/linux/cgroup.h | 28 ++++-
> > include/linux/nsproxy.h | 12 +-
> > include/linux/ptrace.h | 1
> > include/linux/sched.h | 19 +++
> > include/linux/syscalls.h | 2
> > kernel/cgroup.c | 133 +++++++++++++++++++++++-
> > kernel/fork.c | 201 ++++++++++++++++++++++++++++++++++---
> > kernel/ns_cgroup.c | 88 +++++++++++++++-
> > kernel/nsproxy.c | 4
> > kernel/ptrace.c | 7 +
> > 15 files changed, 523 insertions(+), 30 deletions(-)
> >
> > Index: upstream/arch/s390/kernel/process.c
> > ===================================================================
> > --- upstream.orig/arch/s390/kernel/process.c
> > +++ upstream/arch/s390/kernel/process.c
> > @@ -321,6 +321,27 @@ asmlinkage long sys_clone(void)
> > parent_tidptr, child_tidptr);
> > }
> >
> > +asmlinkage long sys_hijack(void)
> > +{
> > + struct pt_regs *regs = task_pt_regs(current);
> > + unsigned long sp = regs->orig_gpr2;
> > + unsigned long clone_flags = regs->gprs[3];
> > + int which = regs->gprs[4];
> > + unsigned int fd;
> > + pid_t pid;
> > +
> > + switch (which) {
> > + case HIJACK_PID:
> > + pid = regs->gprs[5];
> > + return hijack_pid(pid, clone_flags, *regs, sp);
> > + case HIJACK_CGROUP:
> > + fd = (unsigned int) regs->gprs[5];
> > + return hijack_cgroup(fd, clone_flags, *regs, sp);
> > + default:
> > + return -EINVAL;
> > + }
> > +}
> > +
> > /*
> > * This is trivial, and on the face of it looks like it
> > * could equally well be done in user mode.
> > Index: upstream/arch/x86/kernel/process_32.c
> > ===================================================================
> > --- upstream.orig/arch/x86/kernel/process_32.c
> > +++ upstream/arch/x86/kernel/process_32.c
> > @@ -37,6 +37,7 @@
> > #include <linux/personality.h>
> > #include <linux/tick.h>
> > #include <linux/percpu.h>
> > +#include <linux/cgroup.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/pgtable.h>
> > @@ -781,6 +782,29 @@ asmlinkage int sys_clone(struct pt_regs
> > return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> > }
> >
> > +asmlinkage int sys_hijack(struct pt_regs regs)
> > +{
> > + unsigned long sp = regs.esp;
> > + unsigned long clone_flags = regs.ebx;
> > + int which = regs.ecx;
> > + unsigned int fd;
> > + pid_t pid;
> > +
> > + switch (which) {
> > + case HIJACK_PID:
> > + pid = regs.edx;
> > + return hijack_pid(pid, clone_flags, regs, sp);
> > + case HIJACK_CGROUP:
> > + fd = (unsigned int) regs.edx;
> > + return hijack_cgroup(fd, clone_flags, regs, sp);
> > + case HIJACK_NS:
> > + fd = (unsigned int) regs.edx;
> > + return hijack_ns(fd, clone_flags, regs, sp);
> > + default:
> > + return -EINVAL;
> > + }
> > +}
> > +
> > /*
> > * This is trivial, and on the face of it looks like it
> > * could equally well be done in user mode.
> > Index: upstream/arch/x86/kernel/syscall_table_32.S
> > ===================================================================
> > --- upstream.orig/arch/x86/kernel/syscall_table_32.S
> > +++ upstream/arch/x86/kernel/syscall_table_32.S
> > @@ -324,3 +324,4 @@ ENTRY(sys_call_table)
> > .long sys_timerfd
> > .long sys_eventfd
> > .long sys_fallocate
> > + .long sys_hijack /* 325 */
> > Index: upstream/Documentation/cgroups.txt
> > ===================================================================
> > --- upstream.orig/Documentation/cgroups.txt
> > +++ upstream/Documentation/cgroups.txt
> > @@ -495,6 +495,15 @@ LL=cgroup_mutex
> > Called after the task has been attached to the cgroup, to allow any
> > post-attachment activity that requires memory allocations or blocking.
> >
> > +int may_hijack(struct cgroup_subsys *ss, struct cgroup *cont,
> > + struct task_struct *task)
> > +LL=cgroup_mutex
> > +
> > +Called prior to hijacking a task. Current is cloning a new child
> > +which is hijacking cgroup, namespace, and security context from
> > +the target task. Called with the hijacked task locked. Return
> > +0 to allow.
> > +
> > void fork(struct cgroup_subsy *ss, struct task_struct *task)
> > LL=callback_mutex, maybe read_lock(tasklist_lock)
> >
> > Index: upstream/include/asm-x86/unistd_32.h
> > ===================================================================
> > --- upstream.orig/include/asm-x86/unistd_32.h
> > +++ upstream/include/asm-x86/unistd_32.h
> > @@ -330,10 +330,11 @@
> > #define __NR_timerfd 322
> > #define __NR_eventfd 323
> > #define __NR_fallocate 324
> > +#define __NR_hijack 325
> >
> > #ifdef __KERNEL__
> >
> > -#define NR_syscalls 325
> > +#define NR_syscalls 326
> >
> > #define __ARCH_WANT_IPC_PARSE_VERSION
> > #define __ARCH_WANT_OLD_READDIR
> > Index: upstream/include/linux/cgroup.h
> > ===================================================================
> > --- upstream.orig/include/linux/cgroup.h
> > +++ upstream/include/linux/cgroup.h
> > @@ -14,19 +14,23 @@
> > #include <linux/nodemask.h>
> > #include <linux/rcupdate.h>
> > #include <linux/cgroupstats.h>
> > +#include <linux/err.h>
> >
> > #ifdef CONFIG_CGROUPS
> >
> > struct cgroupfs_root;
> > struct cgroup_subsys;
> > struct inode;
> > +struct cgroup;
> >
> > extern int cgroup_init_early(void);
> > extern int cgroup_init(void);
> > extern void cgroup_init_smp(void);
> > extern void cgroup_lock(void);
> > extern void cgroup_unlock(void);
> > -extern void cgroup_fork(struct task_struct *p);
> > +extern void cgroup_fork(struct task_struct *parent, struct task_struct *p);
> > +extern void cgroup_fork_fromcgroup(struct cgroup *new_cg,
> > + struct task_struct *child);
> > extern void cgroup_fork_callbacks(struct task_struct *p);
> > extern void cgroup_post_fork(struct task_struct *p);
> > extern void cgroup_exit(struct task_struct *p, int run_callbacks);
> > @@ -236,6 +240,8 @@ struct cgroup_subsys {
> > void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cont);
> > int (*can_attach)(struct cgroup_subsys *ss,
> > struct cgroup *cont, struct task_struct *tsk);
> > + int (*may_hijack)(struct cgroup_subsys *ss,
> > + struct cgroup *cont, struct task_struct *tsk);
> > void (*attach)(struct cgroup_subsys *ss, struct cgroup *cont,
> > struct cgroup *old_cont, struct task_struct *tsk);
> > void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
> > @@ -304,12 +310,21 @@ struct task_struct *cgroup_iter_next(str
> > struct cgroup_iter *it);
> > void cgroup_iter_end(struct cgroup *cont, struct cgroup_iter *it);
> >
> > +struct cgroup *cgroup_from_fd(unsigned int fd);
> > +struct task_struct *task_from_cgroup_fd(unsigned int fd);
> > +int cgroup_may_hijack(struct task_struct *tsk);
> > #else /* !CONFIG_CGROUPS */
> > +struct cgroup {
> > +};
> >
> > static inline int cgroup_init_early(void) { return 0; }
> > static inline int cgroup_init(void) { return 0; }
> > static inline void cgroup_init_smp(void) {}
> > -static inline void cgroup_fork(struct task_struct *p) {}
> > +static inline void cgroup_fork(struct task_struct *parent,
> > + struct task_struct *p) {}
> > +static inline void cgroup_fork_fromcgroup(struct cgroup *new_cg,
> > + struct task_struct *child) {}
> > +
> > static inline void cgroup_fork_callbacks(struct task_struct *p) {}
> > static inline void cgroup_post_fork(struct task_struct *p) {}
> > static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
> > @@ -322,6 +337,15 @@ static inline int cgroupstats_build(stru
> > return -EINVAL;
> > }
> >
> > +static inline struct cgroup *cgroup_from_fd(unsigned int fd) { return NULL; }
> > +static inline struct task_struct *task_from_cgroup_fd(unsigned int fd)
> > +{
> > + return ERR_PTR(-EINVAL);
> > +}
> > +static inline int cgroup_may_hijack(struct task_struct *tsk)
> > +{
> > + return 0;
> > +}
> > #endif /* !CONFIG_CGROUPS */
> >
> > #endif /* _LINUX_CGROUP_H */
> > Index: upstream/include/linux/nsproxy.h
> > ===================================================================
> > --- upstream.orig/include/linux/nsproxy.h
> > +++ upstream/include/linux/nsproxy.h
> > @@ -3,6 +3,7 @@
> >
> > #include <linux/spinlock.h>
> > #include <linux/sched.h>
> > +#include <linux/err.h>
> >
> > struct mnt_namespace;
> > struct uts_namespace;
> > @@ -81,10 +82,17 @@ static inline void get_nsproxy(struct ns
> > atomic_inc(&ns->count);
> > }
> >
> > +struct cgroup;
> > #ifdef CONFIG_CGROUP_NS
> > -int ns_cgroup_clone(struct task_struct *tsk);
> > +int ns_cgroup_clone(struct task_struct *tsk, struct nsproxy *nsproxy);
> > +int ns_cgroup_verify(struct cgroup *cgroup);
> > +void copy_hijack_nsproxy(struct task_struct *tsk, struct cgroup *cgroup);
> > #else
> > -static inline int ns_cgroup_clone(struct task_struct *tsk) { return 0; }
> > +static inline int ns_cgroup_clone(struct task_struct *tsk,
> > + struct nsproxy *nsproxy) { return 0; }
> > +static inline int ns_cgroup_verify(struct cgroup *cgroup) { return 0; }
> > +static inline void copy_hijack_nsproxy(struct task_struct *tsk,
> > + struct cgroup *cgroup) {}
> > #endif
> >
> > #endif
> > Index: upstream/include/linux/ptrace.h
> > ===================================================================
> > --- upstream.orig/include/linux/ptrace.h
> > +++ upstream/include/linux/ptrace.h
> > @@ -97,6 +97,7 @@ extern void __ptrace_link(struct task_st
> > extern void __ptrace_unlink(struct task_struct *child);
> > extern void ptrace_untrace(struct task_struct *child);
> > extern int ptrace_may_attach(struct task_struct *task);
> > +extern int ptrace_may_attach_locked(struct task_struct *task);
> >
> > static inline void ptrace_link(struct task_struct *child,
> > struct task_struct *new_parent)
> > Index: upstream/include/linux/sched.h
> > ===================================================================
> > --- upstream.orig/include/linux/sched.h
> > +++ upstream/include/linux/sched.h
> > @@ -29,6 +29,13 @@
> > #define CLONE_NEWNET 0x40000000 /* New network namespace */
> >
> > /*
> > + * Hijack flags
> > + */
> > +#define HIJACK_PID 1 /* 'id' is a pid */
> > +#define HIJACK_CGROUP 2 /* 'id' is an open fd for a cgroup dir */
> > +#define HIJACK_NS 3 /* 'id' is an open fd for a cgroup dir */
> > +
> > +/*
> > * Scheduling policies
> > */
> > #define SCHED_NORMAL 0
> > @@ -1693,9 +1700,19 @@ extern int allow_signal(int);
> > extern int disallow_signal(int);
> >
> > extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
> > -extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
> > +extern long do_fork(unsigned long, unsigned long, struct pt_regs *,
> > + unsigned long, int __user *, int __user *);
> > struct task_struct *fork_idle(int);
> >
> > +extern int hijack_task(struct task_struct *task, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp);
> > +extern int hijack_pid(pid_t pid, unsigned long clone_flags, struct pt_regs regs,
> > + unsigned long sp);
> > +extern int hijack_cgroup(unsigned int fd, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp);
> > +extern int hijack_ns(unsigned int fd, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp);
> > +
> > extern void set_task_comm(struct task_struct *tsk, char *from);
> > extern void get_task_comm(char *to, struct task_struct *tsk);
> >
> > Index: upstream/include/linux/syscalls.h
> > ===================================================================
> > --- upstream.orig/include/linux/syscalls.h
> > +++ upstream/include/linux/syscalls.h
> > @@ -614,4 +614,6 @@ asmlinkage long sys_fallocate(int fd, in
> >
> > int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
> >
> > +asmlinkage long sys_hijack(unsigned long flags, int which, unsigned long id);
> > +
> > #endif
> > Index: upstream/kernel/cgroup.c
> > ===================================================================
> > --- upstream.orig/kernel/cgroup.c
> > +++ upstream/kernel/cgroup.c
> > @@ -44,6 +44,7 @@
> > #include <linux/kmod.h>
> > #include <linux/delayacct.h>
> > #include <linux/cgroupstats.h>
> > +#include <linux/file.h>
> >
> > #include <asm/atomic.h>
> >
> > @@ -2442,15 +2443,25 @@ static struct file_operations proc_cgrou
> > * At the point that cgroup_fork() is called, 'current' is the parent
> > * task, and the passed argument 'child' points to the child task.
> > */
> > -void cgroup_fork(struct task_struct *child)
> > +void cgroup_fork(struct task_struct *parent, struct task_struct *child)
> > {
> > - task_lock(current);
> > - child->cgroups = current->cgroups;
> > + if (parent == current)
> > + task_lock(parent);
> > + child->cgroups = parent->cgroups;
> > get_css_set(child->cgroups);
> > - task_unlock(current);
> > + if (parent == current)
> > + task_unlock(parent);
> > INIT_LIST_HEAD(&child->cg_list);
> > }
> >
> > +void cgroup_fork_fromcgroup(struct cgroup *new_cg, struct task_struct *child)
> > +{
> > + mutex_lock(&cgroup_mutex);
> > + child->cgroups = find_css_set(child->cgroups, new_cg);
> > + INIT_LIST_HEAD(&child->cg_list);
> > + mutex_unlock(&cgroup_mutex);
> > +}
> > +
> > /**
> > * cgroup_fork_callbacks - called on a new task very soon before
> > * adding it to the tasklist. No need to take any locks since no-one
> > @@ -2801,3 +2812,117 @@ static void cgroup_release_agent(struct
> > spin_unlock(&release_list_lock);
> > mutex_unlock(&cgroup_mutex);
> > }
> > +
> > +static inline int task_available(struct task_struct *task)
> > +{
> > + if (task == current)
> > + return 0;
> > + if (task_session(task) == task_session(current))
> > + return 0;
> > + switch (task->state) {
> > + case TASK_RUNNING:
> > + case TASK_INTERRUPTIBLE:
> > + return 1;
> > + default:
> > + return 0;
> > + }
> > +}
> > +
> > +struct cgroup *cgroup_from_fd(unsigned int fd)
> > +{
> > + struct file *file;
> > + struct cgroup *cgroup = NULL;
> > +
> > + file = fget(fd);
> > + if (!file)
> > + return NULL;
> > +
> > + if (!file->f_dentry || !file->f_dentry->d_sb)
> > + goto out_fput;
> > + if (file->f_dentry->d_parent->d_sb->s_magic != CGROUP_SUPER_MAGIC)
> > + goto out_fput;
> > + if (strcmp(file->f_dentry->d_name.name, "tasks"))
> > + goto out_fput;
> > +
> > + cgroup = __d_cgrp(file->f_dentry->d_parent);
> > +
> > +out_fput:
> > + fput(file);
> > + return cgroup;
> > +}
> > +
> > +/*
> > + * Takes an integer which is a open fd in current for a valid
> > + * cgroupfs file. Returns a task in that cgroup, with its
> > + * refcount bumped.
> > + * Since we have an open file on the cgroup tasks file, we
> > + * at least don't have to worry about the cgroup being freed
> > + * in the middle of this.
> > + */
> > +struct task_struct *task_from_cgroup_fd(unsigned int fd)
> > +{
> > + struct cgroup *cgroup;
> > + struct cgroup_iter it;
> > + struct task_struct *task = NULL;
> > +
> > + cgroup = cgroup_from_fd(fd);
> > + if (!cgroup)
> > + return NULL;
> > +
> > + rcu_read_lock();
> > + cgroup_iter_start(cgroup, &it);
> > + do {
> > + task = cgroup_iter_next(cgroup, &it);
> > + if (task)
> > + printk(KERN_NOTICE "task %d state %lx\n",
> > + task->pid, task->state);
> > + } while (task && !task_available(task));
> > + cgroup_iter_end(cgroup, &it);
> > + if (task)
> > + get_task_struct(task);
> > + rcu_read_unlock();
> > + return task;
> > +}
> > +
> > +/*
> > + * is current allowed to hijack tsk?
> > + * permission will also be denied elsewhere if
> > + * current may not ptrace tsk
> > + * security_task_alloc(new_task, tsk) returns -EPERM
> > + * Here we are only checking whether current may attach
> > + * to tsk's cgroup. If you can't enter the cgroup, you can't
> > + * hijack it.
> > + *
> > + * XXX TODO This means that ns_cgroup.c will need to allow
> > + * entering all descendant cgroups, not just the immediate
> > + * child.
> > + */
> > +int cgroup_may_hijack(struct task_struct *tsk)
> > +{
> > + int ret = 0;
> > + struct cgroupfs_root *root;
> > +
> > + mutex_lock(&cgroup_mutex);
> > + for_each_root(root) {
> > + struct cgroup_subsys *ss;
> > + struct cgroup *cgroup;
> > + int subsys_id;
> > +
> > + /* Skip this hierarchy if it has no active subsystems */
> > + if (!root->actual_subsys_bits)
> > + continue;
> > + get_first_subsys(&root->top_cgroup, NULL, &subsys_id);
> > + cgroup = task_cgroup(tsk, subsys_id);
> > + for_each_subsys(root, ss) {
> > + if (ss->may_hijack) {
> > + ret = ss->may_hijack(ss, cgroup, tsk);
> > + if (ret)
> > + goto out_unlock;
> > + }
> > + }
> > + }
> > +
> > +out_unlock:
> > + mutex_unlock(&cgroup_mutex);
> > + return ret;
> > +}
> > Index: upstream/kernel/fork.c
> > ===================================================================
> > --- upstream.orig/kernel/fork.c
> > +++ upstream/kernel/fork.c
> > @@ -189,7 +189,7 @@ static struct task_struct *dup_task_stru
> > return NULL;
> > }
> >
> > - setup_thread_stack(tsk, orig);
> > + setup_thread_stack(tsk, current);
> >
> > #ifdef CONFIG_CC_STACKPROTECTOR
> > tsk->stack_canary = get_random_int();
> > @@ -616,13 +616,14 @@ struct fs_struct *copy_fs_struct(struct
> >
> > EXPORT_SYMBOL_GPL(copy_fs_struct);
> >
> > -static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
> > +static inline int copy_fs(unsigned long clone_flags,
> > + struct task_struct *src, struct task_struct *tsk)
> > {
> > if (clone_flags & CLONE_FS) {
> > - atomic_inc(&current->fs->count);
> > + atomic_inc(&src->fs->count);
> > return 0;
> > }
> > - tsk->fs = __copy_fs_struct(current->fs);
> > + tsk->fs = __copy_fs_struct(src->fs);
> > if (!tsk->fs)
> > return -ENOMEM;
> > return 0;
> > @@ -962,6 +963,42 @@ static void rt_mutex_init_task(struct ta
> > #endif
> > }
> >
> > +void copy_hijackable_taskinfo(struct task_struct *p,
> > + struct task_struct *task)
> > +{
> > + p->uid = task->uid;
> > + p->euid = task->euid;
> > + p->suid = task->suid;
> > + p->fsuid = task->fsuid;
> > + p->gid = task->gid;
> > + p->egid = task->egid;
> > + p->sgid = task->sgid;
> > + p->fsgid = task->fsgid;
> > + p->cap_effective = task->cap_effective;
> > + p->cap_inheritable = task->cap_inheritable;
> > + p->cap_permitted = task->cap_permitted;
> > + p->keep_capabilities = task->keep_capabilities;
> > + p->user = task->user;
> > + /*
> > + * should keys come from parent or hijack-src?
> > + */
> > +#ifdef CONFIG_SYSVIPC
> > + p->sysvsem = task->sysvsem;
> > +#endif
> > + p->fs = task->fs;
> > + p->nsproxy = task->nsproxy;
> > +}
> > +
> > +#define HIJACK_SOURCE_TASK 1
> > +#define HIJACK_SOURCE_CG 2
> > +struct hijack_source_info {
> > + char type;
> > + union hijack_source_union {
> > + struct task_struct *task;
> > + struct cgroup *cgroup;
> > + } u;
> > +};
> > +
> > /*
> > * This creates a new process as a copy of the old one,
> > * but does not actually start it yet.
> > @@ -970,7 +1007,8 @@ static void rt_mutex_init_task(struct ta
> > * parts of the process environment (as per the clone
> > * flags). The actual kick-off is left to the caller.
> > */
> > -static struct task_struct *copy_process(unsigned long clone_flags,
> > +static struct task_struct *copy_process(struct hijack_source_info *src,
> > + unsigned long clone_flags,
> > unsigned long stack_start,
> > struct pt_regs *regs,
> > unsigned long stack_size,
> > @@ -980,6 +1018,12 @@ static struct task_struct *copy_process(
> > int retval;
> > struct task_struct *p;
> > int cgroup_callbacks_done = 0;
> > + struct task_struct *task;
> > +
> > + if (src->type == HIJACK_SOURCE_TASK)
> > + task = src->u.task;
> > + else
> > + task = current;
> >
> > if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> > return ERR_PTR(-EINVAL);
> > @@ -1007,6 +1051,10 @@ static struct task_struct *copy_process(
> > p = dup_task_struct(current);
> > if (!p)
> > goto fork_out;
> > + if (current != task)
> > + copy_hijackable_taskinfo(p, task);
> > + else if (src->type == HIJACK_SOURCE_CG)
> > + copy_hijack_nsproxy(p, src->u.cgroup);
> >
> > rt_mutex_init_task(p);
> >
> > @@ -1084,7 +1132,10 @@ static struct task_struct *copy_process(
> > #endif
> > p->io_context = NULL;
> > p->audit_context = NULL;
> > - cgroup_fork(p);
> > + if (src->type == HIJACK_SOURCE_CG)
> > + cgroup_fork_fromcgroup(src->u.cgroup, p);
> > + else
> > + cgroup_fork(task, p);
> > #ifdef CONFIG_NUMA
> > p->mempolicy = mpol_copy(p->mempolicy);
> > if (IS_ERR(p->mempolicy)) {
> > @@ -1135,7 +1186,7 @@ static struct task_struct *copy_process(
> > goto bad_fork_cleanup_audit;
> > if ((retval = copy_files(clone_flags, p)))
> > goto bad_fork_cleanup_semundo;
> > - if ((retval = copy_fs(clone_flags, p)))
> > + if ((retval = copy_fs(clone_flags, task, p)))
> > goto bad_fork_cleanup_files;
> > if ((retval = copy_sighand(clone_flags, p)))
> > goto bad_fork_cleanup_fs;
> > @@ -1167,7 +1218,7 @@ static struct task_struct *copy_process(
> > p->pid = pid_nr(pid);
> > p->tgid = p->pid;
> > if (clone_flags & CLONE_THREAD)
> > - p->tgid = current->tgid;
> > + p->tgid = task->tgid;
> >
> > p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> > /*
> > @@ -1378,8 +1429,12 @@ struct task_struct * __cpuinit fork_idle
> > {
> > struct task_struct *task;
> > struct pt_regs regs;
> > + struct hijack_source_info src;
> >
> > - task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
> > + src.type = HIJACK_SOURCE_TASK;
> > + src.u.task = current;
> > +
> > + task = copy_process(&src, CLONE_VM, 0, idle_regs(&regs), 0, NULL,
> > &init_struct_pid);
> > if (!IS_ERR(task))
> > init_idle(task, cpu);
> > @@ -1404,29 +1459,43 @@ static int fork_traceflag(unsigned clone
> > }
> >
> > /*
> > - * Ok, this is the main fork-routine.
> > - *
> > - * It copies the process, and if successful kick-starts
> > - * it and waits for it to finish using the VM if required.
> > + * if called with task!=current, then caller must ensure that
> > + * 1. it has a reference to task
> > + * 2. current must have ptrace permission to task
> > */
> > -long do_fork(unsigned long clone_flags,
> > +long do_fork_task(struct hijack_source_info *src,
> > + unsigned long clone_flags,
> > unsigned long stack_start,
> > struct pt_regs *regs,
> > unsigned long stack_size,
> > int __user *parent_tidptr,
> > int __user *child_tidptr)
> > {
> > - struct task_struct *p;
> > + struct task_struct *p, *task;
> > int trace = 0;
> > long nr;
> >
> > + if (src->type == HIJACK_SOURCE_TASK)
> > + task = src->u.task;
> > + else
> > + task = current;
> > + if (task != current) {
> > + /* sanity checks */
> > + /* we only want to allow hijacking the simplest cases */
> > + if (clone_flags & CLONE_SYSVSEM)
> > + return -EINVAL;
> > + if (current->ptrace)
> > + return -EPERM;
> > + if (task->ptrace)
> > + return -EINVAL;
> > + }
> > if (unlikely(current->ptrace)) {
> > trace = fork_traceflag (clone_flags);
> > if (trace)
> > clone_flags |= CLONE_PTRACE;
> > }
> >
> > - p = copy_process(clone_flags, stack_start, regs, stack_size,
> > + p = copy_process(src, clone_flags, stack_start, regs, stack_size,
> > child_tidptr, NULL);
> > /*
> > * Do this prior waking up the new thread - the thread pointer
> > @@ -1484,6 +1553,106 @@ long do_fork(unsigned long clone_flags,
> > return nr;
> > }
> >
> > +/*
> > + * Ok, this is the main fork-routine.
> > + *
> > + * It copies the process, and if successful kick-starts
> > + * it and waits for it to finish using the VM if required.
> > + */
> > +long do_fork(unsigned long clone_flags,
> > + unsigned long stack_start,
> > + struct pt_regs *regs,
> > + unsigned long stack_size,
> > + int __user *parent_tidptr,
> > + int __user *child_tidptr)
> > +{
> > + struct hijack_source_info src = {
> > + .type = HIJACK_SOURCE_TASK,
> > + .u = { .task = current, },
> > + };
> > + return do_fork_task(&src, clone_flags, stack_start,
> > + regs, stack_size, parent_tidptr, child_tidptr);
> > +}
> > +
> > +/*
> > + * Called with task count bumped, drops task count before returning
> > + */
> > +int hijack_task(struct task_struct *task, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp)
> > +{
> > + int ret = -EPERM;
> > + struct hijack_source_info src = {
> > + .type = HIJACK_SOURCE_TASK,
> > + .u = { .task = task, },
> > + };
> > +
> > + task_lock(task);
> > + put_task_struct(task);
> > + if (!ptrace_may_attach_locked(task))
> > + goto out_unlock_task;
> > + if (task == current)
> > + goto out_unlock_task;
> > + ret = cgroup_may_hijack(task);
> > + if (ret)
> > + goto out_unlock_task;
> > + if (task->ptrace) {
> > + ret = -EBUSY;
> > + goto out_unlock_task;
> > + }
> > +	ret = do_fork_task(&src, clone_flags, sp, &regs, 0, NULL, NULL);
> > +
> > +out_unlock_task:
> > + task_unlock(task);
> > + return ret;
> > +}
> > +
> > +int hijack_pid(pid_t pid, unsigned long clone_flags, struct pt_regs regs,
> > + unsigned long sp)
> > +{
> > + struct task_struct *task;
> > +
> > + rcu_read_lock();
> > + task = find_task_by_vpid(pid);
> > + if (task)
> > + get_task_struct(task);
> > + rcu_read_unlock();
> > +
> > + if (!task)
> > + return -EINVAL;
> > +
> > + return hijack_task(task, clone_flags, regs, sp);
> > +}
> > +
> > +int hijack_cgroup(unsigned int fd, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp)
> > +{
> > + struct task_struct *task;
> > +
> > + task = task_from_cgroup_fd(fd);
> > + if (!task)
> > + return -EINVAL;
> > +
> > + return hijack_task(task, clone_flags, regs, sp);
> > +}
> > +
> > +int hijack_ns(unsigned int fd, unsigned long clone_flags,
> > + struct pt_regs regs, unsigned long sp)
> > +{
> > + struct hijack_source_info src;
> > + struct cgroup *cgroup;
> > +
> > + cgroup = cgroup_from_fd(fd);
> > + if (!cgroup)
> > + return -EINVAL;
> > +
> > + if (!ns_cgroup_verify(cgroup))
> > + return -EINVAL;
> > +
> > + src.type = HIJACK_SOURCE_CG;
> > + src.u.cgroup = cgroup;
> > +	return do_fork_task(&src, clone_flags, sp, &regs, 0, NULL, NULL);
> > +}
> > +
> > #ifndef ARCH_MIN_MMSTRUCT_ALIGN
> > #define ARCH_MIN_MMSTRUCT_ALIGN 0
> > #endif
> > Index: upstream/kernel/ns_cgroup.c
> > ===================================================================
> > --- upstream.orig/kernel/ns_cgroup.c
> > +++ upstream/kernel/ns_cgroup.c
> > @@ -7,9 +7,11 @@
> > #include <linux/module.h>
> > #include <linux/cgroup.h>
> > #include <linux/fs.h>
> > +#include <linux/nsproxy.h>
> >
> > struct ns_cgroup {
> > struct cgroup_subsys_state css;
> > + struct nsproxy *nsproxy;
> > spinlock_t lock;
> > };
> >
> > @@ -22,9 +24,51 @@ static inline struct ns_cgroup *cgroup_t
> > struct ns_cgroup, css);
> > }
> >
> > -int ns_cgroup_clone(struct task_struct *task)
> > +int ns_cgroup_clone(struct task_struct *task, struct nsproxy *nsproxy)
> > {
> > - return cgroup_clone(task, &ns_subsys);
> > + struct cgroup *cgroup;
> > + struct ns_cgroup *ns_cgroup;
> > + int ret = cgroup_clone(task, &ns_subsys);
> > +
> > + if (ret)
> > + return ret;
> > +
> > + cgroup = task_cgroup(task, ns_subsys_id);
> > + ns_cgroup = cgroup_to_ns(cgroup);
> > + ns_cgroup->nsproxy = nsproxy;
> > + get_nsproxy(nsproxy);
> > +
> > + return 0;
> > +}
> > +
> > +int ns_cgroup_verify(struct cgroup *cgroup)
> > +{
> > + struct cgroup_subsys_state *css;
> > + struct ns_cgroup *ns_cgroup;
> > +
> > + css = cgroup_subsys_state(cgroup, ns_subsys_id);
> > + if (!css)
> > + return 0;
> > + ns_cgroup = container_of(css, struct ns_cgroup, css);
> > + if (!ns_cgroup->nsproxy)
> > + return 0;
> > + return 1;
> > +}
> > +
> > +/*
> > + * this shouldn't be called unless ns_cgroup_verify() has
> > + * confirmed that there is a ns_cgroup in this cgroup
> > + *
> > + * tsk is not yet running, and has not yet taken a reference
> > + * to its previous ->nsproxy, so we just do a simple assignment
> > + * rather than switch_task_namespaces()
> > + */
> > +void copy_hijack_nsproxy(struct task_struct *tsk, struct cgroup *cgroup)
> > +{
> > + struct ns_cgroup *ns_cgroup;
> > +
> > + ns_cgroup = cgroup_to_ns(cgroup);
> > + tsk->nsproxy = ns_cgroup->nsproxy;
> > }
> >
> > /*
> > @@ -60,6 +104,42 @@ static int ns_can_attach(struct cgroup_s
> > return 0;
> > }
> >
> > +static void ns_attach(struct cgroup_subsys *ss,
> > + struct cgroup *cgroup, struct cgroup *oldcgroup,
> > + struct task_struct *tsk)
> > +{
> > + struct ns_cgroup *ns_cgroup = cgroup_to_ns(cgroup);
> > +
> > + if (likely(ns_cgroup->nsproxy))
> > + return;
> > +
> > + spin_lock(&ns_cgroup->lock);
> > + if (!ns_cgroup->nsproxy) {
> > + ns_cgroup->nsproxy = tsk->nsproxy;
> > + get_nsproxy(ns_cgroup->nsproxy);
> > + }
> > + spin_unlock(&ns_cgroup->lock);
> > +}
> > +
> > +/*
> > + * only allow hijacking child namespaces
> > + * Q: is it crucial to prevent hijacking a task in your same cgroup?
> > + */
> > +static int ns_may_hijack(struct cgroup_subsys *ss,
> > + struct cgroup *new_cgroup, struct task_struct *task)
> > +{
> > + if (current == task)
> > + return -EINVAL;
> > +
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
> > +
> > + if (!cgroup_is_descendant(new_cgroup))
> > + return -EPERM;
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * Rules: you can only create a cgroup if
> > * 1. you are capable(CAP_SYS_ADMIN)
> > @@ -88,12 +168,16 @@ static void ns_destroy(struct cgroup_sub
> > struct ns_cgroup *ns_cgroup;
> >
> > ns_cgroup = cgroup_to_ns(cgroup);
> > + if (ns_cgroup->nsproxy)
> > + put_nsproxy(ns_cgroup->nsproxy);
> > kfree(ns_cgroup);
> > }
> >
> > struct cgroup_subsys ns_subsys = {
> > .name = "ns",
> > .can_attach = ns_can_attach,
> > + .attach = ns_attach,
> > + .may_hijack = ns_may_hijack,
> > .create = ns_create,
> > .destroy = ns_destroy,
> > .subsys_id = ns_subsys_id,
> > Index: upstream/kernel/nsproxy.c
> > ===================================================================
> > --- upstream.orig/kernel/nsproxy.c
> > +++ upstream/kernel/nsproxy.c
> > @@ -144,7 +144,7 @@ int copy_namespaces(unsigned long flags,
> > goto out;
> > }
> >
> > - err = ns_cgroup_clone(tsk);
> > + err = ns_cgroup_clone(tsk, new_ns);
> > if (err) {
> > put_nsproxy(new_ns);
> > goto out;
> > @@ -196,7 +196,7 @@ int unshare_nsproxy_namespaces(unsigned
> > goto out;
> > }
> >
> > - err = ns_cgroup_clone(current);
> > + err = ns_cgroup_clone(current, *new_nsp);
> > if (err)
> > put_nsproxy(*new_nsp);
> >
> > Index: upstream/kernel/ptrace.c
> > ===================================================================
> > --- upstream.orig/kernel/ptrace.c
> > +++ upstream/kernel/ptrace.c
> > @@ -159,6 +159,13 @@ int ptrace_may_attach(struct task_struct
> > return !err;
> > }
> >
> > +int ptrace_may_attach_locked(struct task_struct *task)
> > +{
> > + int err;
> > + err = may_attach(task);
> > + return !err;
> > +}
> > +
> > int ptrace_attach(struct task_struct *task)
> > {
> > int retval;
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> > the body of a message to majordomo at vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
>
> --
> Crispin Cowan, Ph.D. http://crispincowan.com/~crispin
> CEO, Mercenary Linux http://mercenarylinux.com/
> Itanium. Vista. GPLv3. Complexity at work
>
> -
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers