[Devel] Re: [PATCH 11/15] Signal semantics

Fri Jul 27 12:59:43 PDT 2007

Quoting sukadev at us.ibm.com (sukadev at us.ibm.com):
> Pavel Emelianov [xemul at openvz.org] wrote:
> | Oleg Nesterov wrote:
> | >Damn. I don't have time to read these patches today (will try tomorrow),
> | 
> | Oh, that's OK. I was about to send the set to Andrew only the next week.
> | 
> | This patch is the most strange one and is to be discussed a lot.
> | 
> | We try to do the following two things:
> | 1. signals going from the namespace, that the target task doesn't
> |   see must be seen as SI_KERNEL if siginfo is allocated;
> | 2. signals to init of any namespace must be allowed to send from
> |   one of the parent namespaces only. From child namespace, init
> |   needs only those, that it's ready to handle (SIGCHLD).
> 
> Yes.
> 
> | 
> | As far as I understand Suka's approach (it's his patch, so I may 
> | be not 100% correct - it's better to wait for his comments) he is 
> | trying to carry the information about the signal up to the
> | get_signal_to_deliver().
> | 
> | As far as the first issue is concerned, the solution is obvious -
> | all the "calculations" can be done at the beginning of sending the
> | signal, but the second issue is a bit more complicated and I have
> | no good ideas of how to solve this :( yet.
> 
> Even I am looking for a better approach.
> 
> | 
> | Thanks,
> | Pavel
> | 
> | >but when I glanced at this patch yesterday I had some suspicions...
> | >
> | >On 07/26, Pavel Emelyanov wrote:
> | >>+++ linux-2.6.23-rc1-mm1-7/kernel/signal.c	2007-07-26 
> | >>16:36:37.000000000 +0400
> | >>@@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
> | >>	if (first) {
> | >>		list_del_init(&first->list);
> | >>		copy_siginfo(info, &first->info);
> | >>+		if (first->flags & SIGQUEUE_CINIT)
> | >>+			kinfo->flags |= KERN_SIGINFO_CINIT;
> | >>+
> | >>
> | >>[...snip...]
> | >>
> | >>@@ -1852,7 +1950,7 @@ relock:
> | >>		 * within that pid space. It can of course get signals from
> | >>		 * its parent pid space.
> | >>		 */
> | >>-		if (current == task_child_reaper(current))
> | >>+		if (kinfo.flags & KERN_SIGINFO_CINIT)
> | >>			continue;
> | >
> | >I think the whole idea is broken, it assumes the sender put something into
> | >"struct sigqueue".
> | 
> | Yup. That's the problem. It seems to me that the only way to handle init's
> | signals is to check for permissions in the sending path.
> 
> We can check permissions in the sending path - and in fact we do check for
> SIGKILL case (deny_signal_to_container_init() below).
> 
> But the receiver knows/decides whether or not the signal is wanted/not. No ?
> 
> Are you saying we should check/special case all fatal signals ?
> 
> | 
> | >Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
> | >signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
> | >is lost because __group_complete_signal() silently "converts" sig_fatal()
> | >signals to SIGKILL using sigaddset().
> 
> Yes, I should have called it out, but this patch currently assumes /sbin/init
> (or container-init) has a handler for the fatal signals like SIGTERM and has
> a check for SIGKILL (in deny_signal_to_container_init() - as Oleg noted below).
> 
> Still looking for better ways to implement.
> 
> | >
> | >>+static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
> | >>+{
> | >>+	/*
> | >>+	 * If sender (i.e 'current') and receiver have the same active
> | >>+	 * pid namespace and the receiver is the container-init, set the
> | >>+	 * SIGQUEUE_CINIT flag. This tells the container-init that the
> | >>+	 * signal originated in its own namespace and so it can choose
> | >>+	 * to ignore the signal.
> | >>+	 *
> | >>+	 * If the receiver is the container-init of a pid namespace,
> | >>+	 * but the sender is from an ancestor pid namespace, the
> | >>+	 * container-init cannot ignore the signal. So clear the
> | >>+	 * SIGQUEUE_CINIT flag in this case.
> | >>+	 *
> | >>+	 * Also, if the sender does not have a pid_t in the receiver's
> | >>+	 * active pid namespace, set si_pid to 0 and pretend it originated
> | >>+	 * from the kernel.
> | >>+	 */
> | >>+	if (pid_ns_equal(t)) {
> | >>+		if (is_container_init(t)) {
> | >>+			q->flags |= SIGQUEUE_CINIT;
> | >
> | >Ironically, this change carefully preserves the bug we already have :)
> | >
> | >This doesn't protect init from "bad" signal if we send it to sub-thread
> | >of init. Actually, this make the behaviour a bit worse compared to what
> | >we currently have. Currently, at least the main init's thread survives
> | >if we send SIGKILL to sub-thread.
> 
> Do you mean "init's main thread" ? But doesn't SIGKILL to any thread kill
> the entire process ?
> 
> | >
> | >>static int send_signal(int sig, struct siginfo *info, struct task_struct 
> | >>*t,
> | >>			struct sigpending *signals)
> | >>{
> | >>@@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
> | >>			copy_siginfo(&q->info, info);
> | >>			break;
> | >>		}
> | >>+		encode_sender_info(t, q);
> | >
> | >We still send the signal if __sigqueue_alloc() fails. In that case, the
> | >dequeued siginfo won't have SIGQUEUE_CINIT/KERN_SIGINFO_CINIT, not good.
> 
> Yes.
> 
> | >
> | >>@@ -1158,6 +1232,13 @@ static int kill_something_info(int sig, 
> | >>
> | >>		read_lock(&tasklist_lock);
> | >>		for_each_process(p) {
> | >>+			/*
> | >>+			 * System-wide signals apply only to the sender's
> | >>+			 * pid namespace, unless issued from init_pid_ns.
> | >>+			 */
> | >>+			if (!task_visible_in_pid_ns(p, my_ns))
> | >>+				continue;
> | >>+
> | >>			if (p->pid > 1 && p->tgid != current->tgid) {
> | >
> | >This "p->pid > 1" check should die.
> | >
> 
> Ok.
> 
> | >>+static int deny_signal_to_container_init(struct task_struct *tsk, int 
> | >>sig)
> | >>+{
> | >>+	/*
> | >>+	 * If receiver is the container-init of sender and signal is SIGKILL
> | >>+	 * reject it right-away. If signal is any other one, let the 
> | >>container
> | >>+	 * init decide (in get_signal_to_deliver()) whether to handle it or
> | >>+	 * ignore it.
> | >>+	 */
> | >>+	if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
> | >>+		return -EPERM;
> | >>+
> | >>+	return 0;
> | >>+}
> | >>+
> | >>/*
> | >> * Bad permissions for sending the signal
> | >> */
> | >>@@ -545,6 +584,10 @@ static int check_kill_permission(int sig
> | >>	    && !capable(CAP_KILL))
> | >>		return error;
> | >>
> | >>+	error = deny_signal_to_container_init(t, sig);
> | >>+	if (error)
> | >>+		return error;
> | >
> | >Hm. Could you explain this change? Why do we need a special check for 
> | >SIGKILL?
> 
> As you pointed out above, SIGKILL goes through the __group_complete_signal()/
> sigaddset() path and bypasses/loses the KERN_SIGINFO_CINIT flag. Other
> sig_fatal() signals take this path too, but we assume for now, container-init
> has a handler.
> 
> 
> | >
> | >
> | >(What about ptrace_attach() btw? If it is possible to send a signal to init
> | > from the "parent" namespace, perhaps it makes sense to allow ptracing as
> | > well).
> | 
> | ptracing of tasks fro different namespaces is not possible at all, since
> | strace utility determines the fork()-ed child pid from the parent's eax
> | register, which would contain the pid value as this parent sees his child.
> | But if the strace is in different namespace - it won't be able to find
> | this child with the pid value from parent's eax.
> | 
> | Maybe it's worth disabling cross-namespaces ptracing...
> 
> I think so too. Its probably not a serious limitation ?

Several people think we will implement 'namespace entering' through a
ptrace hack, where maybe the admin ptraces the init in a child pidns,
makes it fork, and makes the child execute what it wants (i.e. ps -ef).

You're talking about killing that functionality?

-serge