[Devel] Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.

Eric W. Biederman ebiederm at xmission.com
Fri Mar 5 12:26:30 PST 2010


Pavel Emelyanov <xemul at parallels.com> writes:

>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>> You have pid 0 because your pid simply does not map.
>
> Oh, I see.
>
>> There is nothing that makes two parallel enters impossible in that.
>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>> which is pid 0.
>
> How about the forked processes then? Who will be their parent?

The normal rules of parentage apply.  So the child will simply see
its parent as ppid == 0.  If that child daemonizes it will become
a child of the pid namespace's init.

This is a lot like something that gets started from call_usermodehelper.  Its
parent process is not a descendant of init either.


The implementation of the join is to simply change current->nsproxy->pid_ns.
Then to use it you simply fork to get a child in the target pid namespace.

>> For the case of unshare where we are designed to be used with PAM I don't
>> think my proposed semantics work.  For a join, needing an extra fork
>> before you are really in the pid namespace should be minor.
>
> Hm... One more proposal - can we adopt the planned new fork_with_pids system
> call to fork the process right into a new pid namespace?

In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
don't think anything I am doing fundamentally undermines it.  The
advantage of doing things in fork is that there is automatic
inheritance of everything: all of the namespaces, all of the control
groups, and possibly also the parent process.  It does have the high
cost that the process we are copying from must be stopped, because
there are no locks that let us take everything at once.  I haven't
looked at the recent proposals to see if anyone has solved that
problem cleanly.



If we can do a sys_hijack/sys_cloneat style of join, that means we can
afford a fork.  At which point my proposed pid namespace semantics
should be fine.

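aka:
setns(NSTYPE_PID);
pid = fork();
if (pid == 0) {
	/* Child: runs in the joined pid namespace. */
	getpid() == 2;	/* Or whatever the first free pid is in the joined pid namespace */
	getppid() == 0;	/* The parent does not map into this namespace */
} else {
	/* Parent: still sees pids from the original pid namespace. */
	pid == 6400;	/* Or whatever the first free pid is in the original pid namespace */
	waitpid(pid, NULL, 0);
}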

>> That doesn't handle the case of cached struct pids.  A good example is
>> waitpid, where it waits for a specific struct pid.  Which means that
>> allocating a new struct pid and changing task->pid will cause
>> waitpid(pid) to wait forever...
>
> OK. Good example. Thanks.
>
>> To change struct pid would require the refcount on struct pid to show
>> no references from anywhere except the task_struct.
>
> I think it is OK to return -EBUSY for this, and to fix waitpid
> accordingly so that it does not block this common case. All the
> others, I think, can stay as is.

That would probably work.  setsid() and setpgrp() have similar sorts
of restrictions.  That is both more challenging and more limiting than
the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
would prefer to keep this sort of thing as a last resort.

>> At the cost of a little memory we can solve that problem for unshare
>> if we have an extra upid in struct pid; how we verify there is space
>> in struct pid I'm not certain.
>> 
>> I do think that at least until someone calls exec the namespace in which
>> pids are reported to the process itself should not change.  That is kill and
>
> Wait a second - in that case the wait will be blocked too! No?

If all we do is populate an unused struct upid in struct pid there
isn't a chance of a problem.  

>> waitpid etc.  Which suggests an implementation the opposite of what
>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>> pid namespace of children, and current->nsproxy->pid_ns not changing
>> in the case of unshare.
>> 
>> Shrug.
>> 
>> Or perhaps this is a case where we can implement join with
>> an extra process but we can't implement unshare, because the effect
>> cannot be immediate.
>
> Well, I'm talking only about the join now.

Overall it sounds like the semantics I have proposed with
unshare(CLONE_NEWPID) are workable, and simple to implement.  The
extra fork is a bit surprising but it certainly does not
look like a show stopper for implementing a pid namespace join.

Eric



