[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

Mon Jul 5 20:59:19 PDT 2010

On Mon, Jul 5, 2010 at 12:18 AM, Oren Laadan <orenl at cs.columbia.edu> wrote:
> Matt Helsley wrote:
>> On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:
>>> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
>>> <sukadev at linux.vnet.ibm.com> wrote:

> It follows that trying to set pid's in pid-namespaces _below_ you
> simply doesn't make sense (beyond the CLONE_NEWPID case).

I may have some wrong ideas about how process restart works,
but I'd thought it would normally be done from above or from PID 1
in the same pid namespace.

> Finally, there have been objections before to allowing pid-selection
> by non-privileged processes.

Eh, I dearly hope that privileged processes are generally not
even addressable (never mind creatable or accessible) from
inside anything other than the top-level pid namespace.

Well, at least nothing should get more privilege than PID 1.
That means being limited to UID values that PID 1 can switch
to and capability sets that PID 1 can switch to, and likewise
for any other security attributes (SELinux, AppArmor, etc.).

Restarting a privileged process with a less privileged PID 1
should result in privilege loss, and ought to require some sort of
"--force" option to ensure the person accepts possible breakage.

>>>> +static int do_clone(int (*child_fn)(void *), void *child_arg,
>>>> +               unsigned int flags_low, int nr_pids, pid_t *pids_list)
>>>
>>> There needs to be a way to pass child_fn and child_arg
>>> via the kernel. Besides being required for kernel-managed
>>> stacks, it's normally a saner interface. Stack setup would
>>> be much like the stack setup for signal handlers. Imagine
>>
>> I'm inclined to say this is a bad idea.
>>
>> I didn't think we had "kernel-managed stacks" in mainline. The most we
>> have, to my knowledge, is the sigaltstack support and kernel threads.
>>
>> I don't see how being able to pass in child_fn and child_arg to the
>> kernel improves the sanity of the interface. If anything it will make
>> eclone even more exotic -- now at the end of the syscall we'll
>> need to mess with the registers/stack of the child much like when we're
>> invoking a signal handler. That just adds more arch-specific code than is
>> necessary.
>>
>> Userspace wrappers are perfectly capable of invoking the child function
>> and passing the arguments. Furthermore, passing those arguments requires
>> expanding the argument structure or putting even greater pressure on
>> registers (which, as you pointed out below, is an issue for vfork).

BSD's rfork_thread has, among other things, these two arguments:

int (*func)(void *arg)
void *arg
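
For comparison, glibc's clone() wrapper on Linux takes the same kind
of function/argument pair and does the child-side call in userspace.
Below is a minimal sketch of that style of interface, not the eclone
code from the patch. The names exit_with_value and run_in_child are
made up for illustration, and it assumes a downward-growing stack and
a child that does not share memory with the parent (no CLONE_VM).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical child function: its return value becomes the
 * child's exit status. */
static int exit_with_value(void *arg)
{
    return *(int *)arg;
}

/* Hypothetical helper: run fn(arg) in a clone()d child on a separate
 * stack, wait for it, and return its exit status (-1 on error). */
static int run_in_child(int (*fn)(void *), void *arg)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int status;
    pid_t pid;

    if (!stack)
        return -1;

    /* Assumes the stack grows down, so pass the top of the buffer. */
    pid = clone(fn, stack + stack_size, SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }

    /* Without CLONE_VM the child ran on its own copy of the stack. */
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would use it as run_in_child(exit_with_value, &v) and get
the value of v back as the exit status. The child never returns
through the parent's stack frame, which is the property at issue
for a vfork-like interface.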

>>> using this for a vfork-like interface that didn't have painful
>>> interactions with the compiler.
>
> Pardon my ignorance - what sort of painful interactions ?

The child returns from vfork via the same return address that
the parent will later use (on many architectures, that address
lives on the stack). The child then calls a function which might
not have the same stack layout as vfork, overwriting whatever is
on the stack that the parent needs in order to return from vfork.
The parent may then end up using a corrupted return address.
To make this work, gcc actually recognizes vfork and has
special handling for it.
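
To illustrate the constraint, here is a minimal sketch of the usual
conservative vfork pattern: the child does nothing except exec or
_exit, and never returns from the function that called vfork. The
helper name spawn_and_wait is made up for illustration; this is one
common way to stay within the rules, not something from the patch.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run path with argv via vfork + execv and
 * return the child's exit status (-1 on error). */
static int spawn_and_wait(const char *path, char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: it is borrowing the parent's stack and memory, so
         * it only calls execv or _exit and never returns from here. */
        execv(path, argv);
        _exit(127);             /* reached only if execv failed */
    }

    /* Parent: resumes once the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_and_wait("/bin/sh", argv) with argv set to
{ "sh", "-c", "exit 3", NULL } returns the shell's exit status.
An interface that takes func and arg sidesteps the restriction,
since the child starts in func rather than returning from the
call site.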