[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone
Albert Cahalan
acahalan at gmail.com
Sat Jul 3 16:41:30 PDT 2010
On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
<sukadev at linux.vnet.ibm.com> wrote:
> +struct clone_args {
> + u64 clone_flags_high;
> + u64 child_stack_base;
> + u64 child_stack_size;
> + u64 parent_tid_ptr;
> + u64 child_tid_ptr;
> + u32 nr_pids;
> + u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> + pid_t * __user pids)
I don't see why cargs_size is needed for expansion if you have flags.
> + The order of pids in @pids is oldest in pids[0] to youngest pid
> + namespace in pids[nr_pids-1]. If the number of pids specified in the
> + @pids list is fewer than the nesting level of the process, the pids
> + are applied from youngest namespace. I.e if the process is nested in
> + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> + have a pid of '0' (the kernel will assign a pid in those namespaces).
That feels backwards. I'd have guessed pids[0] is how the
process sees itself. You'd truncate the array to reduce nesting
level rather than pointing into it.
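To make the two orderings concrete, here is a minimal sketch of the index-to-level mapping under each one. The function names are made up for illustration and nothing here comes from the patch itself. It assumes levels are numbered 0 (init namespace) up to the nesting level of the process, as in the quoted text.

```c
/*
 * Ordering documented in the patch: pids[0] is the oldest namespace
 * covered and pids[nr_pids - 1] is the youngest (the process's own).
 * Returns the namespace level that pids[i] applies to.
 */
static int level_patch_order(int nesting_level, int nr_pids, int i)
{
    return nesting_level - (nr_pids - 1) + i;
}

/*
 * Ordering suggested in the reply: pids[0] is how the process sees
 * itself, and each later entry is one namespace further out.
 * The mapping does not depend on nr_pids, so truncating the array
 * only drops the outer levels.
 */
static int level_self_first(int nesting_level, int i)
{
    return nesting_level - i;
}
```

With the patch's own example (level-6 namespace, 3 pids), the first mapping puts pids[0] at level 4 and pids[2] at level 6, while the second puts pids[0] at level 6 and pids[2] at level 4.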
> + On failure, eclone() returns -1 and sets 'errno' to one of following
> + values (the child process is not created).
Careful here: do you intend to document the system call itself,
or an expected glibc wrapper that doesn't exist yet?
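The difference matters because the raw system call reports failure as a negative errno value in its return value, while a libc-style wrapper turns that into -1 and sets 'errno'. A minimal sketch of that conversion, assuming the usual Linux convention that raw return values in [-4095, -1] are errors; wrap_raw_result is a made-up name, not anything glibc provides:

```c
#include <errno.h>

/*
 * Convert a raw Linux system call return value into the libc
 * convention: -1 with errno set on failure, the value itself on
 * success. Assumes errors are encoded as values in [-4095, -1].
 */
static long wrap_raw_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

Documentation for the system call itself would describe the first convention; the quoted text describes the second.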
> + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
> + specify the pids in this call (if pids are not specifed
> + CAP_SYS_ADMIN is not required).
It seems appropriate to let PID 1 in any PID namespace be
able to assign PIDs in its own namespace and in any
child namespaces.
> + EINVAL The child_stack_size field is not 0 (on architectures that
> + pass in a stack pointer in ->child_stack field).
need to change this
> + "int $0x80\n\t" /* Linux/i386 system call */
> + "testl %0,%0\n\t" /* check return value */
> + "jne 1f\n\t" /* jump if parent */
> +
> + "popl %%esi\n\t" /* get subthread function */
> + "call *%%esi\n\t" /* start subthread function */
> + "movl %2,%0\n\t"
> + "int $0x80\n" /* exit system call: exit subthread */
...
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
...
> + *--stack = child_arg;
> + *--stack = child_fn;
...
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> + unsigned int flags_low, int nr_pids, pid_t *pids_list)
There needs to be a way to pass child_fn and child_arg
via the kernel. Besides being required for kernel-managed
stacks, it's normally a saner interface. Stack setup would
be much like the stack setup for signal handlers. Imagine
using this for a vfork-like interface that didn't have painful
interactions with the compiler.
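As a rough sketch of what the user-space side could look like: the first six fields below copy struct clone_args from the quoted patch, and the child_fn and child_arg fields are purely hypothetical additions, as are the struct name and both functions. How the kernel would build the child's initial stack frame from them is not shown.

```c
#include <stdint.h>
#include <string.h>

/*
 * HYPOTHETICAL layout. Only the first six fields exist in the patch;
 * child_fn and child_arg are invented here for illustration.
 */
struct clone_args_entry {
    uint64_t clone_flags_high;
    uint64_t child_stack_base;
    uint64_t child_stack_size;
    uint64_t parent_tid_ptr;
    uint64_t child_tid_ptr;
    uint32_t nr_pids;
    uint32_t reserved0;
    uint64_t child_fn;   /* hypothetical: where the kernel starts the child */
    uint64_t child_arg;  /* hypothetical: argument handed to child_fn */
};

/* Example child entry point; returns the int that arg points to. */
static int sample_child(void *arg)
{
    return *(int *)arg;
}

/*
 * Zero the structure and record the entry point and its argument.
 * The function-pointer-to-integer cast is the usual POSIX-style
 * assumption, not strict ISO C.
 */
static void set_child_entry(struct clone_args_entry *ca,
                            int (*fn)(void *), void *arg)
{
    memset(ca, 0, sizeof(*ca));
    ca->child_fn = (uint64_t)(uintptr_t)fn;
    ca->child_arg = (uint64_t)(uintptr_t)arg;
}
```

With something like this, the caller would not need to push child_fn and child_arg onto a hand-built stack or write the assembly trampoline shown in the quoted example.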
Speaking of vfork....
1. can you implement it for i386 (register starved) using eclone?
2. can you restart a pair of processes between vfork and execve?