[Devel] [PATCH v3 4/9] user namespace support for upstream containers

Tue May 14 00:32:25 PDT 2013

On 05/14/2013 07:02 AM, Kir Kolyshkin wrote:
> Oh my, four cases of whitespace-at-eol which I had to fixed manually.
> 
Sorry about that. I think I got to used to checkpatch and the such that
I forget to verify all of those manually.

> Wanted to apply it nevertheless and fix some things myself, but then the
> list grew out too big.
> 
> Plus please see inline.
> 
Here we go.

>>   /* This function is there in GLIBC, but not in headers */
>>   extern int pivot_root(const char * new_root, const char * put_old);
>>   @@ -138,10 +140,41 @@ static int _env_create(void *data)
>>       struct env_create_param3 create_param;
>>       int ret;
>>   -    if ((ret = ct_chroot(arg->res->fs.root)))
>> +    if ((arg->userns_p != -1) && (read(arg->userns_p, &ret,
>> sizeof(ret)) == 0))
> 
> 1 I'd suggest comparing with sizeof(ret)
As you wish.

> 2 We can't print any errors here, can we? If we can I'd rather do it.
> 
I believe logger works in here, yes.
>> +        return -1;
> 
> You're supposed to return VZ_* exit codes from here.
> 

ok.

>> +
>> +    ret = ct_chroot(arg->res->fs.root);
>> +    close(arg->userns_p);
>> +    /* Probably means chroot failed */
>> +    if (ret)
>>           return ret;
>>   -    if ((ret = vps_set_cap(arg->veid, &arg->res->env,
>> &arg->res->cap, 1)))
>> +    if (arg->h->can_join_userns) {
>> +        int fd;
>> +        setuid(0);
>> +        setgid(0);
>> +        /*
>> +         * We need the special flag "newinstance". This is a requirement
>> +         * of the userns-aware implementation of devpts as of Linux 3.9.
>> +         * Because of that special requirement, we do it here rather
>> than
>> +         * later.
>> +         */
>> +        mount("devpts", "/dev/pts", "devpts", 0, "newinstance");
> 
> What if mount point does not exist, or is not a directory?
> 
> I think it makes sense to
> 1. try creating it, ignoring EEXIST
> 2. check return from mount()
> 
Check return and then what? It is not clear for me if we should abort or
not when things go wrong. It seems perfectly valid that the container
continues. I have the same dilemma with the other devices.

>> +        /* /dev/ptmx, if it even exists, would refer to the root ptmx.
>> +         * We don't want that, we want our newly created instance to
>> contain
>> +         * all ptys. So we bind mount the root device here
>> +         */
>> +        fd = open("/dev/ptmx", O_RDWR|O_CREAT, 0);
> 
> You do open() to create the /dev/ptmx file in case it does not exist,
> right?
> Why RDWR then?
> 
No special reason.

>> +/*
>> + * Those devices should exist in the container, and be valid device
>> nodes with
>> + * user access permission. But we need to be absolutely sure this is
>> the case,
>> + * so we will provide our own versions. That could actually happen
>> since some
>> + * distributions may come with emptied /dev's, waiting for udev to
>> populate them.
>> + * That won't happen, we do it ourselves.
>> + */
>> +static void create_devices(vps_handler *h, envid_t veid, const char
>> *root)
>> +{
>> +    unsigned int i;
>> +    char *devices[] = {
>> +        "/dev/null",
>> +        "/dev/zero",
>> +        "/dev/random",
>> +        "/dev/urandom",
>> +    };
>> +
>> +    /*
>> +     * We will tolerate errors, and keep the container running,
>> because it is
>> +     * likely we will be able to boot it to a barely functional
>> state. But
>> +     * be vocal about it
>> +     */
>> +    for (i = 0; i < ARRAY_SIZE(devices); i++) {
>> +        char ct_devname[STR_SIZE];
>> +        int ret;
>> +
>> +        ret = snprintf(ct_devname, sizeof(ct_devname), "%s/%s", root,
>> devices[i]);
>> +        if (ret < 0) {
>> +            logger(-1, errno, "Could not allocate device string\n");
> 
> we don't need \n for logger() calls.
> 
Will remove, but the most important question here is the behavioral
question. Is it okay to proceed ?

>>   static int ct_env_create(struct arg_start *arg)
>>   {
>>   @@ -233,6 +348,8 @@ static int ct_env_create(struct arg_start *arg)
>>       int ret;
>>       char procpath[STR_SIZE];
>>       char ctpath[STR_SIZE];
>> +    int userns_p[2];
>> +    int err;
>>         stack_size = get_pagesize();
>>       if (stack_size < 0)
>> @@ -272,17 +389,58 @@ static int ct_env_create(struct arg_start *arg)
>>        * Belong in the setup phase
>>        */
>>       clone_flags = SIGCHLD;
>> -    /* FIXME: USERNS is still work in progress */
>>       clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC;
>>       clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
>>   +    if (!arg->h->can_join_userns) {
>> +        logger(-1, 0, "WARNING: Running container unprivileged.
>> USER_NS not supported");
>> +
>> +        userns_p[0] = userns_p[1] = -1;
>> +    } else {
>> +        clone_flags |= CLONE_NEWUSER;
>> +        if (pipe(userns_p) < 0) {
>> +            logger(-1, errno, "Can not create userns pipe");
>> +            return VZ_RESOURCE_ERROR;
>> +        }
>> +    }
>> +    arg->userns_p = userns_p[0];
>> +
>> +    if (arg->h->can_join_userns) {
>> +        create_devices(arg->h, arg->veid, arg->res->fs.root);
>> +    }
>> +
>>       ret = clone(_env_create, child_stack, clone_flags, arg);
>>       if (ret  < 0) {
>>           logger(-1, errno, "Unable to clone");
>>           /* FIXME: remove ourselves from container first */
>>           destroy_container(arg->veid);
>>           return VZ_RESOURCE_ERROR;
>> +    } else if (arg->h->can_join_userns) {
> 
> You don't need "else" here, do you?
> 
Not the else, but I do need to following "if", so I tend to think it is
just cleaner this way.

>> +        /*
>> +         * Now we need to write to the mapping file. It has to be us,
>> +         * since CAP_SETUID is required in the parent namespace. vzctl
>> +         * is run as root, so we should have it. But our cloned kid
>> +         * will start as the overflow uid 65534 in the new namespace.
>> +         */
>> +        if (write_uid_gid_mapping(arg->h, *arg->res->misc.local_uid,
>> +                      *arg->res->misc.local_gid, ret))
>> +            return VZ_RESOURCE_ERROR;
> 
> Do we need something else here? Maybe destroy_container(arg->veid) and
> kill the child?
I guess we could factor it in a way
> 
>> +         * since some runs will execute with the mapping already done,
>> +         * while others with the mapping off. This is particularly
>> +         * important for setuid, for instance. It will categorically
>> +         * fail if called before a mapping is in place.
>> +         */
>> +        if ((userns_p[1] != -1) &&
>> +            write(userns_p[1], &err, sizeof(err)) != sizeof(err)) {
>> +            logger(-1, errno, "Unable to read from userns pipe");
> 
> s/read from/write to/
> 
> and if write() fails we do not close fd
> 
>> +            return VZ_RESOURCE_ERROR;
>> +        }
>>       }
>> +    close(userns_p[1]); /* other end closed by child */
> 
> Are you sure? This is what clone(2) man page says:
> 
> 
I never meant that this is closed automatically by some clone behavior.
It should be closed explicitly by the child. But I see what you mean now.

> Essentially you have two sets of fds after clone(), and you need to
> close all them separately. Usually we do close the end we don't need
> right after clone() and the other one after use.
> 
> Same in child (i.e. _env_create()) -- we should close one end right at
> the start, and the other end after using it. That, I guess, means, that
> you have to pass both fds to the child -- the other one just for the
> sake of closing.
> 
Humm, for all the other fds, I am pretty sure we are only passing one
end to the child, no ?

Also, I know it is good to be explicit, but we do close all fds in the
child before executing, through close_fds(). So I don't think it is
worthy to clutter the parameters further just so we can close them
explicitly.