[Devel] [PATCH v3 4/9] user namespace support for upstream containers
Glauber Costa
glommer at parallels.com
Tue May 14 00:32:25 PDT 2013
On 05/14/2013 07:02 AM, Kir Kolyshkin wrote:
> Oh my, four cases of whitespace-at-eol which I had to fixed manually.
>
Sorry about that. I think I got to used to checkpatch and the such that
I forget to verify all of those manually.
> Wanted to apply it nevertheless and fix some things myself, but then the
> list grew out too big.
>
> Plus please see inline.
>
Here we go.
>> /* This function is there in GLIBC, but not in headers */
>> extern int pivot_root(const char * new_root, const char * put_old);
>> @@ -138,10 +140,41 @@ static int _env_create(void *data)
>> struct env_create_param3 create_param;
>> int ret;
>> - if ((ret = ct_chroot(arg->res->fs.root)))
>> + if ((arg->userns_p != -1) && (read(arg->userns_p, &ret,
>> sizeof(ret)) == 0))
>
> 1 I'd suggest comparing with sizeof(ret)
As you wish.
> 2 We can't print any errors here, can we? If we can I'd rather do it.
>
I believe logger works in here, yes.
>> + return -1;
>
> You're supposed to return VZ_* exit codes from here.
>
ok.
>> +
>> + ret = ct_chroot(arg->res->fs.root);
>> + close(arg->userns_p);
>> + /* Probably means chroot failed */
>> + if (ret)
>> return ret;
>> - if ((ret = vps_set_cap(arg->veid, &arg->res->env,
>> &arg->res->cap, 1)))
>> + if (arg->h->can_join_userns) {
>> + int fd;
>> + setuid(0);
>> + setgid(0);
>> + /*
>> + * We need the special flag "newinstance". This is a requirement
>> + * of the userns-aware implementation of devpts as of Linux 3.9.
>> + * Because of that special requirement, we do it here rather
>> than
>> + * later.
>> + */
>> + mount("devpts", "/dev/pts", "devpts", 0, "newinstance");
>
> What if mount point does not exist, or is not a directory?
>
> I think it makes sense to
> 1. try creating it, ignoring EEXIST
> 2. check return from mount()
>
Check return and then what? It is not clear for me if we should abort or
not when things go wrong. It seems perfectly valid that the container
continues. I have the same dilemma with the other devices.
>> + /* /dev/ptmx, if it even exists, would refer to the root ptmx.
>> + * We don't want that, we want our newly created instance to
>> contain
>> + * all ptys. So we bind mount the root device here
>> + */
>> + fd = open("/dev/ptmx", O_RDWR|O_CREAT, 0);
>
> You do open() to create the /dev/ptmx file in case it does not exist,
> right?
> Why RDWR then?
>
No special reason.
>> +/*
>> + * Those devices should exist in the container, and be valid device
>> nodes with
>> + * user access permission. But we need to be absolutely sure this is
>> the case,
>> + * so we will provide our own versions. That could actually happen
>> since some
>> + * distributions may come with emptied /dev's, waiting for udev to
>> populate them.
>> + * That won't happen, we do it ourselves.
>> + */
>> +static void create_devices(vps_handler *h, envid_t veid, const char
>> *root)
>> +{
>> + unsigned int i;
>> + char *devices[] = {
>> + "/dev/null",
>> + "/dev/zero",
>> + "/dev/random",
>> + "/dev/urandom",
>> + };
>> +
>> + /*
>> + * We will tolerate errors, and keep the container running,
>> because it is
>> + * likely we will be able to boot it to a barely functional
>> state. But
>> + * be vocal about it
>> + */
>> + for (i = 0; i < ARRAY_SIZE(devices); i++) {
>> + char ct_devname[STR_SIZE];
>> + int ret;
>> +
>> + ret = snprintf(ct_devname, sizeof(ct_devname), "%s/%s", root,
>> devices[i]);
>> + if (ret < 0) {
>> + logger(-1, errno, "Could not allocate device string\n");
>
> we don't need \n for logger() calls.
>
Will remove, but the most important question here is the behavioral
question. Is it okay to proceed ?
>> static int ct_env_create(struct arg_start *arg)
>> {
>> @@ -233,6 +348,8 @@ static int ct_env_create(struct arg_start *arg)
>> int ret;
>> char procpath[STR_SIZE];
>> char ctpath[STR_SIZE];
>> + int userns_p[2];
>> + int err;
>> stack_size = get_pagesize();
>> if (stack_size < 0)
>> @@ -272,17 +389,58 @@ static int ct_env_create(struct arg_start *arg)
>> * Belong in the setup phase
>> */
>> clone_flags = SIGCHLD;
>> - /* FIXME: USERNS is still work in progress */
>> clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC;
>> clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
>> + if (!arg->h->can_join_userns) {
>> + logger(-1, 0, "WARNING: Running container unprivileged.
>> USER_NS not supported");
>> +
>> + userns_p[0] = userns_p[1] = -1;
>> + } else {
>> + clone_flags |= CLONE_NEWUSER;
>> + if (pipe(userns_p) < 0) {
>> + logger(-1, errno, "Can not create userns pipe");
>> + return VZ_RESOURCE_ERROR;
>> + }
>> + }
>> + arg->userns_p = userns_p[0];
>> +
>> + if (arg->h->can_join_userns) {
>> + create_devices(arg->h, arg->veid, arg->res->fs.root);
>> + }
>> +
>> ret = clone(_env_create, child_stack, clone_flags, arg);
>> if (ret < 0) {
>> logger(-1, errno, "Unable to clone");
>> /* FIXME: remove ourselves from container first */
>> destroy_container(arg->veid);
>> return VZ_RESOURCE_ERROR;
>> + } else if (arg->h->can_join_userns) {
>
> You don't need "else" here, do you?
>
Not the else, but I do need to following "if", so I tend to think it is
just cleaner this way.
>> + /*
>> + * Now we need to write to the mapping file. It has to be us,
>> + * since CAP_SETUID is required in the parent namespace. vzctl
>> + * is run as root, so we should have it. But our cloned kid
>> + * will start as the overflow uid 65534 in the new namespace.
>> + */
>> + if (write_uid_gid_mapping(arg->h, *arg->res->misc.local_uid,
>> + *arg->res->misc.local_gid, ret))
>> + return VZ_RESOURCE_ERROR;
>
> Do we need something else here? Maybe destroy_container(arg->veid) and
> kill the child?
I guess we could factor it in a way
>
>> + * since some runs will execute with the mapping already done,
>> + * while others with the mapping off. This is particularly
>> + * important for setuid, for instance. It will categorically
>> + * fail if called before a mapping is in place.
>> + */
>> + if ((userns_p[1] != -1) &&
>> + write(userns_p[1], &err, sizeof(err)) != sizeof(err)) {
>> + logger(-1, errno, "Unable to read from userns pipe");
>
> s/read from/write to/
>
> and if write() fails we do not close fd
>
>> + return VZ_RESOURCE_ERROR;
>> + }
>> }
>> + close(userns_p[1]); /* other end closed by child */
>
> Are you sure? This is what clone(2) man page says:
>
>
I never meant that this is closed automatically by some clone behavior.
It should be closed explicitly by the child. But I see what you mean now.
> Essentially you have two sets of fds after clone(), and you need to
> close all them separately. Usually we do close the end we don't need
> right after clone() and the other one after use.
>
> Same in child (i.e. _env_create()) -- we should close one end right at
> the start, and the other end after using it. That, I guess, means, that
> you have to pass both fds to the child -- the other one just for the
> sake of closing.
>
Humm, for all the other fds, I am pretty sure we are only passing one
end to the child, no ?
Also, I know it is good to be explicit, but we do close all fds in the
child before executing, through close_fds(). So I don't think it is
worthy to clutter the parameters further just so we can close them
explicitly.
More information about the Devel
mailing list