[CRIU] [PATCH 2/7] cgroup: add support for cgroup namespaces
Tycho Andersen
tycho.andersen at canonical.com
Sun Feb 21 08:29:15 PST 2016
On Fri, Feb 19, 2016 at 09:09:59PM +0300, Pavel Emelyanov wrote:
>
> >>> {
> >>> int i;
> >>>
> >>> @@ -1023,6 +1030,26 @@ static int move_in_cgroup(CgSetEntry *se)
> >>>
> >>> aux_off = ctrl_dir_and_opt(ctrl, aux, sizeof(aux), NULL, 0);
> >>>
> >>> + if (do_cgns_set && ce->cgns_prefix) {
> >>> + pr_info("setting cgns prefix to %s\n", ce->cgns_prefix);
> >>> + snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->cgns_prefix);
> >>> + if (userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1) < 0) {
> >>
> >> Why do we need to do move here? The 2nd would occur several lines below anyway.
> >
> > Consider a task in a cgroup (according to the host):
> >
> > /unsprefix/insidecontainer
> >
> > If the task first moved itself into /unsprefix, then did unshare(),
> > when the task examines its own /proc/self/cgroup file it will see /,
> > but to the host it is really in /unsprefix. Then if it further enters
> > /insidecontainer here, the full host path will be
> > /unsprefix/insidecontianer. There is no way to say "set the cgroup
> > namespace boundary at /unsprefix" without first entering that, doing
> > the unshare, and then entering the rest of the path.
>
> Ah! I see. The unshare pins root of the task's namespace to those that
> task currently lives in. Would you add this as code comment here, please?
Yep, np.
> >>> + pr_perror("couldn't set cgns prefix %s", aux);
> >>> + return -1;
> >>> + }
> >>> +
> >>> + if (unshare(CLONE_NEWCGROUP) < 0) {
> >>> + pr_perror("couldn't unshare cgns");
> >>> + return -1;
> >>> + }
> >>> + }
> >>> +
> >>> + /* Note that unshare(CLONE_NEWCGROUP) doesn't change the view
> >>> + * of previously mounted cgroupfses; since we're restoring via
> >>> + * a dirfd pointing to the cg yard set up by when criu was in
> >>> + * the root cgns, we still want to use the full path here when
> >>> + * we move into the cgroup.
> >>> + */
> >>> snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->path);
> >>> pr_debug(" `-> %s\n", aux);
> >>> err = userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1);
> >>
> >>> @@ -727,8 +728,23 @@ static int dump_task_core_all(struct parasite_ctl *ctl,
> >>> if (ret)
> >>> goto err;
> >>>
> >>> + /* If this is the root task and it has a cgroup ns id, it could be in
> >>> + * a cgroup namespace and we should try to figure out the prefix. Or,
> >>> + * if the task is not the parent task and its cgroup namespace differs
> >>> + * from its parent's, this is a nested cgns and we should compute the
> >>> + * prefix.
> >>> + */
> >>> + if ((!item->parent && item->ids->has_cgroup_ns_id) ||
> >>
> >> The has_cgroup_ns_id is always true on dump.
> >
> > Only if /proc/self/ns/cgroup exists, on older kernels it will be false
> > and so we don't need to bother with checking the task's
> > /proc/self/cgroup from inside.
>
> But this hunk
>
> @@ -488,6 +494,13 @@ int dump_task_ns_ids(struct pstree_item *item)
> return -1;
> }
>
> + ids->has_cgroup_ns_id = true;
> + ids->cgroup_ns_id = get_ns_id(pid, &cgroup_ns_desc);
> + if (!ids->cgroup_ns_id) {
> + pr_err("Can't make cgroup id\n");
> + return -1;
> + }
> +
> return 0;
> }
>
> always sets it to true. What am I missing?
Urgh. I think that hunk should only set it to true if there actually
is a cgroup namespace. I'll fix it in the next version.
> >>> + (item->parent->ids->has_cgroup_ns_id &&
> >>> + item->ids->cgroup_ns_id != item->parent->ids->cgroup_ns_id)) {
> >>> + info = &cgroup_args;
> >>> + ret = parasite_dump_cgroup(ctl, &cgroup_args);
> >>> + if (ret)
> >>> + goto err;
> >>> + }
> >>> +
> >>> core->tc->has_cg_set = true;
> >>> - ret = dump_task_cgroup(item, &core->tc->cg_set);
> >>> + ret = dump_task_cgroup(item, &core->tc->cg_set, info);
> >>> if (ret)
> >>> goto err;
> >>>
> >>
> >>> @@ -245,6 +246,16 @@ struct parasite_tty_args {
> >>> int st_excl;
> >>> };
> >>>
> >>> +struct parasite_dump_cgroup_args {
> >>> + /* We choose PAGE_SIZE here since that's how big parasite messages are,
> >>> + * although this is probably longer than any /proc/pid/cgroup file will
> >>> + * ever be on most systems (4k).
> >>> + *
> >>> + * The string is null terminated.
> >>> + */
> >>> + char contents[PAGE_SIZE];
> >>
> >> This would contain the path to cgroup as task sees it. Why not keep here the
> >> path's lenght, the full path would anyway be found out when dumping cgroups
> >> themselves?
> >
> > I can do that too, mostly I just didn't want to move the cgroup
> > parsing code to pie :). This will take some goofy memory mapping
> > stuff, since we need to have a map from controller => path length, but
> > I think that's ok.
>
> Cool :)
>
> >>> +};
> >>> +
> >>> /* the parasite prefix is added by gen_offsets.sh */
> >>> #define parasite_sym(pblob, name) ((void *)(pblob) + parasite_blob_offset__##name)
> >>>
> >>
> >>> @@ -503,6 +511,41 @@ static inline int parasite_check_vdso_mark(struct parasite_vdso_vma_entry *args)
> >>> }
> >>> #endif
> >>>
> >>> +static int parasite_dump_cgroup(struct parasite_dump_cgroup_args *args)
> >>> +{
> >>> + int proc, cgroup, len;
> >>> +
> >>> + proc = get_proc_fd();
> >>
> >> Isn't it too heavyweight (mkdir + mount + open + umount) for just opening the /proc?
> >
> > How else will you do it? You need to be sure that the /proc you're
> > looking at is the right one for the task.
>
> I was thinking that cgns support is only meaningful for "true" containers
> in which /proc is always alive, so we can just open it and work. The get_proc_fd
> is for screwed environments w/o /proc mountpoint at all.
Right, but in the common case we won't do mkdir/mount/open, because
the /proc for the container is valid.
> >>> + ret = -1;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + tmp = *pos;
> >>> + *pos = '\0';
> >>> + ext->cgns_prefix = xstrdup(ext->path);
> >>> + if (!ext->cgns_prefix) {
> >>> + ret = -1;
> >>> + goto out;
> >>> + }
> >>> + *pos = tmp;
> >>> + }
> >>> + }
> >>
> >> Erm... I'm not sure I fully understand what's going on in this routine.
> >> Would you comment on it please?
> >
> > Sure, I can add some comments. Something like:
> >
> > /* Here's where we actually compute the cgns prefix. Consider a task
> > * in /foo/bar which has unshared its namespace at /foo. The internal
> > * path is /bar, but the external path is /foo/bar, and the cgns
> > * prefix is /foo. The algorithm is:
> > *
> > * // no cg ns unshare in this case
> > * if (internal == external)
> > * continue;
> > * idx = find_suffix_pos(external, internal)
> > * cgns_prefix = external[:idx]
> > */
> >
> > Does that make sense?
>
> Yes, please :)
>
> We have working day tomorrow, so if you send the updated set within 24 hours
> it will be easier for me to apply one :)
Bah, sorry. I was out Friday + Saturday. I'll be around today and
tomorrow, though.
Tycho
More information about the CRIU
mailing list