[CRIU] [PATCH 2/7] cgroup: add support for cgroup namespaces
Pavel Emelyanov
xemul at virtuozzo.com
Fri Feb 19 10:09:59 PST 2016
>>> {
>>> int i;
>>>
>>> @@ -1023,6 +1030,26 @@ static int move_in_cgroup(CgSetEntry *se)
>>>
>>> aux_off = ctrl_dir_and_opt(ctrl, aux, sizeof(aux), NULL, 0);
>>>
>>> + if (do_cgns_set && ce->cgns_prefix) {
>>> + pr_info("setting cgns prefix to %s\n", ce->cgns_prefix);
>>> + snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->cgns_prefix);
>>> + if (userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1) < 0) {
>>
>> Why do we need to do the move here? The 2nd one would occur several lines below anyway.
>
> Consider a task in a cgroup (according to the host):
>
> /unsprefix/insidecontainer
>
> If the task first moved itself into /unsprefix, then did unshare(),
> when the task examines its own /proc/self/cgroup file it will see /,
> but to the host it is really in /unsprefix. Then if it further enters
> /insidecontainer here, the full host path will be
> /unsprefix/insidecontainer. There is no way to say "set the cgroup
> namespace boundary at /unsprefix" without first entering that, doing
> the unshare, and then entering the rest of the path.
Ah! I see. The unshare pins the root of the task's cgroup namespace to the cgroup the
task currently lives in. Would you add this as a code comment here, please?
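
For illustration, a minimal standalone sketch of the move/unshare/move sequence
being discussed. The freezer mount point, the cgroup paths and the plain write(2)
to the tasks file are stand-ins for CRIU's real userns_call(userns_move, ...)
plumbing, not the patch's actual code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int enter_cgroup(const char *tasks_path)
{
	char pid[16];
	int fd, len;

	fd = open(tasks_path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = snprintf(pid, sizeof(pid), "%d", getpid());
	if (write(fd, pid, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* 1. Enter the prefix as the host sees it. */
	if (enter_cgroup("/sys/fs/cgroup/freezer/unsprefix/tasks"))
		return 1;

	/* 2. unshare() pins the namespace root at /unsprefix, so the
	 *    task's /proc/self/cgroup now shows "/". */
	if (unshare(CLONE_NEWCGROUP))
		return 1;

	/* 3. Enter the rest of the path: the host sees
	 *    /unsprefix/insidecontainer, the task sees /insidecontainer. */
	return enter_cgroup("/sys/fs/cgroup/freezer/unsprefix/insidecontainer/tasks");
}
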
>>> + pr_perror("couldn't set cgns prefix %s", aux);
>>> + return -1;
>>> + }
>>> +
>>> + if (unshare(CLONE_NEWCGROUP) < 0) {
>>> + pr_perror("couldn't unshare cgns");
>>> + return -1;
>>> + }
>>> + }
>>> +
>>> + /* Note that unshare(CLONE_NEWCGROUP) doesn't change the view
>>> + * of previously mounted cgroupfses; since we're restoring via
>>> + * a dirfd pointing to the cg yard set up when criu was in
>>> + * the root cgns, we still want to use the full path here when
>>> + * we move into the cgroup.
>>> + */
>>> snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->path);
>>> pr_debug(" `-> %s\n", aux);
>>> err = userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1);
>>
>>> @@ -727,8 +728,23 @@ static int dump_task_core_all(struct parasite_ctl *ctl,
>>> if (ret)
>>> goto err;
>>>
>>> + /* If this is the root task and it has a cgroup ns id, it could be in
>>> + * a cgroup namespace and we should try to figure out the prefix. Or,
>>> + * if the task is not the root task and its cgroup namespace differs
>>> + * from its parent's, this is a nested cgns and we should compute the
>>> + * prefix.
>>> + */
>>> + if ((!item->parent && item->ids->has_cgroup_ns_id) ||
>>
>> The has_cgroup_ns_id is always true on dump.
>
> Only if /proc/self/ns/cgroup exists; on older kernels it will be false,
> so we don't need to bother checking the task's /proc/self/cgroup from
> the inside.
But this hunk
@@ -488,6 +494,13 @@ int dump_task_ns_ids(struct pstree_item *item)
return -1;
}
+ ids->has_cgroup_ns_id = true;
+ ids->cgroup_ns_id = get_ns_id(pid, &cgroup_ns_desc);
+ if (!ids->cgroup_ns_id) {
+ pr_err("Can't make cgroup id\n");
+ return -1;
+ }
+
return 0;
}
always sets it to true. What am I missing?
>>> + (item->parent->ids->has_cgroup_ns_id &&
>>> + item->ids->cgroup_ns_id != item->parent->ids->cgroup_ns_id)) {
>>> + info = &cgroup_args;
>>> + ret = parasite_dump_cgroup(ctl, &cgroup_args);
>>> + if (ret)
>>> + goto err;
>>> + }
>>> +
>>> core->tc->has_cg_set = true;
>>> - ret = dump_task_cgroup(item, &core->tc->cg_set);
>>> + ret = dump_task_cgroup(item, &core->tc->cg_set, info);
>>> if (ret)
>>> goto err;
>>>
>>
>>> @@ -245,6 +246,16 @@ struct parasite_tty_args {
>>> int st_excl;
>>> };
>>>
>>> +struct parasite_dump_cgroup_args {
>>> + /* We choose PAGE_SIZE here since that's how big parasite messages are,
>>> + * although this is probably longer than any /proc/pid/cgroup file will
>>> + * ever be on most systems (4k).
>>> + *
>>> + * The string is null terminated.
>>> + */
>>> + char contents[PAGE_SIZE];
>>
>> This would contain the path to the cgroup as the task sees it. Why not keep
>> the path's length here? The full path would be found out anyway when dumping
>> the cgroups themselves.
>
> I can do that too, mostly I just didn't want to move the cgroup
> parsing code to pie :). This will take some goofy memory mapping
> stuff, since we need to have a map from controller => path length, but
> I think that's ok.
Cool :)
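
Roughly, the controller => path length map could look something like this
(purely illustrative field and struct names, not part of the patch):

/* Hypothetical layout for passing per-controller cgroup path lengths
 * out of the parasite instead of the raw /proc/self/cgroup contents.
 */
struct cg_ctl_path_len {
	char		name[64];	/* controller name, e.g. "freezer" */
	unsigned int	path_len;	/* length of the task's path in it */
};

struct parasite_dump_cgroup_lens {
	unsigned int		n_ctls;
	struct cg_ctl_path_len	ctls[16];
};
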
>>> +};
>>> +
>>> /* the parasite prefix is added by gen_offsets.sh */
>>> #define parasite_sym(pblob, name) ((void *)(pblob) + parasite_blob_offset__##name)
>>>
>>
>>> @@ -503,6 +511,41 @@ static inline int parasite_check_vdso_mark(struct parasite_vdso_vma_entry *args)
>>> }
>>> #endif
>>>
>>> +static int parasite_dump_cgroup(struct parasite_dump_cgroup_args *args)
>>> +{
>>> + int proc, cgroup, len;
>>> +
>>> + proc = get_proc_fd();
>>
>> Isn't it too heavyweight (mkdir + mount + open + umount) for just opening the /proc?
>
> How else will you do it? You need to be sure that the /proc you're
> looking at is the right one for the task.
I was thinking that cgns support is only meaningful for "true" containers
in which /proc is always alive, so we can just open it and work. The get_proc_fd
is for screwed environments w/o /proc mountpoint at all.
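
In other words, something along these lines for the "true container" case
(a sketch using ordinary libc calls; the real parasite code would have to use
the sys_* wrappers and still needs a fallback for environments without /proc):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int read_own_cgroup(char *buf, size_t len)
{
	int proc, cg;
	ssize_t n;

	/* Assume the container has /proc mounted; no mkdir/mount/umount dance. */
	proc = open("/proc", O_RDONLY | O_DIRECTORY);
	if (proc < 0)
		return -1;

	cg = openat(proc, "self/cgroup", O_RDONLY);
	close(proc);
	if (cg < 0)
		return -1;

	n = read(cg, buf, len - 1);
	close(cg);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return 0;
}
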
>>> + ret = -1;
>>> + goto out;
>>> + }
>>> +
>>> + tmp = *pos;
>>> + *pos = '\0';
>>> + ext->cgns_prefix = xstrdup(ext->path);
>>> + if (!ext->cgns_prefix) {
>>> + ret = -1;
>>> + goto out;
>>> + }
>>> + *pos = tmp;
>>> + }
>>> + }
>>
>> Erm... I'm not sure I fully understand what's going on in this routine.
>> Would you comment on it please?
>
> Sure, I can add some comments. Something like:
>
> /* Here's where we actually compute the cgns prefix. Consider a task
> * in /foo/bar which has unshared its namespace at /foo. The internal
> * path is /bar, but the external path is /foo/bar, and the cgns
> * prefix is /foo. The algorithm is:
> *
> * // no cg ns unshare in this case
> * if (internal == external)
> * continue;
> * idx = find_suffix_pos(external, internal)
> * cgns_prefix = external[:idx]
> */
>
> Does that make sense?
Yes, please :)
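
As a standalone illustration of that algorithm (helper names are made up; the
real code works on the already-parsed external path and the parasite-reported
internal path for each controller):

#include <stdio.h>
#include <string.h>

/* Given the external (host) path and the internal (in-namespace) path,
 * the cgns prefix is the external path with the internal suffix cut off.
 * Returns 1 if a prefix was computed, 0 if there was no cgns unshare,
 * -1 if internal is not a suffix of external.
 */
static int compute_cgns_prefix(const char *external, const char *internal,
			       char *prefix, size_t len)
{
	size_t elen = strlen(external), ilen = strlen(internal);

	/* No cg ns unshare: the task sees the same path the host does. */
	if (elen == ilen && !strcmp(external, internal))
		return 0;

	if (ilen > elen || strcmp(external + elen - ilen, internal))
		return -1;

	snprintf(prefix, len, "%.*s", (int)(elen - ilen), external);
	return 1;
}

int main(void)
{
	char prefix[64];

	/* external /foo/bar, internal /bar -> cgns prefix /foo */
	if (compute_cgns_prefix("/foo/bar", "/bar", prefix, sizeof(prefix)) > 0)
		printf("cgns prefix: %s\n", prefix);
	return 0;
}
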
We have a working day tomorrow, so if you send the updated set within 24 hours
it will be easier for me to apply it :)
-- Pavel