[CRIU] [PATCH 2/7] cgroup: add support for cgroup namespaces
Pavel Emelyanov
xemul at virtuozzo.com
Fri Feb 19 10:09:59 PST 2016
>>> {
>>> int i;
>>>
>>> @@ -1023,6 +1030,26 @@ static int move_in_cgroup(CgSetEntry *se)
>>>
>>> aux_off = ctrl_dir_and_opt(ctrl, aux, sizeof(aux), NULL, 0);
>>>
>>> + if (do_cgns_set && ce->cgns_prefix) {
>>> + pr_info("setting cgns prefix to %s\n", ce->cgns_prefix);
>>> + snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->cgns_prefix);
>>> + if (userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1) < 0) {
>>
>> Why do we need to do the move here? The 2nd one would occur several lines below anyway.
>
> Consider a task in a cgroup (according to the host):
>
> /unsprefix/insidecontainer
>
> If the task first moved itself into /unsprefix, then did unshare(),
> when the task examines its own /proc/self/cgroup file it will see /,
> but to the host it is really in /unsprefix. Then if it further enters
> /insidecontainer here, the full host path will be
> /unsprefix/insidecontainer. There is no way to say "set the cgroup
> namespace boundary at /unsprefix" without first entering that, doing
> the unshare, and then entering the rest of the path.
Ah! I see. The unshare pins the root of the task's cgroup namespace to the cgroup the
task currently lives in. Would you add this as a code comment here, please?
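
For illustration, a minimal standalone sketch of the move/unshare/move sequence
being discussed. The freezer mount point, the cgroup paths and the plain write(2)
to the tasks file are stand-ins for CRIU's real userns_call(userns_move, ...)
plumbing, not the patch's actual code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int enter_cgroup(const char *tasks_path)
{
	char pid[16];
	int fd, len;

	fd = open(tasks_path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = snprintf(pid, sizeof(pid), "%d", getpid());
	if (write(fd, pid, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* 1. Enter the prefix as the host sees it. */
	if (enter_cgroup("/sys/fs/cgroup/freezer/unsprefix/tasks"))
		return 1;

	/* 2. unshare() pins the namespace root at /unsprefix, so the
	 *    task's /proc/self/cgroup now shows "/". */
	if (unshare(CLONE_NEWCGROUP))
		return 1;

	/* 3. Enter the rest of the path: the host sees
	 *    /unsprefix/insidecontainer, the task sees /insidecontainer. */
	return enter_cgroup("/sys/fs/cgroup/freezer/unsprefix/insidecontainer/tasks");
}
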
>>> + pr_perror("couldn't set cgns prefix %s", aux);
>>> + return -1;
>>> + }
>>> +
>>> + if (unshare(CLONE_NEWCGROUP) < 0) {
>>> + pr_perror("couldn't unshare cgns");
>>> + return -1;
>>> + }
>>> + }
>>> +
>>> + /* Note that unshare(CLONE_NEWCGROUP) doesn't change the view
>>> + * of previously mounted cgroupfses; since we're restoring via
>>> + * a dirfd pointing to the cg yard set up when criu was in
>>> + * the root cgns, we still want to use the full path here when
>>> + * we move into the cgroup.
>>> + */
>>> snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->path);
>>> pr_debug(" `-> %s\n", aux);
>>> err = userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1);
>>
>>> @@ -727,8 +728,23 @@ static int dump_task_core_all(struct parasite_ctl *ctl,
>>> if (ret)
>>> goto err;
>>>
>>> + /* If this is the root task and it has a cgroup ns id, it could be in
>>> + * a cgroup namespace and we should try to figure out the prefix. Or,
>>> + * if the task is not the root task and its cgroup namespace differs
>>> + * from its parent's, this is a nested cgns and we should compute the
>>> + * prefix.
>>> + */
>>> + if ((!item->parent && item->ids->has_cgroup_ns_id) ||
>>
>> The has_cgroup_ns_id is always true on dump.
>
> Only if /proc/self/ns/cgroup exists; on older kernels it will be false,
> so we don't need to bother checking the task's /proc/self/cgroup from
> the inside.
But this hunk
@@ -488,6 +494,13 @@ int dump_task_ns_ids(struct pstree_item *item)
return -1;
}
+ ids->has_cgroup_ns_id = true;
+ ids->cgroup_ns_id = get_ns_id(pid, &cgroup_ns_desc);
+ if (!ids->cgroup_ns_id) {
+ pr_err("Can't make cgroup id\n");
+ return -1;
+ }
+
return 0;
}
always sets it to true. What am I missing?
>>> + (item->parent->ids->has_cgroup_ns_id &&
>>> + item->ids->cgroup_ns_id != item->parent->ids->cgroup_ns_id)) {
>>> + info = &cgroup_args;
>>> + ret = parasite_dump_cgroup(ctl, &cgroup_args);
>>> + if (ret)
>>> + goto err;
>>> + }
>>> +
>>> core->tc->has_cg_set = true;
>>> - ret = dump_task_cgroup(item, &core->tc->cg_set);
>>> + ret = dump_task_cgroup(item, &core->tc->cg_set, info);
>>> if (ret)
>>> goto err;
>>>
>>
>>> @@ -245,6 +246,16 @@ struct parasite_tty_args {
>>> int st_excl;
>>> };
>>>
>>> +struct parasite_dump_cgroup_args {
>>> + /* We choose PAGE_SIZE here since that's how big parasite messages are,
>>> + * although this is probably longer than any /proc/pid/cgroup file will
>>> + * ever be on most systems (4k).
>>> + *
>>> + * The string is null terminated.
>>> + */
>>> + char contents[PAGE_SIZE];
>>
>> This would contain the path to the cgroup as the task sees it. Why not keep
>> the path's length here? The full path would be found out anyway when dumping
>> the cgroups themselves.
>
> I can do that too, mostly I just didn't want to move the cgroup
> parsing code to pie :). This will take some goofy memory mapping
> stuff, since we need to have a map from controller => path length, but
> I think that's ok.
Cool :)
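
Roughly, the controller => path length map could look something like this
(purely illustrative field and struct names, not part of the patch):

/* Hypothetical layout for passing per-controller cgroup path lengths
 * out of the parasite instead of the raw /proc/self/cgroup contents.
 */
struct cg_ctl_path_len {
	char		name[64];	/* controller name, e.g. "freezer" */
	unsigned int	path_len;	/* length of the task's path in it */
};

struct parasite_dump_cgroup_lens {
	unsigned int		n_ctls;
	struct cg_ctl_path_len	ctls[16];
};
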
>>> +};
>>> +
>>> /* the parasite prefix is added by gen_offsets.sh */
>>> #define parasite_sym(pblob, name) ((void *)(pblob) + parasite_blob_offset__##name)
>>>
>>
>>> @@ -503,6 +511,41 @@ static inline int parasite_check_vdso_mark(struct parasite_vdso_vma_entry *args)
>>> }
>>> #endif
>>>
>>> +static int parasite_dump_cgroup(struct parasite_dump_cgroup_args *args)
>>> +{
>>> + int proc, cgroup, len;
>>> +
>>> + proc = get_proc_fd();
>>
>> Isn't it too heavyweight (mkdir + mount + open + umount) for just opening the /proc?
>
> How else will you do it? You need to be sure that the /proc you're
> looking at is the right one for the task.
I was thinking that cgns support is only meaningful for "true" containers
in which /proc is always alive, so we can just open it and work. The get_proc_fd
is for screwed environments w/o /proc mountpoint at all.
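
In other words, something along these lines for the "true container" case
(a sketch using ordinary libc calls; the real parasite code would have to use
the sys_* wrappers and still needs a fallback for environments without /proc):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int read_own_cgroup(char *buf, size_t len)
{
	int proc, cg;
	ssize_t n;

	/* Assume the container has /proc mounted; no mkdir/mount/umount dance. */
	proc = open("/proc", O_RDONLY | O_DIRECTORY);
	if (proc < 0)
		return -1;

	cg = openat(proc, "self/cgroup", O_RDONLY);
	close(proc);
	if (cg < 0)
		return -1;

	n = read(cg, buf, len - 1);
	close(cg);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return 0;
}
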
>>> + ret = -1;
>>> + goto out;
>>> + }
>>> +
>>> + tmp = *pos;
>>> + *pos = '\0';
>>> + ext->cgns_prefix = xstrdup(ext->path);
>>> + if (!ext->cgns_prefix) {
>>> + ret = -1;
>>> + goto out;
>>> + }
>>> + *pos = tmp;
>>> + }
>>> + }
>>
>> Erm... I'm not sure I fully understand what's going on in this routine.
>> Would you comment on it please?
>
> Sure, I can add some comments. Something like:
>
> /* Here's where we actually compute the cgns prefix. Consider a task
> * in /foo/bar which has unshared its namespace at /foo. The internal
> * path is /bar, but the external path is /foo/bar, and the cgns
> * prefix is /foo. The algorithm is:
> *
> * // no cg ns unshare in this case
> * if (internal == external)
> * continue;
> * idx = find_suffix_pos(external, internal)
> * cgns_prefix = external[:idx]
> */
>
> Does that make sense?
Yes, please :)
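
As a standalone illustration of that algorithm (helper names are made up; the
real code works on the already-parsed external path and the parasite-reported
internal path for each controller):

#include <stdio.h>
#include <string.h>

/* Given the external (host) path and the internal (in-namespace) path,
 * the cgns prefix is the external path with the internal suffix cut off.
 * Returns 1 if a prefix was computed, 0 if there was no cgns unshare,
 * -1 if internal is not a suffix of external.
 */
static int compute_cgns_prefix(const char *external, const char *internal,
			       char *prefix, size_t len)
{
	size_t elen = strlen(external), ilen = strlen(internal);

	/* No cg ns unshare: the task sees the same path the host does. */
	if (elen == ilen && !strcmp(external, internal))
		return 0;

	if (ilen > elen || strcmp(external + elen - ilen, internal))
		return -1;

	snprintf(prefix, len, "%.*s", (int)(elen - ilen), external);
	return 1;
}

int main(void)
{
	char prefix[64];

	/* external /foo/bar, internal /bar -> cgns prefix /foo */
	if (compute_cgns_prefix("/foo/bar", "/bar", prefix, sizeof(prefix)) > 0)
		printf("cgns prefix: %s\n", prefix);
	return 0;
}
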
We have a working day tomorrow, so if you send the updated set within 24 hours
it will be easier for me to apply it :)
-- Pavel