[CRIU] [PATCH 2/7] cgroup: add support for cgroup namespaces

Sun Feb 21 08:29:15 PST 2016

On Fri, Feb 19, 2016 at 09:09:59PM +0300, Pavel Emelyanov wrote:
> 
> >>>  {
> >>>  	int i;
> >>>  
> >>> @@ -1023,6 +1030,26 @@ static int move_in_cgroup(CgSetEntry *se)
> >>>  
> >>>  		aux_off = ctrl_dir_and_opt(ctrl, aux, sizeof(aux), NULL, 0);
> >>>  
> >>> +		if (do_cgns_set && ce->cgns_prefix) {
> >>> +			pr_info("setting cgns prefix to %s\n", ce->cgns_prefix);
> >>> +			snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->cgns_prefix);
> >>> +			if (userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1) < 0) {
> >>
> >> Why do we need to do move here? The 2nd would occur several lines below anyway.
> > 
> > Consider a task in a cgroup (according to the host):
> > 
> > /unsprefix/insidecontainer
> > 
> > If the task first moved itself into /unsprefix, then did unshare(),
> > when the task examines its own /proc/self/cgroup file it will see /,
> > but to the host it is really in /unsprefix. Then if it further enters
> > /insidecontainer here, the full host path will be
> > /unsprefix/insidecontianer. There is no way to say "set the cgroup
> > namespace boundary at /unsprefix" without first entering that, doing
> > the unshare, and then entering the rest of the path.
> 
> Ah! I see. The unshare pins root of the task's namespace to those that
> task currently lives in. Would you add this as code comment here, please?

Yep, np.

> >>> +				pr_perror("couldn't set cgns prefix %s", aux);
> >>> +				return -1;
> >>> +			}
> >>> +
> >>> +			if (unshare(CLONE_NEWCGROUP) < 0) {
> >>> +				pr_perror("couldn't unshare cgns");
> >>> +				return -1;
> >>> +			}
> >>> +		}
> >>> +
> >>> +		/* Note that unshare(CLONE_NEWCGROUP) doesn't change the view
> >>> +		 * of previously mounted cgroupfses; since we're restoring via
> >>> +		 * a dirfd pointing to the cg yard set up by when criu was in
> >>> +		 * the root cgns, we still want to use the full path here when
> >>> +		 * we move into the cgroup.
> >>> +		 */
> >>>  		snprintf(aux + aux_off, sizeof(aux) - aux_off, "/%s/tasks", ce->path);
> >>>  		pr_debug("  `-> %s\n", aux);
> >>>  		err = userns_call(userns_move, UNS_ASYNC, aux, strlen(aux) + 1, -1);
> >>
> >>> @@ -727,8 +728,23 @@ static int dump_task_core_all(struct parasite_ctl *ctl,
> >>>  	if (ret)
> >>>  		goto err;
> >>>  
> >>> +	/* If this is the root task and it has a cgroup ns id, it could be in
> >>> +	 * a cgroup namespace and we should try to figure out the prefix. Or,
> >>> +	 * if the task is not the parent task and its cgroup namespace differs
> >>> +	 * from its parent's, this is a nested cgns and we should compute the
> >>> +	 * prefix.
> >>> +	 */
> >>> +	if ((!item->parent && item->ids->has_cgroup_ns_id) ||
> >>
> >> The has_cgroup_ns_id is always true on dump.
> > 
> > Only if /proc/self/ns/cgroup exists, on older kernels it will be false
> > and so we don't need to bother with checking the task's
> > /proc/self/cgroup from inside.
> 
> But this hunk
> 
> @@ -488,6 +494,13 @@ int dump_task_ns_ids(struct pstree_item *item)
>                 return -1;
>         }
> 
> +       ids->has_cgroup_ns_id = true;
> +       ids->cgroup_ns_id = get_ns_id(pid, &cgroup_ns_desc);
> +       if (!ids->cgroup_ns_id) {
> +               pr_err("Can't make cgroup id\n");
> +               return -1;
> +       }
> +
>         return 0;
>  }
> 
> always sets it to true. What am I missing?

Urgh. I think that hunk should only set it to true if there actually
is a cgroup namespace. I'll fix it in the next version.

> >>> +			(item->parent->ids->has_cgroup_ns_id &&
> >>> +			 item->ids->cgroup_ns_id != item->parent->ids->cgroup_ns_id)) {
> >>> +		info = &cgroup_args;
> >>> +		ret = parasite_dump_cgroup(ctl, &cgroup_args);
> >>> +		if (ret)
> >>> +			goto err;
> >>> +	}
> >>> +
> >>>  	core->tc->has_cg_set = true;
> >>> -	ret = dump_task_cgroup(item, &core->tc->cg_set);
> >>> +	ret = dump_task_cgroup(item, &core->tc->cg_set, info);
> >>>  	if (ret)
> >>>  		goto err;
> >>>  
> >>
> >>> @@ -245,6 +246,16 @@ struct parasite_tty_args {
> >>>  	int	st_excl;
> >>>  };
> >>>  
> >>> +struct parasite_dump_cgroup_args {
> >>> +	/* We choose PAGE_SIZE here since that's how big parasite messages are,
> >>> +	 * although this is probably longer than any /proc/pid/cgroup file will
> >>> +	 * ever be on most systems (4k).
> >>> +	 *
> >>> +	 * The string is null terminated.
> >>> +	 */
> >>> +	char contents[PAGE_SIZE];
> >>
> >> This would contain the path to cgroup as task sees it. Why not keep here the
> >> path's lenght, the full path would anyway be found out when dumping cgroups
> >> themselves?
> > 
> > I can do that too, mostly I just didn't want to move the cgroup
> > parsing code to pie :). This will take some goofy memory mapping
> > stuff, since we need to have a map from controller => path length, but
> > I think that's ok.
> 
> Cool :)
> 
> >>> +};
> >>> +
> >>>  /* the parasite prefix is added by gen_offsets.sh */
> >>>  #define parasite_sym(pblob, name) ((void *)(pblob) + parasite_blob_offset__##name)
> >>>  
> >>
> >>> @@ -503,6 +511,41 @@ static inline int parasite_check_vdso_mark(struct parasite_vdso_vma_entry *args)
> >>>  }
> >>>  #endif
> >>>  
> >>> +static int parasite_dump_cgroup(struct parasite_dump_cgroup_args *args)
> >>> +{
> >>> +	int proc, cgroup, len;
> >>> +
> >>> +	proc = get_proc_fd();
> >>
> >> Isn't it too heavyweight (mkdir + mount + open + umount) for just opening the /proc?
> > 
> > How else will you do it? You need to be sure that the /proc you're
> > looking at is the right one for the task.
> 
> I was thinking that cgns support is only meaningful for "true" containers
> in which /proc is always alive, so we can just open it and work. The get_proc_fd
> is for screwed environments w/o /proc mountpoint at all.

Right, but in the common case we won't do mkdir/mount/open, because
the /proc for the container is valid.

> >>> +				ret = -1;
> >>> +				goto out;
> >>> +			}
> >>> +
> >>> +			tmp = *pos;
> >>> +			*pos = '\0';
> >>> +			ext->cgns_prefix = xstrdup(ext->path);
> >>> +			if (!ext->cgns_prefix) {
> >>> +				ret = -1;
> >>> +				goto out;
> >>> +			}
> >>> +			*pos = tmp;
> >>> +		}
> >>> +	}
> >>
> >> Erm... I'm not sure I fully understand what's going on in this routine.
> >> Would you comment on it please?
> > 
> > Sure, I can add some comments. Something like:
> > 
> > /* Here's where we actually compute the cgns prefix. Consider a task
> >  * in /foo/bar which has unshared its namespace at /foo. The internal
> >  * path is /bar, but the external path is /foo/bar, and the cgns
> >  * prefix is /foo. The algorithm is:
> >  *
> >  * // no cg ns unshare in this case
> >  * if (internal == external)
> >  *   continue;
> >  * idx = find_suffix_pos(external, internal)
> >  * cgns_prefix = external[:idx]
> >  */
> > 
> > Does that make sense?
> 
> Yes, please :)
> 
> We have working day tomorrow, so if you send the updated set within 24 hours
> it will be easier for me to apply one :)

Bah, sorry. I was out Friday + Saturday. I'll be around today and
tomorrow, though.

Tycho