[CRIU] [PATCH 2/2] Attempt to restore cgroups

Fri Jul 4 09:44:57 PDT 2014

Hi Pavel,

On Fri, Jul 04, 2014 at 02:56:14PM +0400, Pavel Emelyanov wrote:
> On 07/03/2014 06:23 PM, Tycho Andersen wrote:
> > Ugh, and here's one without the stupid debug statement again, sorry
> > about that.
> 
> Would you also run this patch through linux kernel's scripts/checkpatch.pl
> script. It finds ~20 coding style errors. Warnings can be ignored :)
> 
> And some more comments inline.
> 
> > During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
> > Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
> > member of, and that cgroup is traversed to find any child cgroups which may
> > also need restoring. All of this information is persisted along with the
> > original cg_sets, which indicate which cgroups a task is a member of.
> > 
> > On restore, an initial phase creates all the cgroups which were saved and
> > attempts to restore any peroperties they had. Then the tasks are restored into
> > their respective cgroups via cg_sets as usual.
> 
> Can you also add test into zdtm/live/static/ suite? These are run on 24x7 manner
> and make sure nothing gets broken accidentally.
> 
> > Signed-off-by: Tycho Andersen <tycho.andersen at canonical.com>
> > ---
> >  cgroup.c              | 449 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  cr-dump.c             |   3 +
> >  include/cgroup.h      |  46 ++++++
> >  include/proc_parse.h  |   3 +
> >  include/util.h        |   7 +-
> >  mount.c               |   3 +
> >  proc_parse.c          |  75 ++++++++-
> >  protobuf/cgroup.proto |  16 +-
> >  protobuf/mnt.proto    |   1 +
> >  util.c                |  34 ++++
> >  10 files changed, 609 insertions(+), 28 deletions(-)
> > 
> > diff --git a/cgroup.c b/cgroup.c
> > index 1fe5e6d..306bfe3 100644
> > --- a/cgroup.c
> > +++ b/cgroup.c
> > @@ -18,7 +20,8 @@
> >  /*
> >   * This structure describes set of controller groups
> >   * a task lives in. The cg_ctl entries are stored in
> > - * the @ctls list sorted by the .name field.
> > + * the @ctls list sorted by the .name field and then
> > + * by the .path field.
> 
> Why do we need the .path sorting as well?

I was thinking of the case when something is in both /lxc/u1 and /lxc,
if you visit /lxc/u1 first, it goes in heads and then /lxc does as
well. If we maintain this list as a sorted list, we will always visit
/lxc before /lxc/u1. This is why we still have the _MATCHing below as
well.

> > +
> > +		switch(mtype) {
> > +			/* ignore co-mounted cgroups */
> > +			case EXACT_MATCH :
> > +				goto out;
> > +			case PARENT_MATCH :
> > +				list_add_tail(&ncd->siblings, &match->children);
> > +				match->n_children++;
> > +				break;
> > +			case NO_MATCH :
> > +				list_add_tail(&ncd->siblings, &current_controller->heads);
> > +				current_controller->n_heads++;
> > +				break;
> > +		}
> > +
> > +		INIT_LIST_HEAD(&ncd->children);
> > +		ncd->n_children = 0;
> > +		ncd->controller = current_controller;
> > +
> > +		ncd->flags = 0;
> > +
> > +		snprintf(pbuf, PATH_MAX, "%s/memory.limit_in_bytes", fpath);
> > +		f = fopen(pbuf, "r");
> > +		if (f) {
> > +			if (fscanf(f, "%" SCNu64, &ncd->mem_limit) != 1) {
> > +				pr_err("Failed scanning %s\n", pbuf);
> > +				ret = -1;
> > +				goto out;
> > +			}
> > +			ncd->flags |= HAS_MEM_LIMIT;
> > +			fclose(f);
> > +		}
> > +
> > +		snprintf(pbuf, PATH_MAX, "%s/cpu.shares", fpath);
> > +		f = fopen(pbuf, "r");
> > +		if (f) {
> > +			if (fscanf(f, "%" SCNu32, &ncd->cpu_shares) != 1) {
> > +				pr_err("Failed scanning %s for u32\n", pbuf);
> > +				ret = -1;
> > +				goto out;
> > +			}
> > +			ncd->flags |= HAS_CPU_SHARES;
> > +			fclose(f);
> > +		}
> > +
> > +		return 0;
> > +	}
> > +
> > +out:
> > +	if (ncd) {
> > +		if (ncd->path)
> > +			xfree(ncd->path);
> > +		xfree(ncd);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int collect_cgroups(struct list_head *ctls)
> > +{
> > +	struct cg_ctl *cc;
> > +	int ret = 0;
> > +
> > +	list_for_each_entry(cc, ctls, l) {
> > +		char path[PATH_MAX];
> > +		char *name, *mount_point;
> > +		struct cg_controller *cg;
> > +		int i;
> > +
> > +		if (strstartswith(cc->name, "name="))
> > +			name = cc->name + 5;
> > +		else
> > +			name = cc->name;
> > +
> > +		mount_point = get_cgroup_mount_point(name);
> > +		if (!mount_point) {
> > +			pr_err("cgroup mount point for %s not found!\n", name);
> > +			return -1;
> > +		}
> > +
> > +		snprintf(path, PATH_MAX, "%s/%s%s", mount_point, name, cc->path);
> > +
> > +		current_controller = NULL;
> > +		current_controller_name = name;
> > +
> > +		/* Use the previously allocated struct for this controller if
> > +		 * there is one */
> 
> I'm not sure I understand this comment. Can you qualify one, please?

When we parse /proc/cgroups we allocate a cg_controller for each
entry, this is just looking for the previously allocated cg_controller
entry. (It might not be there, e.g. with systemd, which is why we
still might potentially allocate a new one.)

> > +		list_for_each_entry(cg, &cgroups, l) {
> > +			for (i = 0; i < cg->n_controllers; i++) {
> > +				if (strcmp(cg->controllers[i], cc->name) == 0) {
> > +					current_controller = cg;
> > +					break;
> > +				}
> > +			}
> > +		}
> > +
> > +		if (!current_controller) {
> > +			/* only allow "fake" controllers to be created this way */
> > +			if (!strstartswith(cc->name, "name=")) {
> > +				pr_err("controller %s not found\n", cc->name);
> > +				return -1;
> > +			} else {
> > +				struct cg_controller *nc = new_controller(cc->name, -1);
> > +				list_add_tail(&nc->l, &cg->l);
> > +				n_cgroups++;
> > +				current_controller = nc;
> > +			}
> > +		}
> > +
> > +		if ((ret = ftw(path, add_cgroup, 4)) < 0) {
> > +			pr_perror("failed walking %s for empty cgroups\n", path);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  int dump_task_cgroup(struct pstree_item *item, u32 *cg_id)
> >  {
> >  	int pid;
> 
> > @@ -431,15 +840,5 @@ int prepare_cgroup(void)
> >  
> >  	n_sets = ce->n_sets;
> >  	rst_sets = ce->sets;
> > -	if (n_sets)
> > -		/*
> > -		 * We rely on the fact that all sets contain the same
> > -		 * set of controllers. This is checked during dump
> > -		 * with cg_set_compare(CGCMP_ISSUB) call.
> > -		 */
> > -		ret = prepare_cgroup_sfd(rst_sets[0]);
> > -	else
> > -		ret = 0;
> > -
> > -	return ret;
> > +	return prepare_cgroup_sfd(ce);
> 
> This if (n_sets) handled the case when we didn't have cgroups to dump.
> E.g. -- when we just dump the tasks that live in the same cgroup set as
> criu does. Will this hunk break this?

I guess it will mount the cgroup yard but not actually put anything in
it, so I will put the if back.

> >  }
> 
> > diff --git a/mount.c b/mount.c
> > index 4d84f48..32410eb 100644
> > --- a/mount.c
> > +++ b/mount.c
> > @@ -861,6 +861,9 @@ static struct fstype fstypes[] = {
> >  	}, {
> >  		.name = "debugfs",
> >  		.code = FSTYPE__DEBUGFS,
> > +	}, {
> > +		.name = "cgroup",
> > +		.code = FSTYPE__CGROUP,
> 
> This would only handle the case when cgroup root is mounted inside container.
> AFAIK cgroups are typically bind-mounted into container from the directory
> where the container lives in. Maybe it makes sense to implement full cgroup
> mount support separately?

The only reason I added this was because parse_mountinfo_ent() won't
record the fs type (it just says "unsupported") unless it is present
in this list, and I need it to compare against to only match the
cgroup fs. I guess another option would be to store the "raw" fstype
in the structure as well, then we wouldn't need this hunk.

I will make the other changes as well.

Thanks,

Tycho