[CRIU] [PATCH v2 2/2] restore: correctly restore cgroup mounts inside a container

Fri Mar 25 11:04:33 PDT 2016

On Fri, Mar 25, 2016 at 08:37:48PM +0300, Pavel Emelyanov wrote:
> On 03/24/2016 08:09 PM, Tycho Andersen wrote:
> > Before the nsroot= mount option, we were just getting lucky because the
> > cgroup superblocks "matched" when inspecting them from userspace, so we
> > were actually getting a bind mount from the host when migrating from within
> > cgroup namespaces.
> > 
> > Instead, let's actually do a new (i.e. not a bind mount) for cgroup
> > namespaces. For this, we need two things:
> > 
> > 1. to prepare the cgroup namespace (and thus the cgroups) before the mount
> >    ns, so when the mount() occurrs it is relative to the right cgroup path.
> > 
> > 2. not reject cgroup filesystems with no root. A cgroup ns mount looks
> >    like:
> > 
> > 	 223 222 0:22 /lxc/unpriv /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd,nsroot=/lxc/unpriv
> > 
> >    i.e. it has /lxc/unpriv as its root, and thus doesn't look rooted to CRIU.
> >    Let's allow cgroup mounts to be unrooted so we can deal with this.
> 
> I have a suggestion how to avoid the hackish checks in validate and can_mount_now().
> 
> 1. Add ->read_img callback to fstype called when reading mount points 
>    images (collect_mnt_from_image) 
> 2. For cgroupfs check for root_ns_mask to contain CLONE_NEWCGROUPNS and
>    cut the mi->root to be "/" effectively turning the mount point into
>    fsroot one
> 
> and leave the hunk that moves tasks into cgroups earlier. Hopefully before
> setting up namespaces would work, all the more so we configure the namespaces
> at the very end.
> 
> What do you think?

Hmm, except we will still get a dump failure because validate_mounts()
is called on dump as well.

What about a fstype->munge_mount callback that is called at the end of
parse_mountinfo_ent, so that you can arbitrarily munge a mount info if
necessary. Same trick applies, we just set mi->root to be "/" if
root_ns_mask & CLONE_NEWCGROUP.

Tycho

> -- Pavel
> 
> > Signed-off-by: Tycho Andersen <tycho.andersen at canonical.com>
> > ---
> >  criu/cr-restore.c | 18 +++++++++---------
> >  criu/mount.c      |  5 ++++-
> >  2 files changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> > index ffd2f01..967d5fa 100644
> > --- a/criu/cr-restore.c
> > +++ b/criu/cr-restore.c
> > @@ -1608,6 +1608,15 @@ static int restore_task_with_children(void *_arg)
> >  			goto err;
> >  	}
> >  
> > +	/*
> > +	 * Call this _before_ forking to optimize cgroups
> > +	 * restore -- if all tasks live in one set of cgroups
> > +	 * we will only move the root one there, others will
> > +	 * just have it inherited.
> > +	 */
> > +	if (prepare_task_cgroup(current) < 0)
> > +		goto err;
> > +
> >  	/* Restore root task */
> >  	if (current->parent == NULL) {
> >  		if (restore_finish_stage(CR_STATE_RESTORE_NS) < 0)
> > @@ -1640,15 +1649,6 @@ static int restore_task_with_children(void *_arg)
> >  	if (prepare_mappings())
> >  		goto err;
> >  
> > -	/*
> > -	 * Call this _before_ forking to optimize cgroups
> > -	 * restore -- if all tasks live in one set of cgroups
> > -	 * we will only move the root one there, others will
> > -	 * just have it inherited.
> > -	 */
> > -	if (prepare_task_cgroup(current) < 0)
> > -		goto err;
> > -
> >  	if (prepare_sigactions() < 0)
> >  		goto err;
> >  
> > diff --git a/criu/mount.c b/criu/mount.c
> > index eb8d058..542f1d6 100644
> > --- a/criu/mount.c
> > +++ b/criu/mount.c
> > @@ -734,7 +734,7 @@ static int validate_mounts(struct mount_info *info, bool for_dump)
> >  					 */
> >  					ret = m->need_plugin ? 0 : -ENOTSUP;
> >  
> > -				if (ret < 0) {
> > +				if (ret < 0 && m->fstype->code != FSTYPE__CGROUP) {
> >  					if (ret == -ENOTSUP)
> >  						pr_err("%d:%s doesn't have a proper root mount\n",
> >  								m->mnt_id, m->mountpoint);
> > @@ -2453,6 +2453,9 @@ static bool can_mount_now(struct mount_info *mi)
> >  	if (mi->external)
> >  		goto shared;
> >  
> > +	if (mi->fstype->code == FSTYPE__CGROUP)
> > +		return true;
> > +
> >  	/*
> >  	 * We're the slave peer:
> >  	 *   - Make sure the master peer is already mounted
> > 
>