[CRIU] [PATCH 1/2] finalize cgroups correctly

Thu Mar 26 02:44:25 PDT 2015

On 03/26/2015 02:46 AM, Saied Kazemi wrote:
> Sorry for the delay in getting back.  Here are the answers I owe you:
> 
> 1. Re kernel, I am running 3.13 kernel.  I also have a 3.16 kernel, 
>    but nothing higher.

OK, so we do know that in your case rmdir() will not drop mounts from 
other namespaces.

> 2. Re breaking things, we won't break anything to the best of my knowledge.
>    So it's safe to disallow image directory inside root.

Yup, agreed :)

> 3. Re umount_cgyard(), it works with the following patch.  It's a quick hack
>    because it assumes root is a subset of imgs_dir.  What's the best way to 
>    generalize this?

I would say that umount2_at(fd, path) is the way to go, but it would be
long and complicated. Complicated because Al Viro is currently reworking
the whole mounting API, so adding one more syscall to rework would be
at least ignored by him :)

So we need a fix right in the CRIU with the existing kernel. Typically the
do_anything_at(fd, path) code can be done like this

int do_anything_at(fd, sub_path)
{
	char path[];
	sprintf(path, "/proc/self/fd/%d/%s", fd, sub_path);
	return do_anything(path);
}

i.e. -- go via /proc/self/fd/%d link, kernel follows these links in an nice
manner --just by referencing the file's dentry, not by re-looking-up the path.

And we do use this heavily all over the CRIU code, but IIRC we've never tried
this for mount/umount. Can you check whether umount2() will work this way? I
think it should as kernel uses one path resolution code for everything even
umount, but maybe there's some extra check I'm unaware of...

Thanks,
Pavel

> diff --git a/cgroup.c b/cgroup.c
> index b1a2f92..913f480 100644
> --- a/cgroup.c
> +++ b/cgroup.c
> @@ -939,6 +939,22 @@ int prepare_task_cgroup(struct pstree_item *me)
>         return move_in_cgroup(se);
>  }
>  
> +void umount_cgyard(void)
> +{
> +       char path[PATH_MAX];
> +       int n;
> +
> +       if (!cg_yard)
> +               return;
> +
> +       n = strlen(opts.root);
> +       if (!strncmp(opts.imgs_dir, opts.root, n)) {
> +               snprintf(path, PATH_MAX, "%s/%s", &opts.imgs_dir[n], cg_yard);
> +               if (umount2(path, MNT_DETACH) == -1)
> +                       pr_perror("Unable to unmount cg_yard %s", path);
> +       }
> +}
> +
>  void fini_cgroup(void)
>  {
>         if (!cg_yard)
> diff --git a/crtools.c b/crtools.c
> index 0b3c497..202acd8 100644
> --- a/crtools.c
> +++ b/crtools.c
> @@ -278,6 +278,7 @@ int main(int argc, char *argv[], char *envp[])
>                         opts.restore_sibling = true;
>                         break;
>                 case 'D':
> +                       opts.imgs_dir = optarg;
>                         imgs_dir = optarg;
>                         break;
>                 case 'W':
> diff --git a/include/cr_options.h b/include/cr_options.h
> index d6ce2ae..90eb0ed 100644
> --- a/include/cr_options.h
> +++ b/include/cr_options.h
> @@ -60,6 +60,7 @@ struct cr_options {
>         char                    *new_global_cg_root;
>         struct list_head        new_cgroup_roots;
>         bool                    aufs;           /* auto-deteced, not via cli */
> +       char                    *imgs_dir;
>  };
>  
>  extern struct cr_options opts;
> diff --git a/mount.c b/mount.c
> index e59cb0d..df668ce 100644
> --- a/mount.c
> +++ b/mount.c
> @@ -2013,12 +2013,16 @@ static int do_restore_task_mnt_ns(struct ns_id *nsid)
>  
>  int restore_task_mnt_ns(struct pstree_item *current)
>  {
> +       extern void umount_cgyard(void);
> +
>         if (current->ids && current->ids->has_mnt_ns_id) {
>                 unsigned int id = current->ids->mnt_ns_id;
>                 struct ns_id *nsid;
>  
> -               if (root_item->ids->mnt_ns_id == id)
> +               if (root_item->ids->mnt_ns_id == id) {
> +                       umount_cgyard();
>                         return 0;
> +               }
>  
>                 nsid = lookup_ns_by_id(id, &mnt_ns_desc);
>                 if (nsid == NULL) {
> @@ -2026,6 +2030,7 @@ int restore_task_mnt_ns(struct pstree_item *current)
>                         return -1;
>                 }
>  
> +               umount_cgyard();
>                 if (do_restore_task_mnt_ns(nsid))
>                         return -1;
>         }
> 
> --Saied
> 
> 
> On Tue, Mar 24, 2015 at 11:52 AM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com>> wrote:
> 
>     On 03/24/2015 09:30 PM, Saied Kazemi wrote:
>     >
>     >
>     > On Tue, Mar 24, 2015 at 10:57 AM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com> <mailto:xemul at parallels.com <mailto:xemul at parallels.com>>> wrote:
>     >
>     >     On 03/24/2015 08:25 PM, Saied Kazemi wrote:
>     >     > For the easy case of comparing pathname strings on the command line (-D and --root), I
>     >     > think it's better to error and exit instead of restoring with all the cgyard mounts
>     >     > leaked into the restored process tree.
>     >
>     >     I rather agree with this. It's quite uncommon (from my perspective) that images sit
>     >     in the container's root, so preventing this usage doesn't (shouldn't) break anything
>     >     critical. Does it?
>     >
>     >
>     > It wouldn't break anything because nothing has been merged yet.
> 
>     I meant -- right now we allow for images to be under root, but we plan
>     to ban this and if we do, nothing should get broken.
> 
>     >     > For a more general solution, is it possible to make sure that after fini_cgroup() the
>     >     > cgyard mount points have been successfully removed from the restored tree?
>     >
>     >     What if we call umount with detach flag on cgyard dir inside restore_task_mnt_ns()?
>     >     Would this resolve this?
>     >
>     >
>     > I just quickly tried the following patch, first calling umount_cgyard() before if and then
>     > after if.  Neither case worked :(
> 
>     The cg_yard variable is just a ".criu.cgyard.XXXXX" relative path. When
>     calling umount2 on it kernel would try to resolve one relative to task's
>     cwd.
> 
>     As far "calling this after if" is concerned -- yes, it's supposed to fail
>     since restoring task's mntns implies calling pivot_roo() which moves task's
>     cwd into new root and thus the cg_yard starts pointing to non-existing
>     directory.
> 
>     But "before if"... I have no explanation for it. With what error does the
>     umount fail in this case?
> 
>     > --Saied
>     >
>     > diff --git a/cgroup.c b/cgroup.c
>     > index b1a2f92..fa94853 100644
>     > --- a/cgroup.c
>     > +++ b/cgroup.c
>     > @@ -951,6 +951,12 @@ void fini_cgroup(void)
>     >         cg_yard = NULL;
>     >  }
>     >
>     > +void umount_cgyard(void)
>     > +{
>     > +       if (cg_yard)
>     > +               umount2(cg_yard, MNT_DETACH);
>     > +}
>     > +
>     >  static int restore_cgroup_prop(const CgroupPropEntry * cg_prop_entry_p,
>     >                                char *path, int off)
>     >  {
>     > diff --git a/include/cgroup.h b/include/cgroup.h
>     > index 14fef67..5bcf028 100644
>     > --- a/include/cgroup.h
>     > +++ b/include/cgroup.h
>     > @@ -10,6 +10,7 @@ int prepare_cgroup(void);
>     >  /* Restore things like cpu_limit in known cgroups. */
>     >  int prepare_cgroup_properties(void);
>     >  void fini_cgroup(void);
>     > +void umount_cgyard(void);
>     >
>     >  struct cg_controller;
>     >
>     > diff --git a/mount.c b/mount.c
>     > index e59cb0d..b0b34cb 100644
>     > --- a/mount.c
>     > +++ b/mount.c
>     > @@ -2028,6 +2028,7 @@ int restore_task_mnt_ns(struct pstree_item *current)
>     >
>     >                 if (do_restore_task_mnt_ns(nsid))
>     >                         return -1;
>     > +               umount_cgyard();
>     >         }
>     >
>     >         return 0;
>     >
> 
>