[CRIU] [PATCH 1/2] Re-create cgroups if necessary
Serge Hallyn
serge.hallyn at ubuntu.com
Tue Jun 24 13:12:10 PDT 2014
Quoting Pavel Emelyanov (xemul at parallels.com):
> On 06/24/2014 11:34 PM, Saied Kazemi wrote:
> >
> >
> >
> > On Tue, Jun 24, 2014 at 10:05 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> >
> > On 06/24/2014 09:01 PM, Saied Kazemi wrote:
> > >
> > >
> > >
> > > On Tue, Jun 24, 2014 at 9:26 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> > >
> > > On 06/24/2014 06:12 PM, Serge Hallyn wrote:
> > >
> > > >> Yes. Empty cgroups cannot be discovered through the /proc/pid/cgroup file;
> > > >> we have to walk the live cgroup mount instead. But the problem is -- we cannot
> > > >> just take the system /sys/fs/cgroup/ directories, since there will be
> > > >> cgroups from other containers there as well. We should find the root subdir
> > > >> of the container we dump and walk _this_ subtree.
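(For illustration, here is roughly what "walk the live cgroup mount" could look like -- the subtree path and the helper name are hypothetical, not CRIU code. The point is that an empty cgroup never appears in any /proc/<pid>/cgroup, so the mounted hierarchy itself is the only place to find it:)

```shell
# Hypothetical helper: list empty cgroup directories under a container's
# subtree.  A cgroup whose "tasks" file is empty holds no processes, so
# it can only be discovered by walking the directory tree itself.
find_empty_cgroups() {
    # $1 - root of the container's subtree, e.g. /sys/fs/cgroup/cpuset/lxc/u1
    find "$1" -mindepth 1 -type d | while read -r d; do
        [ -s "$d/tasks" ] || echo "$d"
    done
}
```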
> > > >
> > > > I volunteer to work on a proper cgroup c/r implementation, once Tycho
> > > > gets the very basics done.
> > >
> > > Serge, Tycho, I think I need to clarify one more thing.
> > >
> > > I believe that once we do a full cgroup hierarchy restore, all the
> > > mkdirs will go away from the move_in_cgroup() routine. Instead,
> > > we will have some code that constructs the whole cgroup subtree
> > > before criu starts forking tasks. And once we have that,
> > > move_in_cgroup() would (should) never fail. Thus this patch would
> > > effectively be reverted.
> > >
> > > Thanks,
> > > Pavel
> > >
> > >
> > > I agree. Creation of the cgroup and its subtree should be done in one place as opposed
> > > to being split apart (i.e., between prepare_cgroup_sfd() and move_in_cgroup() as is done
> > > currently).
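(As an illustration of the "one place" idea -- recreating the whole subtree up front, before any task is forked into it. The list-file format and the helper name are made up for this sketch:)

```shell
# Hypothetical helper: given a file listing the cgroup paths recorded at
# dump time (parents before children, one per line), recreate the whole
# subtree in a single pass before any task is restored into it.
restore_cgroup_tree() {
    # $1 - file with one cgroup directory path per line
    while read -r path; do
        mkdir -p "$path"    # tolerate directories that already exist
    done < "$1"
}
```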
> > >
> > > Regarding the 4 items to do for cgroups in your earlier email, I believe that we should
> > > have CLI options to tell CRIU what cgroups it needs to restore (almost like the way we
> > > tell it about external bind mounts).
> >
> > I was thinking that if we take the root task, check cgroups it lives in and
> > dump the whole subtree starting from it, this would work properly and would
> > not require any CLI hints.
> >
> > Do you mean, that we need to tell criu where in cgroup hierarchy to start
> > recreating the subtree it dumped?
> >
> > > This way we can handle the empty cgroups as well as dumping and restoring on the same
> > > machine versus on a different machine (i.e., migration). For migration, CRIU definitely
> > > needs to be told how to handle cgroup name collisions.
> >
> > But if we ask criu to restore tasks in a fresh new sub-cgroup, why would this
> > collision happen?
> >
> > > This is not something that it can handle at dump time.
> > >
> > > --Saied
> >
> >
> > I am not sure if I understand what is meant by "fresh new sub-cgroup". Since the process
> > has to be restored in the same cgroup, I assume you mean a new mountpoint. But if the
> > cgroup already exists, giving it a private new mountpoint doesn't mean that it will set
> > up a new hierarchy. Consider the following example:
> >
> > # cat /sys/fs/cgroup/hugetlb/notify_on_release
> > 0
> > # mkdir /mnt/foo
> > # mount -t cgroup -o hugetlb cgroup /mnt/foo
> > # cat /mnt/foo/notify_on_release
> > 0
> > # echo 1 > /sys/fs/cgroup/hugetlb/notify_on_release
> > # cat /mnt/foo/notify_on_release
> > 1
> > # echo 0 > /mnt/foo/notify_on_release
> > # cat /sys/fs/cgroup/hugetlb/notify_on_release
> > 0
> > #
> >
> > So I think we need a mechanism to tell CRIU whether it should expect the cgroup to already exist
> > (e.g., restore on the same machine) or not (e.g., restore after a reboot or on a different machine).
> >
> > I am not a cgroups expert, but I hope it's more clear now.
>
> Yes, thank you :) My understanding of cgroups tells me that we don't need a special option
> for that. AFAIU LXC and OpenVZ don't fail if they create a cgroup that already exists;
> neither should CRIU.
Right. If the taskset was under /cpuset/lxc/u1, for instance, and u1
is running (or /cpuset/lxc/u1 was not cleaned up), then criu should
simply use /cpuset/lxc/u1.1, then u1.2, etc. Under that, since u1.N did
not previously exist, there should be no collisions (and if there are,
it's cause for failing the restore, as we either have a bug or a race
with another criu instance or another toolset).
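A sketch of that fallback naming, assuming a simple mkdir loop (the
helper name and the retry cap are mine, not anything criu actually does):

```shell
# Hypothetical helper: use the desired cgroup path if it is free,
# otherwise fall back to <path>.1, <path>.2, ... as described above.
pick_restore_cgroup() {
    # $1 - desired cgroup path, e.g. /cpuset/lxc/u1
    if mkdir "$1" 2>/dev/null; then
        echo "$1"
        return 0
    fi
    n=1
    while ! mkdir "$1.$n" 2>/dev/null; do
        n=$((n + 1))
        # repeated collisions mean a bug or a race with another criu
        # instance/toolset, which is cause for failing the restore
        [ "$n" -le 16 ] || return 1
    done
    echo "$1.$n"
}
```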
(And of course I agree that we should create and configure all cgroups
before we restart any tasks.)
-serge