[CRIU] [PATCH 1/2] Re-create cgroups if necessary

Saied Kazemi saied at google.com
Tue Jun 24 13:24:34 PDT 2014


Glad things are clearer now and we're converging...  The only remaining
decision is whether to use the same cgroup as before or not (/cpuset/lxc/u1
or /cpuset/lxc/u1.1 in your example).  I would argue that since the state
of a process after restore should be the same as before dump, it should be
placed in /cpuset/lxc/u1.

With a CLI option we tell CRIU one of two things (a rough sketch follows the list):

1. Expect the cgroup to already exist and just put the process back in it.  If
the cgroup doesn't exist, fail.
2. Expect the cgroup not to exist, create it, and put the process in it.  If
the cgroup exists, fail.
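To make the two behaviors concrete, here is a minimal sketch in C of what the
restore side could do for a single cgroup path such as
/sys/fs/cgroup/cpuset/lxc/u1.  The mode enum and the helper name are
hypothetical, purely illustrative; nothing below is existing CRIU code:

/* Illustrative only -- not CRIU's actual restore logic. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

enum cg_mode {
    CG_EXPECT_EXISTING,   /* case 1: reuse the cgroup, fail if it is missing */
    CG_EXPECT_FRESH,      /* case 2: create the cgroup, fail if it is already there */
};

static int prepare_restore_cgroup(const char *path, enum cg_mode mode)
{
    struct stat st;

    if (mode == CG_EXPECT_EXISTING) {
        if (stat(path, &st) < 0 || !S_ISDIR(st.st_mode)) {
            fprintf(stderr, "cgroup %s does not exist\n", path);
            return -1;
        }
        return 0;   /* later, just write the pid back into its tasks file */
    }

    /* CG_EXPECT_FRESH: the directory must not be there yet */
    if (mkdir(path, 0755) < 0) {
        if (errno == EEXIST)
            fprintf(stderr, "cgroup %s already exists\n", path);
        else
            perror("mkdir");
        return -1;
    }
    return 0;
}

Mode 1 would match restoring on the machine where the dump was taken; mode 2
would match a restore after reboot or on a different machine.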

Hope this makes sense.

--Saied



On Tue, Jun 24, 2014 at 1:12 PM, Serge Hallyn <serge.hallyn at ubuntu.com>
wrote:

> Quoting Pavel Emelyanov (xemul at parallels.com):
> > On 06/24/2014 11:34 PM, Saied Kazemi wrote:
> > >
> > >
> > >
> > > On Tue, Jun 24, 2014 at 10:05 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> > >
> > >     On 06/24/2014 09:01 PM, Saied Kazemi wrote:
> > >     >
> > >     >
> > >     >
> > >     > On Tue, Jun 24, 2014 at 9:26 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> > >     >
> > >     >     On 06/24/2014 06:12 PM, Serge Hallyn wrote:
> > >     >
> > >     >     >> Yes. Empty cgroups cannot be discovered through
> /proc/pid/cgroup file,
> > >     >     >> we should walk the alive cgroup mount. But the problem is
> -- we cannot
> > >     >     >> just take the system /sys/fs/cgroup/ directories, since
> there will be
> > >     >     >> cgroups from other containers as well. We should find the
> root subdir
> > >     >     >> of the container we dump and walk _this_ subtree.
> > >     >     >
> > >     >     > I volunteer to work on a proper cgroup c/r implementation,
> once Tycho
> > >     >     > gets the very basics done.
> > >     >
> > >     >     Serge, Tycho, I think I need to clarify one more thing.
> > >     >
> > >     >     I believe that once we do a full cgroup hierarchy restore,
> all the
> > >     >     mkdirs will go away from the move_in_cgroup() routine.
> Instead,
> > >     >     we will have some code that constructs the whole cgroup
> subtree
> > >     >     before criu starts forking tasks. And once we have that,
> > >     >     move_in_cgroup() would (should) never fail. Thus this patch
> would
> > >     >     be effectively reverted.
> > >     >
> > >     >     Thanks,
> > >     >     Pavel
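A rough sketch of the ordering described above, not actual CRIU code: all
cgroup directories recorded at dump time are recreated before any task is
forked, so the later move into a cgroup reduces to writing a pid into an
existing tasks file and (barring races) cannot fail.  The function names are
borrowed from the thread; the bodies and the path list are made up:

/* Illustrative only -- the path list would come from the dump images. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static const char *dumped_cgroups[] = {
    "/sys/fs/cgroup/cpuset/lxc/u1",
    "/sys/fs/cgroup/cpuset/lxc/u1/child",
    NULL,
};

static int prepare_cgroup_subtree(void)
{
    int i;

    /* parents are listed before children, so a plain mkdir() is enough */
    for (i = 0; dumped_cgroups[i]; i++)
        if (mkdir(dumped_cgroups[i], 0755) < 0 && errno != EEXIST)
            return -1;
    return 0;
}

static int move_in_cgroup(pid_t pid, const char *cg)
{
    char path[4096];
    FILE *f;

    snprintf(path, sizeof(path), "%s/tasks", cg);
    f = fopen(path, "w");   /* the directory already exists by this point */
    if (!f)
        return -1;
    fprintf(f, "%d\n", (int)pid);
    return fclose(f);
}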
> > >     >
> > >     >
> > >     > I agree.  Creation of the cgroup and its subtree should be done
> in one place as opposed
> > >     > to being split apart (i.e., between prepare_cgroup_sfd() and
> move_in_cgroup() as is done
> > >     > currently).
> > >     >
> > >     > Regarding the 4 items to do for cgroups in your earlier email, I
> believe that we should
> > >     > have CLI options to tell CRIU what cgroups it needs to restore
> (almost like the way we
> > >     > tell it about external bind mounts).
> > >
> > >     I was thinking that if we take the root task, check the cgroups it
> lives in and
> > >     dump the whole subtree starting from it, this would work properly
> and would
> > >     not require any CLI hints.
> > >
> > >     Do you mean that we need to tell criu where in the cgroup hierarchy
> to start
> > >     recreating the subtree it dumped?
> > >
> > >     > This way we can handle the empty cgroups as well as dumping and
> restoring on the same
> > >     > machine versus on a different machine (i.e., migration).  For
> migration, CRIU definitely
> > >     > needs to be told how to handle cgroup name collisions.
> > >
> > >     But if we ask criu to restore tasks in a fresh new sub-cgroup, why
> would this
> > >     collision happen?
> > >
> > >     > This is not something that it can handle at dump time.
> > >     >
> > >     > --Saied
> > >
> > >
> > > I am not sure if I understand what is meant by "fresh new sub-cgroup".
>  Since the process
> > > has to be restored in the same cgroup, I assume you mean a new
> mountpoint.  But if the
> > > cgroup already exists, giving it a private new mountpoint doesn't mean
> that it will set
> > > up a new hierarchy.  Consider the following example:
> > >
> > > # cat /sys/fs/cgroup/hugetlb/notify_on_release
> > > # mkdir /mnt/foo
> > > # mount -t cgroup -o hugetlb cgroup /mnt/foo
> > > # cat /mnt/foo/notify_on_release
> > > 0
> > > # echo 1 > /sys/fs/cgroup/hugetlb/notify_on_release
> > > # cat /mnt/foo/notify_on_release
> > > 1
> > > # echo 0 > /mnt/foo/notify_on_release
> > > # cat /sys/fs/cgroup/hugetlb/notify_on_release
> > > 0
> > > #
> > >
> > > So I think we need a mechanism to tell CRIU whether it should expect
> the cgroup to already exist
> > > (e.g., restore on the same machine) or not (e.g., restore after a reboot
> or on a different machine).
> > >
> > > I am not a cgroups expert, but I hope it's clearer now.
> >
> > Yes, thank you :) My understanding of cgroups tells me that we don't
> need a special option
> > for that. AFAIU LXC and OpenVZ don't fail if they create a cgroup that
> already exists,
> > and neither should CRIU.
>
> Right, if the taskset was under /cpuset/lxc/u1, for instance, then if u1
> is running (or /cpuset/lxc/u1 was not cleaned up) then criu should
> simply use /cpuset/lxc/u1.1, then u1.2, etc.  Under that, since u1.N did
> not exist, there should be no collisions (and if there are, that's cause
> for failing the restart, as we either have a bug or a race with
> another criu instance or another toolset).
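A small sketch of the naming scheme Serge describes (illustrative only, not
CRIU code): keep the original leaf name if it is free, otherwise append .1,
.2, ... until a mkdir() succeeds; everything restored below that fresh root
then cannot collide unless there is a bug or a race with another tool:

/* Illustrative only: pick a fresh restore root, e.g. u1 -> u1.1 -> u1.2 */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* base is e.g. "/sys/fs/cgroup/cpuset/lxc/u1"; out receives the path used */
static int make_fresh_cgroup_root(const char *base, char *out, size_t len)
{
    int i;

    snprintf(out, len, "%s", base);
    for (i = 1; ; i++) {
        if (mkdir(out, 0755) == 0)
            return 0;
        if (errno != EEXIST)
            return -1;
        snprintf(out, len, "%s.%d", base, i);   /* try the next suffix */
    }
}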
>
> (And of course I agree that we should create and configure all cgroups
> before we restart any tasks.)
>
> -serge
>

