[CRIU] [PATCH 2/2] Make lacking cgroup properties non-fatal

Thu Aug 14 09:37:09 PDT 2014

I am unfortunately leaving back for school at the end of this week so my
ability to further contribute will be limited.

If we want a failure in the cgroup property restore to stop the entire
restoration, we need a patch further up the call chain. I agree the
behavior should either be a complete stop of the restore or the behavior
supplied by the above patch.

While it seems there are different cases where we want different behaviors
when properties are missing on the restore side, I think we are all in
agreement that properties that are missing during dump should just be
ignored. Would you like me to break this patch into 2 patches, handling
missing property failure on the dump side and the restore side, so we can
merge the dump patch? Getting this part merged is a priority for me as it
blocking some of my colleagues.

On Thu, Aug 14, 2014 at 7:15 AM, Pavel Emelyanov <xemul at parallels.com>
wrote:

> On 08/14/2014 06:01 PM, Serge Hallyn wrote:
> > Quoting Pavel Emelyanov (xemul at parallels.com):
> >> On 08/14/2014 01:46 AM, Garrison Bellack wrote:
> >>> On Wed, Aug 13, 2014 at 1:55 PM, Tycho Andersen <
> tycho.andersen at canonical.com <mailto:tycho.andersen at canonical.com>> wrote:
> >>>
> >>>     On Wed, Aug 13, 2014 at 01:32:01PM -0700, Garrison Bellack wrote:
> >>>     > On Wed, Aug 13, 2014 at 12:43 PM, Andrew Vagin <
> avagin at parallels.com <mailto:avagin at parallels.com>> wrote:
> >>>     >
> >>>     > > On Wed, Aug 13, 2014 at 11:59:33AM -0700, gbellack at google.com
> <mailto:gbellack at google.com> wrote:
> >>>     > > > From: Garrison Bellack <gbellack at google.com <mailto:
> gbellack at google.com>>
> >>>     > > >
> >>>     > > > Because different kernel versions have different cgroup
> properties, criu
> >>>     > > > shouldn't crash just because the properties statically
> listed aren't
> >>>     > > exact.
> >>>     > > > Instead, during dump, ignore properties the kernel doesn't
> have and
> >>>     > > continue.
> >>>     > > > In addition, during restore, in the event of migration to a
> kernel that
> >>>     > > is
> >>>     > > > missing a property that was dumped, print an error but
> continue.
> >>>     > >
> >>>     > > I am not sure that we can continue in this case. Why do you
> think that
> >>>     > > it's safe? I think we need to ask a user about this. So can we
> add a new
> >>>     > > option to allow continuing in this case?
> >>>     >
> >>>     >
> >>>     > I'm a little confused about your concerns of safety. Are you
> referring to
> >>>     > missing properties on the dump or restore side?
> >>>
> >>>     I think restore side. I also wondered the same thing -- it seems
> that
> >>>     missing properties on restore should be an error (perhaps as Andrew
> >>>     says with an --allow-missing-cg-props flag on the restore side to
> >>>     allow you to override this error).
> >>>
> >>>
> >>> Here is the reasoning behind this decision. Let say you are migrating
> properties A,B,C,D where B doesn't exist.
> >>>
> >>> This is how it currently works:
> >>> A will be restored. B prints an error and causes failure because it is
> missing. However, because cgroup property restoration is the last thing to
> happen during restore, and because of designs further up the call chain,
> the restoration of the process still goes through. C,D are not restored
> because of previous criu failure.
> >>> End result -- Process and property A restored, C and D are not
> restored even though they exist on the new machine
> >>>
> >>> With this patch:
> >>> A will be restored. B prints an error and continues. C and D are
> restored.
> >>> End result --  Process and property A, C, D all restored.
> >>>
> >>> I think between these two options we clearly prefer the later behavior.
> >>
> >> Absolutely. AFAIU Andrew and Tycho meant, that there could be the 3rd
> behavior -- once B
> >> fails we abort all the restore. And they proposed an option for
> controlling this.
> >>
> >> I have a question about this -- what does LXC tool do if it meets a
> cgroup configuration
> >> parameter in container's config, that is missing in the kernel?
> >
> > Container start will fail.  The difference is that if there is a cgroup
> > limit in the container's configuration file, then it was specifically
> asked
> > for.  Ideally for a criu restart, we want the same behavior
> (overrideable),
> > but we want limits which were not asked for to be ignored.
>
> OK, so we need some fine-grained control over what to do with properties.
> And since we've started talking about this. In OpenVZ kernel we have much
> more cgroup parameters, than those currently listed in CRIU (e.g. kernel
> memory control). In the future we will likely have some of them upstream,
> and definitely will have some completely new. Patching criu every time
> just to add new properties seems quite inconvenient. Especially, taking
> into account that this is just the list of strings.
>
> Can we teach the get_known_properties() to accept lists from the caller
> and use similar lists on restore to specify which of them can be ignored?
>
> Thoughts?
>
> > I.e. if I've checkpointed a nested container which has no cpuset
> specified,
> > and the parent container had a cpuset limit specified (say cpu 0), then
> if I
> > restart under another parent container which has a different cpuset (say
> > cpu 1), then restart shouldn't fail, bc that limit was not inherent to
> > the container I checkpointed.
>
> Thanks,
> Pavel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140814/eb3c0e88/attachment.html>