[CRIU] c/r of tasks with currently open cgroup files

Mon Sep 14 08:29:39 PDT 2015

Quoting Tycho Andersen (tycho.andersen at canonical.com):
> On Mon, Sep 14, 2015 at 02:56:52PM +0000, Serge Hallyn wrote:
> > Quoting Tycho Andersen (tycho.andersen at canonical.com):
> > > On Mon, Sep 14, 2015 at 01:39:50PM +0300, Pavel Emelyanov wrote:
> > > > On 09/11/2015 11:36 PM, Tycho Andersen wrote:
> > > > > Hi all,
> > > > > 
> > > > > Some tasks may want to open a cgroup directory (or file):
> > > > > 
> > > > > (00.008556) 5826 fdinfo 5: pos: 0x               0 flags:          2304000/0x1
> > > > > (00.008586) Dumping path for 5 fd via self 31 [/sys/fs/cgroup/systemd/lxc/priv]
> > > > > 
> > > > > The problem is that on restore, criu users can pass --cgroup-root=/lxc/priv2
> > > > > (or whatever) to rewrite their root cgroup paths, and this path is not created.
> > > > 
> > > > The --cgroup-root implicitly implies :) that tasks don't see the full paths of
> > > > cgroup files, i.e. they live in a FS tree where /sys/fs/cgroup/anything points
> > > > to some /sys/fs/cgroup/anything/foo/ directory on host in which the tasks we
> > > > dump actually live. And putting --cgroup-root on restore means that you just
> > > > want to move tasks from /sys/.../foo into /sys/.../bar fixing the visible FS
> > > > tree accordingly (with the --ext-mount-map I suppose).
> > > > 
> > > > So how can it happen that a task in a container sees the full cgroup path?
> > > 
> > > Under lxcfs, tasks do see the full path, they just get EACCES if they
> > > try to read/write from it. I suppose another option would be to patch
> > > lxcfs to function somewhat like cgroup namespaces and do as you say
> > > and hide parts of the cgroup tree. Do you know why it doesn't work
> > > this way now, Serge?
> > 
> > Not sure I understand the question.  lxcfs limits the visibility to
> > your container and its descendants, so if before checkpoint you were
> > under /sys/fs/cgroup/devices/lxc/foo1, and after restore you are under
> > /sys/fs/cgroup/devices/lxc/bar1, then tasks in the container will only
> > see bar1 under /sys/fs/cgroup/devices/lxc.
> 
> Right, I think the issue is that they see the full path, i.e.:
> 
> criu2:~ lxc exec unpriv bash
> root@unpriv:~# cat /proc/self/cgroup
> 10:memory:/lxc/unpriv
> ...
> 
> instead of:
> 
> 10:memory:/
> 
> > There's no way right now for lxcfs to know that it should pretend
> > that /sys/fs/cgroup/devices/lxc/foo1 now really means
> > /sys/fs/cgroup/devices/lxc/bar1.
> > 
> > The way lxcfs mounts are set up is by a post-mount hook script.  So
> > you could simply, in criu, set up the scripts to set up /sys/fs/cgroup
> > so that /sys/fs/cgroup/devices/lxc/foo1 is a path in tmpfs, and
> > /var/lib/lxcfs/cgroup/devices/lxc/bar1 is bind-mounted straight onto
> > the restored container's /sys/fs/cgroup/devices/lxc/foo1.
> 
> This could work. Although perhaps our use of --cgroup-root in LXC is
> incorrect since we don't have anything like cgroup namespaces in the
> Ubuntu kernels (yet).

Right, well, you need to decide which path you want to use - or bind
both into the container.  If you want to continue to honor the old
cgroup path, bind /sys/fs/cgroup/devices/lxc/foo1.  If you want to
honor what's in /proc/self/cgroup, bind /sys/fs/cgroup/devices/lxc/bar1.
Or bind both.
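
To make the three choices concrete, here is a rough sketch that only
computes which bind mounts each choice would need; it mounts nothing.
The function name bind_plan and the "honor" parameter are made up for
illustration, and it assumes lxcfs exposes the cgroup tree under
/var/lib/lxcfs/cgroup as in the example paths earlier in the thread.

```python
import posixpath

# Both roots are taken from the example paths in this thread.
LXCFS_ROOT = "/var/lib/lxcfs/cgroup"
CGROUP_ROOT = "/sys/fs/cgroup"


def bind_plan(controller, old_cg, new_cg, honor="both"):
    """Return (source, target) pairs for the bind mounts to set up.

    Hypothetical helper: it only builds the plan, it does not mount.
    honor="old"  keeps the pre-checkpoint path visible in the container,
    honor="new"  exposes the path now shown in /proc/self/cgroup,
    honor="both" does both.
    """
    if honor not in ("old", "new", "both"):
        raise ValueError("honor must be 'old', 'new' or 'both'")

    # After restore the tasks really live in new_cg, so that is always
    # the lxcfs directory we bind from.
    source = posixpath.join(LXCFS_ROOT, controller, new_cg.lstrip("/"))

    visible = {"old": [old_cg], "new": [new_cg], "both": [old_cg, new_cg]}
    return [
        (source, posixpath.join(CGROUP_ROOT, controller, cg.lstrip("/")))
        for cg in visible[honor]
    ]
```

For the devices example above, bind_plan("devices", "/lxc/foo1",
"/lxc/bar1", "old") yields one pair: the lxcfs bar1 directory as the
source and the container's /sys/fs/cgroup/devices/lxc/foo1 as the target.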

Or, we can virtualize /proc/self/cgroup through lxcfs - except we'd
have to decide how lxcfs should know what paths to use.
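
As a sketch of what that virtualization might look like (this is not
how lxcfs behaves today, and the function name is made up), the
rewrite itself is just stripping the container's root from each path.
The open question of how lxcfs learns the root is sidestepped here by
passing it in as a parameter.

```python
def virtualize_cgroup(text, root):
    """Rewrite /proc/self/cgroup content so paths are relative to root.

    Hypothetical helper. Each line has the form "id:controllers:path".
    Paths at or below root are shown relative to it; anything else,
    including lines that do not parse, is passed through unchanged.
    """
    root = root.rstrip("/")
    out = []
    for line in text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and root:
            hier, ctrl, path = parts
            if path == root or path.startswith(root + "/"):
                # "/lxc/unpriv" becomes "/", "/lxc/unpriv/a" becomes "/a"
                path = path[len(root):] or "/"
            line = f"{hier}:{ctrl}:{path}"
        out.append(line)
    return "\n".join(out)
```

With root set to /lxc/unpriv, the line 10:memory:/lxc/unpriv from
Tycho's example comes out as 10:memory:/.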

Anyway, yes, cgroupns.  I'll be starting any day now.

-serge