[CRIU] lxc - cgroup related restore error

Wed Jul 13 08:56:29 PDT 2016

On Wed, Jul 13, 2016 at 09:30:02AM -0600, Tycho Andersen wrote:
> On Wed, Jul 13, 2016 at 05:17:24PM +0200, Adrian Reber wrote:
> > On Wed, Jul 13, 2016 at 08:27:42AM -0600, Tycho Andersen wrote:
> > > On Wed, Jul 13, 2016 at 12:49:07PM +0200, Adrian Reber wrote:
> > > > On Wed, Jul 13, 2016 at 01:41:34PM +0300, Cyrill Gorcunov wrote:
> > > > > On Wed, Jul 13, 2016 at 12:29:01PM +0200, Adrian Reber wrote:
> > > > > > 
> > > > > > > If I try to migrate a process while an LXC container is running on
> > > > > > > the source system, the migration fails during restore on the
> > > > > > > destination system with:
> > > > > > 
> > > > > > Error (cgroup.c:1193): cg: Failed writing 0-3 to cpuset//lxc/c7/cpuset.cpus: Numerical result out of range
> > > > > > Error (cgroup.c:1470): cg: Restoring special cpuset props failed!
> > > > > > 
> > > > > > This happens with CRIU 2.3 and latest GIT.
> > > > > > 
> > > > > > > If I run an LXC container on the destination system I still get
> > > > > > > this error. If I stop the LXC container on the source system the
> > > > > > > error disappears. This is again on a RHEL7 system with a 3.10.something
> > > > > > > kernel.
> > > > > 
> > > > > > Looks like you're migrating to a machine with fewer CPUs?
> > > > 
> > > > Yes, that's true. Haven't checked that before. I am using two virtual
> > > > machines and it seems like I have forgotten that I changed the specs.
> > > > 
> > > > But as the migration works when LXC is stopped it would be nice to have
> > > > it working with LXC running. Migrating the container from one system to
> > > > another also works without errors. Only migrating a process unrelated to
> > > > the LXC container does not work.
> > > 
> > > Sorry, I'm not sure I understand this paragraph. What does it mean to
> > > migrate when LXC is stopped?
> > 
> > I meant that I cannot migrate a process while an LXC container is
> > running, as I get the cgroup error from above. When no LXC container is
> > running the cgroup error does not happen. More understandable now?
> 
> Hmm. So is the LXC container contained in the process's subtree? What
> cpuset cgroup is it in (cat /proc/pid/cgroup for the task you're
> trying to migrate)?

My test process is called 'minimal'. It malloc()s a page and reads from
that page in a loop, sleeping between reads. This is its cgroup
information:

# cat /proc/15950/cgroup 
11:memory:/user.slice
10:hugetlb:/
9:devices:/user.slice
8:freezer:/
7:cpuacct,cpu:/user.slice
6:pids:/
5:cpuset:/
4:blkio:/user.slice
3:net_prio,net_cls:/
2:perf_event:/
1:name=systemd:/user.slice/user-0.slice/session-2.scope
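
For reference, the listing above can be read mechanically. The following
is only a minimal sketch of how the hierarchy-ID:controllers:path lines
map to per-controller paths. The function name parse_proc_cgroup is made
up for illustration and is not part of CRIU or LXC.

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path (cgroup v1 layout)."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form hierarchy-ID:controller-list:path.
        # Split at most twice so a ':' inside the path would survive.
        _hier_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers (e.g. "cpuacct,cpu") share one path.
        for controller in controllers.split(","):
            paths[controller] = path
    return paths


# The output of `cat /proc/15950/cgroup` quoted in this mail.
sample = """\
11:memory:/user.slice
10:hugetlb:/
9:devices:/user.slice
8:freezer:/
7:cpuacct,cpu:/user.slice
6:pids:/
5:cpuset:/
4:blkio:/user.slice
3:net_prio,net_cls:/
2:perf_event:/
1:name=systemd:/user.slice/user-0.slice/session-2.scope
"""

paths = parse_proc_cgroup(sample)
print(paths["cpuset"])  # prints "/"
```

So 'minimal' sits in the root of the cpuset hierarchy, not under
/lxc/c7.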

This is the process tree of my container, which is unrelated to the
process above:

19440 pts/0    S      0:00 [lxc monitor] /var/lib/lxc c7
19445 ?        Ss     0:00  \_ /sbin/init
19476 ?        Ss     0:00      \_ /sbin/dhclient -H c7 -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
19477 ?        S      0:10      \_ /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
19502 ?        Ss     0:01      |   \_ postgres: stats collector process   
19503 ?        Ss     0:00      |   \_ postgres: autovacuum launcher process   
19504 ?        Ss     0:00      |   \_ postgres: wal writer process   
19505 ?        Ss     0:00      |   \_ postgres: writer process   
19506 ?        Ss     0:00      |   \_ postgres: checkpointer process   
19507 ?        Ss     0:00      |   \_ postgres: logger process   
19478 ?        Ss     0:00      \_ /usr/sbin/rsyslogd -n
19479 ?        Ss     0:00      \_ /usr/sbin/sshd -D
19480 ?        Ss     0:00      \_ /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
19481 ?        Ss     0:00      \_ /usr/lib/systemd/systemd-logind
19482 ?        Ssl    0:24      \_ /usr/lib/jvm/jre/bin/java -Djava.security.egd=file:/dev/./urandom -classpath /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar 
19483 ?        Ss     0:00      \_ /usr/lib/systemd/systemd-journald

The process 'minimal' and the container 'c7' should be completely
unrelated.
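
To illustrate Cyrill's point about the CPU count, here is a rough sketch
of checking whether a cpuset.cpus value such as the "0-3" from the error
message fits the CPUs a machine offers. This is not how CRIU does it. The
helper names are hypothetical, and the "0-1" used for the destination is
an assumed value, since I have not posted the real CPU count. On a real
system the available list could be read from
/sys/devices/system/cpu/online.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as "0-3" or "0,2-3" into a set."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def missing_cpus(wanted_spec, available_spec):
    """Return the CPUs named in wanted_spec that available_spec lacks."""
    return sorted(parse_cpu_list(wanted_spec) - parse_cpu_list(available_spec))


# "0-3" is the value from the error message; "0-1" is an assumption
# standing in for a destination with fewer CPUs.
print(missing_cpus("0-3", "0-1"))  # prints [2, 3]
```

A non-empty result would be consistent with the write to cpuset.cpus
being rejected on the destination.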

		Adrian