[CRIU] lxc-checkpoint restore failed

Tycho Andersen tycho.andersen at canonical.com
Wed Oct 14 11:54:41 PDT 2015


Hi Jason, Pavel,

On Wed, Oct 14, 2015 at 03:03:37PM +0300, Pavel Emelyanov wrote:
> Adding Tycho (an LXC guy) to the discussion.
> 
> On 10/14/2015 06:56 AM, Jason Lee wrote:
> > Hi all!
> > Recently I have been using lxc-checkpoint to checkpoint/restore a Linux container.
> > Dumping with CRIU works without problems, but restoring a container with
> > lxc-checkpoint -r fails.
> > BTW, my host OS is Debian 8. Here is my environment:
> > 
> > lxc.rootfs = /usr/local/var/lib/lxc/d1/rootfs
> > lxc.include = /usr/local/share/lxc/config/debian.common.conf
> > lxc.utsname = d1
> > lxc.arch = amd64
> > lxc.tty = 0
> > lxc.pts = 1
> > lxc.console = none
> > 
> > #lxc.cap.drop = sys_module mac_admin mac_override sys_time
> > lxc.cgroup.devices.deny = c 5:1 rwm
> > lxc.aa_allow_incomplete = 1
> > lxc.network.type = veth
> > lxc.network.flags = up
> > # that's the interface defined above in host's interfaces file
> > lxc.network.link = br0
> > # name of network device inside the container,
> > # defaults to eth0, you could choose a name freely
> > # lxc.network.name = lxcnet0
> > lxc.network.hwaddr = 00:16:3e:d2:29:be
> > 
> > mount point:
> > root@dslab:/home# mount
> > sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
> > proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> > udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=1002688,mode=755)
> > devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
> > tmpfs on /run type tmpfs (rw,nosuid,relatime,size=1607656k,mode=755)
> > /dev/sda6 on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)
> > securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
> > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> > tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
> > tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> > cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
> > pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
> > cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
> > cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
> > cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
> > cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
> > cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
> > cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
> > cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
> > cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
> > cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
> > cgroup on /sys/fs/cgroup/debug type cgroup (rw,nosuid,nodev,noexec,relatime,debug)
> > cgroup on /sys/fs/cgroup/palloc type cgroup (rw,nosuid,nodev,noexec,relatime,palloc)
> > systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=23,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
> > debugfs on /sys/kernel/debug type debugfs (rw,relatime)
> > mqueue on /dev/mqueue type mqueue (rw,relatime)
> > hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
> > /dev/sda4 on /boot type ext4 (rw,relatime,data=ordered)
> > rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
> > 
> > root@dslab:/home# lxc-checkpoint -r -n d1 -D /home/checkpoint_dir/d2/
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/palloc/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/debug/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/hugetlb/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/perf_event/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/blkio/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/net_cls,net_prio/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/freezer/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/devices/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/memory/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpu,cpuacct/lxc/d1-2
> > lxc-checkpoint: cgfs.c: cgroup_rmdir: 207 Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset/lxc/d1-2
> > lxc-checkpoint: lxccontainer.c: do_lxcapi_restore: 3772 restore process died
> > Restoring d1 failed.

I've seen these cgroup_rmdir errors from the restore code before, and
they're benign (basically, the restore failed and not all the tasks
were wait()ed on before we tried to delete the cgroup). That said, it's
ugly and I'll try to post a fix soon.
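
To illustrate the ordering issue described above, here is a minimal Python
sketch. It is not LXC code: the helper name and the use of a plain temporary
directory are assumptions made for illustration. On a real cgroup directory,
rmdir fails with "Device or resource busy" while tasks are still attached; a
plain directory does not behave that way, so the sketch only shows the
"wait for every task first, then remove" order.

```python
import os
import subprocess
import sys
import tempfile


def remove_after_wait(procs, path):
    """Wait for every child process, then remove the directory.

    Hypothetical helper: `path` stands in for a per-container cgroup
    directory. Waiting first guarantees no task is still alive when
    the directory is removed.
    """
    exit_codes = [proc.wait() for proc in procs]
    os.rmdir(path)
    return exit_codes


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    # Three short-lived children stand in for the restored tasks.
    children = [subprocess.Popen([sys.executable, "-c", "pass"])
                for _ in range(3)]
    print(remove_after_wait(children, workdir))
```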

> > Warn  (cr-restore.c:1041): Set CLONE_PARENT | CLONE_NEWPID but it might cause restore problem,because not all kernels support such clone flags combinations!
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36a8 peer 0 (name /run/systemd/notify dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36aa peer 0 (name /run/systemd/private dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36b4 peer 0 (name /run/systemd/shutdownd dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36b6 peer 0 (name /run/systemd/journal/dev-log dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36ba peer 0 (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x36bc peer 0 (name /run/systemd/journal/socket dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x5bad peer 0x70ea (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x6da7 peer 0x3788 (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x6da6 peer 0x5f21 (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x6da8 peer 0x784b (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x6da9 peer 0x6b10 (name /run/systemd/journal/stdout dir -)
> >      1: Warn  (sk-unix.c:1229): sk unix: Can't unlink stale socket 0x6daa peer 0x6159 (name /run/systemd/journal/stdout dir -)
> >     68: Error (sk-packet.c:419): Can't bind packet socket: Invalid argument
> > Error (cr-restore.c:1236): 3159 killed by signal 19
> > Error (cr-restore.c:1236): 3159 killed by signal 19
> > Error (cr-restore.c:1933): Restoring FAILED.

Here is the real problem: bind() is failing, probably because the
unlink above failed. Unfortunately, we don't log the reason bind()
fails. Can you try with the attached patch?

Pavel, perhaps we should apply this so it does report the error?
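
As a rough illustration of why reporting the errno matters, here is a small
Python sketch (not CRIU code; `bind_unix` and the socket path are
hypothetical). It shows a UNIX socket bind failing because a stale socket
file was left on disk, and the error message carrying the reason. Note that
the log above reports "Invalid argument" for a packet socket, so this only
demonstrates the general pattern under the assumption that a stale UNIX
socket is involved.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path):
    """Bind a UNIX stream socket to `path`; on failure, report why.

    Returns (socket, 0) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as exc:
        sock.close()
        # Include the reason, not just the fact that bind() failed.
        print(f"Can't bind {path}: {exc.strerror} (errno {exc.errno})")
        return None, exc.errno
    return sock, 0


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "stale.sock")
    first, _ = bind_unix(path)
    first.close()              # closing does not remove the socket file
    bind_unix(path)            # fails: the stale file is still there
    os.unlink(path)            # remove the stale socket file
    again, err = bind_unix(path)
    print("bind after unlink succeeded:", err == 0)
    again.close()
```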

Tycho

> > --- Checkpoint/Restore ---
> > checkpoint restore: enabled
> > CONFIG_FHANDLE: enabled
> > CONFIG_EVENTFD: enabled
> > CONFIG_EPOLL: enabled
> > CONFIG_UNIX_DIAG: enabled
> > CONFIG_INET_DIAG: enabled
> > CONFIG_PACKET_DIAG: enabled
> > CONFIG_NETLINK_DIAG: enabled
> > File capabilities: enabled
> > 
> > 
> > How can I solve this problem? The same thing happens on Ubuntu.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-unix-report-errors-from-unlinking-stale-sockets.patch
Type: text/x-diff
Size: 1325 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/criu/attachments/20151014/84a58552/attachment.bin>

