[CRIU] LXC checkpoint/restore HOWTO using upstream tools
Tycho Andersen
tycho.andersen at canonical.com
Fri Sep 26 07:04:39 PDT 2014
Hi Krystof,
On Thu, Sep 25, 2014 at 11:37:28PM +0000, Zmudzinski, Krystof C wrote:
> Tycho,
>
> I missed that patch.
It is in lxc's git as c49ecd787d2 now.
Tycho
> Krystof
>
> -----Original Message-----
> From: Tycho Andersen [mailto:tycho.andersen at canonical.com]
> Sent: Wednesday, September 24, 2014 1:55 PM
> To: Zmudzinski, Krystof C
> Cc: CRIU
> Subject: Re: [CRIU] LXC checkpoint/restore HOWTO using upstream tools
>
> Krystof,
>
> On Wed, Sep 24, 2014 at 05:27:43PM +0000, Zmudzinski, Krystof C wrote:
> > Tycho,
> >
> > I can confirm that the following steps work (executed as root):
> > 1. apt-get build-dep lxc
> > 2. get lxc source and build
> > a. ./autogen.sh
> > b. ./configure
> > c. make install
> > 3. add /usr/local/lib/x86_64-linux-gnu to /etc/ld.so.conf.d/x86_64-linux-gnu.conf
> > 4. ldconfig -v
> > 5. Don't create any containers
> > a. lxcbr0 missing
> > b. cgroups not mounting
> > c. apparmor not installed in correct directory
> > 6. apt-get install lxc
> > 7. reboot (or restart network but we need lxcbr0)
> > 8. lxc-create -t ubuntu -n u1 -- -r trusty -a amd64
> > 9. lxc-start, lxc-attach, and lxc-stop should work
> > 10. get criu
> > 11. apt-get install protobuf-c-compiler
> > 12. edit Makefile and remove install-man
> > 13. build criu (make install)
> > 14. lxc-start, lxc-attach
> > 15. umount /sys/fs/fuse/connections/
> > 16. exit
> > 17. lxc-checkpoint -s -D /tmp/checkpoint -n u1
> > a. this can fail because a socket has data
> > b. just wait and repeat
> > 18. lxc-checkpoint -r -D /tmp/checkpoint -n u1
> > 19. this doesn't work correctly
> > a. container is resumed
> > b. but lxc-attach, lxc-stop don't work
> > 20. solution
> > a. after lxc-start do lxc-info and note the IP address
> > c. don't forget to umount /sys/fs/fuse/connections/
> > d. after resuming do ssh ubuntu@IP_address
> >
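The "just wait and repeat" in step 17 could be scripted. Below is a rough sketch only: `retry` is a hypothetical helper (it is not part of lxc or criu), and it assumes that re-running the dump into the same directory after a failed attempt is safe.

```shell
#!/bin/sh
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_TRIES attempts have failed.
# Hypothetical helper for illustration; not part of lxc or criu.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example, with the flags from step 17:
#   retry 5 2 lxc-checkpoint -s -D /tmp/checkpoint -n u1
```

The retry count and delay are arbitrary; the thread only says that waiting and repeating eventually works.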
> >
> > But we are back to the original problems I reported a while ago:
> > 1. After resume the container is gone and the only thing left is
> > lxc-checkpoint process
>
> There was a bug in how lxc-checkpoint detected criu exiting in some cases (in particular, it would hang if criu was killed by a signal). I just sent a patch to the ML about that this morning.
>
> > 2. lxc-info reports that the container is running but shows a couple of
> > errors
> > 3. After killing the lxc-checkpoint process, lxc-info correctly shows
> > that the container is stopped
> >
> > So I removed DECLARE_ARG("--restore-detached"); from lxccontainer.c,
> > 1. The container is running but lxc-attach, lxc-info lxc-stop just hang.
> > 2. Sometimes the process tree has these defunct processes:
>
> Yes, lxc wants to be the parent of the restored process, so we need --restore-detached.
>
> > 30889 pts/0 S 0:00 lxc-checkpoint -r -D /tmp/checkpoint/ -n u1
> > 30890 pts/0 S 0:00 \_ /usr/local/sbin/criu restore --tcp-establishe
> > 30892 ? Ss 0:00 \_ /sbin/init
> > 30934 ? S 0:00 \_ upstart-file-bridge --daemon
> > 30935 ? S 0:00 \_ upstart-socket-bridge --daemon
> > 30936 ? Ss 0:00 \_ /sbin/getty -8 38400 tty1
> > 30937 ? Ss 0:00 \_ /sbin/getty -8 38400 console
> > 30938 ? Ss 0:00 \_ /sbin/getty -8 38400 tty3
> > 30939 ? Ss 0:00 \_ /sbin/getty -8 38400 tty2
> > 30940 ? Ss 0:00 \_ /sbin/getty -8 38400 tty4
> > 30941 ? Ss 0:00 \_ cron
> > 30942 ? S 0:00 \_ upstart-udev-bridge --daemon
> > 30943 ? Ss 0:00 \_ /usr/sbin/sshd -D
> > 30944 ? Zs 0:00 \_ [criu] <defunct>
> > 30945 ? Zs 0:00 \_ [criu] <defunct>
> > 30946 ? Ss 0:00 \_ /lib/systemd/systemd-udevd --daemon
> >
> > and restore.log shows this:
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > 233: Error (files-reg.c:820): File var/log/auth.log has bad size 27848 (expect 27588)
>
> This to me looks like the real cause of all the problems that you're having. Something is writing to var/log/auth.log after the dump takes place, and criu doesn't like that.
>
> Tycho
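One way to catch this before attempting a restore is to record the file's size right after the dump and compare it later. This is a sketch under assumptions: `file_size` and `size_unchanged` are hypothetical helpers, and the path to auth.log inside the container's rootfs depends on your setup.

```shell
#!/bin/sh
# file_size FILE: print the size of FILE in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# size_unchanged FILE EXPECTED_BYTES
# Succeed only if FILE exists and still has the recorded size.
# Returns 2 if FILE is missing, 1 if the size differs.
size_unchanged() {
    [ -f "$1" ] || return 2
    actual=$(file_size "$1")
    if [ "$actual" -ne "$2" ]; then
        echo "$1: size is $actual, expected $2" >&2
        return 1
    fi
    return 0
}

# Example (the rootfs variable is an assumption about where the container lives):
#   expected=$(file_size "$rootfs/var/log/auth.log")   # right after the dump
#   size_unchanged "$rootfs/var/log/auth.log" "$expected" || echo "file changed since dump"
```

This only detects the mismatch; it does not say which process wrote to the file.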
>
> > 291: Error (sk-unix.c:695): Can't connect 0xe86b socket: Connection refused
> >
> > But sometimes the tree looks OK:
> > 31711 pts/1 S 0:00 lxc-checkpoint -r -D /tmp/checkpoint/ -l DEBUG -n u1
> > 31712 pts/1 S 0:00 \_ /usr/local/sbin/criu restore --tcp-established --evasive-devices --file-locks --link-remap --manage-cgroups --action-script /usr/local/libexec/lxc/lxc-restore
> > 31714 ? Ss 0:00 \_ /sbin/init
> > 31756 ? S 0:00 \_ upstart-file-bridge --daemon
> > 31759 ? S 0:00 \_ upstart-socket-bridge --daemon
> > 31763 ? Ss 0:00 \_ /usr/sbin/sshd -D
> > 31764 ? Ss 0:00 \_ cron
> > 31765 ? Ss 0:00 \_ dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
> > 31766 ? S 0:00 \_ upstart-udev-bridge --daemon
> > 31767 ? Ssl 0:00 \_ rsyslogd
> > 31768 ? Ss 0:00 \_ /lib/systemd/systemd-udevd --daemon
> > 31783 tty7 Ss+ 0:00 \_ /sbin/getty -8 38400 console
> > 32135 ? Ss 0:00 \_ /sbin/getty -8 38400 tty1
> > 32139 ? Ss 0:00 \_ /sbin/getty -8 38400 tty3
> > 32142 ? Ss 0:00 \_ /sbin/getty -8 38400 tty2
> > 32143 ? Ss 0:00 \_ /sbin/getty -8 38400 tty4
> >
> > and restore.log just has these:
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> >
> > I can still do ssh.
> >
> > Krystof
> >
> > -----Original Message-----
> > From: Tycho Andersen [mailto:tycho.andersen at canonical.com]
> > Sent: Wednesday, September 24, 2014 9:25 AM
> > To: Zmudzinski, Krystof C
> > Cc: CRIU
> > Subject: Re: [CRIU] LXC checkpoint/restore HOWTO using upstream tools
> >
> > Hi Krystof,
> >
> > On Wed, Sep 24, 2014 at 04:14:00PM +0000, Zmudzinski, Krystof C wrote:
> > > Tycho,
> > >
> > > I have finally succeeded in configuring my system so that I can suspend and resume my container using LXC built from source.
> >
> > Cool!
> >
> > > I started with a fresh Ubuntu 14.04 LTS.
> > >
> > > The trick that did it for me was to first follow the instructions from
> > > http://bazaar.launchpad.net/~tycho-s/+junk/snapshot-instructions/view/head:/README
> > >
> > > This, by itself, didn't quite work as I was unable to start my container. The error messages had to do first with missing cgroups and, after I fixed that, with missing apparmor profiles.
> > >
> > > So I just installed the default lxc package (i.e., sudo apt-get
> > > install lxc)
> >
> > I guess the packaging probably also sets up a bridge, if you didn't have one already. Perhaps I can add that to the development instructions there.
> >
> > > And everything works now.
> > >
> > > I think that the development package still wants to use
> > > /etc/apparmor.d/lxc/ instead of /usr/local/etc/apparmor.d/lxc/
> > > because that was the last thing that installing the default lxc fixed.
> > >
> > > What I still don't understand is that I can't find lxc-container-default-with-mounting anywhere on my machine.
> >
> > Is it just a typo in the name? The correct profile name is "lxc-default-with-mounting" (i.e. no -container-).
> >
> > Tycho
> >
> > > I could be completely wrong here so I will try repeating these steps again to confirm.
> > >
> > > Krystof
> > >
> > >
> > > -----Original Message-----
> > > From: Tycho Andersen [mailto:tycho.andersen at canonical.com]
> > > Sent: Tuesday, September 23, 2014 1:36 PM
> > > To: Zmudzinski, Krystof C
> > > Cc: CRIU
> > > Subject: Re: [CRIU] LXC checkpoint/restore HOWTO using upstream
> > > tools
> > >
> > > Hi Krystof,
> > >
> > > On Tue, Sep 23, 2014 at 08:29:30PM +0000, Zmudzinski, Krystof C wrote:
> > > > In fact, the last point at which everything seems to work is just before ldconfig -v: I can start, attach and stop. But as soon as I execute ldconfig, lxc-start says "lxc-start: Executing '/sbin/init' with no configuration file may crash the host" and lxc-info says that the container doesn't exist. Reverting ldconfig doesn't change anything.
> > >
> > > Yes, that's (presumably) because it is looking at a new lxcpath (/usr/local/var/lib/lxc vs. /var/lib/lxc) when you're using the right liblxc.so. You have a few options here:
> > >
> > > 1. just re-create your container under /usr/local (or just copy it)
> > > 2. change your lxcpath to point to the old container
> > > 3. re-compile with --prefix /usr so that it installs over the packaged
> > >    lxc (I don't recommend this :)
> > >
> > > You can verify which lxcpath things are pointed to via:
> > >
> > > sudo lxc-config lxc.lxcpath
> > >
> > > Tycho
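To see which of the two lxcpaths actually holds a container, something like the following could help. It is only a sketch: `find_container` is a hypothetical helper, and it assumes the usual layout in which a container's configuration lives at <lxcpath>/<name>/config.

```shell
#!/bin/sh
# find_container NAME LXCPATH...
# Print every LXCPATH that contains NAME/config.
# Succeeds if at least one match was found, fails otherwise.
# Hypothetical helper; assumes the <lxcpath>/<name>/config layout.
find_container() {
    name=$1
    shift
    found=1
    for path in "$@"; do
        if [ -f "$path/$name/config" ]; then
            echo "$path"
            found=0
        fi
    done
    return "$found"
}

# Example, with the two paths mentioned above:
#   find_container u1 /var/lib/lxc /usr/local/var/lib/lxc
```

If the container shows up only under /var/lib/lxc, that matches the symptom described: the newly built tools look under /usr/local/var/lib/lxc and find nothing.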
> > >
> > > >
> > > > Krystof
> > > >
> > > > -----Original Message-----
> > > > From: Tycho Andersen [mailto:tycho.andersen at canonical.com]
> > > > Sent: Monday, September 22, 2014 4:32 PM
> > > > To: Zmudzinski, Krystof C
> > > > Cc: CRIU
> > > > Subject: Re: [CRIU] LXC checkpoint/restore HOWTO using upstream
> > > > tools
> > > >
> > > > Hi Krystof,
> > > >
> > > > On Mon, Sep 22, 2014 at 09:44:24PM +0000, Zmudzinski, Krystof C wrote:
> > > > > Tycho,
> > > > >
> > > > > After following the instructions on http://criu.org/LXC, I wanted to install the latest source for LXC and make some changes. How can I build and install the new lxc-* so I overwrite what sudo apt-get install lxc did? Right now, following the instructions on http://bazaar.launchpad.net/~tycho-s/+junk/snapshot-instructions/view/head:/README everything gets installed in different directories and things stop working completely.
> > > >
> > > > What's the output of ldd `which lxc-checkpoint`? Did you do the bit in there about modifying the ld.so.conf?
> > > >
> > > > Tycho
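For reference, the interesting line in that ldd output is the one for liblxc. A small filter can pull it out; this is a sketch, `liblxc_path` is a hypothetical helper, and it assumes the usual "name => path (address)" format that ldd prints.

```shell
#!/bin/sh
# liblxc_path: read ldd output on stdin and print the path that
# liblxc resolves to, or "not found" if the loader could not resolve it.
# Hypothetical helper; assumes ldd's "name => path (address)" line format.
liblxc_path() {
    awk '$1 ~ /^liblxc\.so/ && $2 == "=>" {
        if ($3 ~ /^\//) print $3; else print "not found"
    }'
}

# Example:
#   ldd "$(command -v lxc-checkpoint)" | liblxc_path
```

A path under /usr/local would suggest the ld.so.conf change took effect; a path under /usr would suggest the packaged library is still being picked up.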
> > > >
> > > > > I always get this error when I try to start a container:
> > > > > > lxc-start 1410804731.572 ERROR lxc_cgfs - Could not find writable mount point for cgroup hierarchy 3 while trying to create cgroup.
> > > > > > lxc-start 1410804731.572 ERROR lxc_start - failed creating cgroups
> > > > > > lxc-start 1410804731.599 ERROR lxc_start - failed to spawn 'u1'
> > > > > > lxc-start 1410804736.604 ERROR lxc_start_ui - The container failed to start.
> > > > >
> > > > > Krystof
> > > > >
> > > > > -----Original Message-----
> > > > > From: Tycho Andersen [mailto:tycho.andersen at canonical.com]
> > > > > Sent: Friday, September 19, 2014 10:53 AM
> > > > > To: Zmudzinski, Krystof C
> > > > > Cc: CRIU
> > > > > Subject: Re: [CRIU] LXC checkpoint/restore HOWTO using upstream
> > > > > tools
> > > > >
> > > > > Hi Krystof,
> > > > >
> > > > > On Fri, Sep 19, 2014 at 05:18:32PM +0000, Zmudzinski, Krystof C wrote:
> > > > > > lxc-checkpoint fails. I did a fresh install of ubuntu 14.04 and followed your instructions. I also installed criu-1.3.1.
> > > > > >
> > > > > > From dump.log:
> > > > > > (00.377632) Error (mount.c:805): fusectl isn't empty: 8388625
> > > > >
> > > > > Ah, that is a good point. Right now CRIU doesn't support dumping any fuse filesystems (i.e., /sys/fs/fuse/connections needs to be empty). I guess stock desktop ubuntu might have some of these enabled. You can either uninstall any fuse modules or try ubuntu-server (or better yet, the cloud images) as a host.
> > > > >
> > > > > I guess maybe we should have lxc-checkpoint look for some of this stuff too, vs. just config.
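Until such a check exists in lxc-checkpoint, a pre-flight test could be done by hand. The sketch below is an assumption about how one might do it, not something lxc or criu ships; `dir_is_empty` is a hypothetical helper.

```shell
#!/bin/sh
# dir_is_empty DIR
# Succeed if DIR exists and has no entries (hidden ones included).
# Returns 2 if DIR is not a directory, 1 if it has entries.
dir_is_empty() {
    [ -d "$1" ] || return 2
    [ -z "$(ls -A "$1")" ]
}

# Example, run before lxc-checkpoint:
#   dir_is_empty /sys/fs/fuse/connections || echo "fusectl is not empty, dump will likely fail" >&2
```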
> > > > >
> > > > > > The container does contain ttys and console:
> > > > > > 5784 ? Ss 0:00 \_ lxc-start -n cn_01
> > > > > > 5804 ? Ss 0:01 \_ /sbin/init
> > > > > > 5998 ? S 0:00 \_ upstart-udev-bridge --daemon
> > > > > > 6009 ? Ss 0:00 \_ /lib/systemd/systemd-udevd --daemon
> > > > > > 6077 ? S 0:00 \_ upstart-socket-bridge --daemon
> > > > > > 6079 ? Ssl 0:00 \_ rsyslogd
> > > > > > 6085 ? S 0:00 \_ upstart-file-bridge --daemon
> > > > > > 6117 ? Ss 0:00 \_ dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
> > > > > > 6202 ? Ss 0:00 \_ cron
> > > > > > 6207 ? Ss 0:00 \_ /usr/sbin/sshd -D
> > > > > > 7083 ? Ss 0:00 \_ /sbin/getty -8 38400 tty2
> > > > > > 7084 ? Ss 0:00 \_ /sbin/getty -8 38400 tty4
> > > > > > 7085 ? Ss 0:00 \_ /sbin/getty -8 38400 tty3
> > > > > > 7086 ? Ss 0:00 \_ /sbin/getty -8 38400 console
> > > > > > 7087 ? Ss 0:00 \_ /sbin/getty -8 38400 tty1
> > > > > >
> > > > > > I don't think it's enough to just add this to the config file:
> > > > > > # hax for criu
> > > > > > lxc.console = none
> > > > > > lxc.tty = 0
> > > > > > lxc.cgroup.devices.deny = c 5:1 rwm
> > > > > >
> > > > > > because there is this at the very beginning:
> > > > > > # Common configuration
> > > > > > lxc.include = /usr/share/lxc/config/ubuntu.common.conf
> > > > >
> > > > > Why is that a problem? The later values in the config override any earlier ones. lxc-checkpoint will also complain and refuse to dump a container that doesn't have the right configuration bits set, so if it tried to dump, that means it thinks the config is valid.
> > > > >
> > > > > Tycho
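To illustrate the "later values override earlier ones" point for single-valued keys such as lxc.console and lxc.tty, here is a sketch that prints the last assignment of a key in one file. `last_value` is a hypothetical helper; it does not follow lxc.include, so it only shows the last-wins rule within a single file and is no substitute for lxc's own parser.

```shell
#!/bin/sh
# last_value KEY FILE
# Print the value of the last "KEY = value" line in FILE.
# Fails if KEY is never assigned. Does NOT follow lxc.include.
last_value() {
    awk -F= -v key="$1" '
        {
            # Key is everything before the first "=", trimmed.
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)
        }
        k == key {
            # Value is everything after the first "=", trimmed.
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            val = v
            seen = 1
        }
        END { if (seen) print val; else exit 1 }
    ' "$2"
}

# Example (the config path is an assumption about your lxcpath):
#   last_value lxc.console /var/lib/lxc/cn_01/config
```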