[CRIU] race dumping fds in wily LXC containers

Tycho Andersen tycho.andersen at canonical.com
Tue Jul 14 15:08:37 PDT 2015


On Mon, Jul 13, 2015 at 03:07:10PM +0300, Pavel Emelyanov wrote:
> On 07/01/2015 12:45 AM, Tycho Andersen wrote:
> > Hi all,
> > 
> > I'm trying to debug a strange race that happens sometimes when
> > checkpointing wily ubuntu LXC containers. The symptom of the race is:
> > 
> > (00.020270) Error (files-reg.c:527): Can't link remap to /sys/fs/cgroup/systemd/lxc/w1 (deleted): Operation not permitted
> 
> Ouch... This is a file on sysfs :)

Well, it's not on sysfs, since we have lxcfs [1] bind mounted there
so that the containers get a "more accurate" picture of their cgroups.
I suspect this is probably some strange fuse thing, but I'm not sure
what.

[1]: https://github.com/lxc/lxcfs
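
To confirm which filesystem actually backs a path like the one in the
error, one option is to look it up in /proc/<pid>/mountinfo. Below is
a minimal sketch, not anything CRIU does in this form: the function
name fstype_for_path and the SAMPLE text are made up for illustration
(the real mount layout in the container may differ), and escaped
characters in mount points (e.g. \040) are not handled.

```python
def fstype_for_path(mountinfo_text, path):
    """Return the fs type of the innermost mount containing `path`.

    `mountinfo_text` is the content of a /proc/<pid>/mountinfo file.
    Returns None if no mount matches.
    """
    best_mp, best_type = "", None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Optional fields end at a lone "-"; the fs type follows it.
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if sep + 1 >= len(fields):
            continue
        mount_point, fstype = fields[4], fields[sep + 1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later line
            # wins, since later mounts cover earlier ones.
            if len(mount_point) >= len(best_mp):
                best_mp, best_type = mount_point, fstype
    return best_type


# Hypothetical mountinfo content, for illustration only.
SAMPLE = """\
17 22 0:16 / /sys rw,nosuid,nodev,noexec,relatime shared:7 - sysfs sysfs rw
30 17 0:25 / /sys/fs/cgroup rw shared:9 - tmpfs tmpfs rw,mode=755
45 30 0:40 / /sys/fs/cgroup/systemd rw,relatime shared:20 - fuse.lxcfs lxcfs rw
"""

if __name__ == "__main__":
    print(fstype_for_path(SAMPLE, "/sys/fs/cgroup/systemd/lxc/w1"))
```

On a live system one would pass the text of /proc/self/mountinfo (or
the dumped task's mountinfo) instead of SAMPLE.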

> > The problem here seems to be that the readlink on criu's
> > /proc/self/fd/$the_fd_for_that_file gives a "(deleted)" result, which
> > subsequently confuses things. (In fact, I'm a little confused about
> > how dump_linked_remap() works at all, given that just before it is
> > called the fstatat() fails; but let's ignore that for now.)
> > 
> > The strangest part of all this is that after the dump fails, I can
> > attach to the container and do a readlink on the /proc/pid/fd/$fd for
> > the pid in question, and it gives me the right (i.e. non-"(deleted)")
> > answer.
> 
> The link remap idea is like this.
> 
> fd = open("/foo/bar")
> link("/foo/bar", "/foo/bar2")
> unlink("/foo/bar")
> 
> In this case you will have an inode with two names -- "/foo/bar"
> and "/foo/bar2", but the former name is not visible, since you
> have unlinked it.
> 
> This situation will be detected by CRIU like this:
> 
> First, we check the /proc/pid/fd/fd link for the name. It
> will be "/foo/bar (deleted)". The "(deleted)" suffix is added by the
> kernel when it sees that the dentry in question is not hashed, which
> is the case for the "bar" dentry.
> 
> Then CRIU will try to stat() this name to check whether the file can
> still be accessed by it. For "/foo/bar" this will not be the case.
> 
> Then CRIU will fstat() the descriptor and will see the st_nlink count
> being 1 (the "/foo/bar2" name is alive).

Right, this fstatat() fails because the previous readlink() returned a
"(deleted)", i.e. rpath in the fstatat() call in check_path_remap()
has this "(deleted)".
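
For reference, the ordinary open/link/unlink case described above can
be reproduced on a regular filesystem. This is a minimal sketch in
Python rather than CRIU's C code; the function name
inspect_link_remap_case is made up, and it assumes Linux with /proc
mounted and a temp directory that supports hard links. It shows only
the expected behavior, not the lxcfs case.

```python
import os
import tempfile


def inspect_link_remap_case():
    """Recreate open/link/unlink and report what a checkpointer would see.

    Returns (readlink target, whether the old name still exists,
    link count reported by fstat on the open descriptor).
    """
    with tempfile.TemporaryDirectory() as d:
        bar = os.path.join(d, "bar")
        bar2 = os.path.join(d, "bar2")
        fd = os.open(bar, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            os.link(bar, bar2)   # the inode now has two names
            os.unlink(bar)       # the name we opened is gone
            target = os.readlink("/proc/self/fd/%d" % fd)
            name_exists = os.path.exists(bar)
            nlink = os.fstat(fd).st_nlink
        finally:
            os.close(fd)
    return target, name_exists, nlink


if __name__ == "__main__":
    target, name_exists, nlink = inspect_link_remap_case()
    print("readlink:", target)
    print("old name exists:", name_exists)
    print("st_nlink:", nlink)
```

The odd part in the lxcfs case is that the "(deleted)" suffix shows up
even though the file was apparently never unlinked.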

> After this the link remap will be called.
> 
> In your case it seems the kernel is somehow spoofing the /sys/ file
> names, so that criu is not able to stat() the name in the first place.

I think the initial stat succeeds somehow (since we don't get an error
there and it continues on), but the subsequent readlink tacks on
"(deleted)" and thus the fstatat() on that path fails, which doesn't
make much sense to me. The file definitely exists, it's like there is
some problem readlink()ing it (perhaps because it is sent over a unix
socket or something? not sure).
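
One way to get a baseline for the unix socket theory is to pass an fd
over a socketpair and compare readlink() on both copies. This is only
a sketch under assumptions: the function name
readlink_after_fd_passing is made up, it needs Python 3.9+ for
socket.send_fds()/recv_fds(), and it uses an ordinary temp file in a
single process, so it does not reproduce the lxcfs/FUSE setup. To
test the actual hypothesis one would open a file under the lxcfs
mount instead.

```python
import os
import socket
import tempfile


def readlink_after_fd_passing():
    """Send an fd over a unix socketpair (SCM_RIGHTS) and compare the
    /proc/self/fd link target of the original and the received copy."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.NamedTemporaryFile() as f, a, b:
        sent_fd = f.fileno()
        before = os.readlink("/proc/self/fd/%d" % sent_fd)
        # At least one byte of ordinary data must accompany the fds.
        socket.send_fds(a, [b"x"], [sent_fd])
        _data, fds, _flags, _addr = socket.recv_fds(b, 1, 1)
        try:
            after = os.readlink("/proc/self/fd/%d" % fds[0])
        finally:
            for fd in fds:
                os.close(fd)
    return before, after


if __name__ == "__main__":
    before, after = readlink_after_fd_passing()
    print("sender side:  ", before)
    print("receiver side:", after)
```

On a regular filesystem both sides should report the same path; if
the lxcfs-backed file differs here, that would point at the fd
passing path.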

Tycho

> > Any ideas as to what's going on here? My best guess is a kernel bug
> > related to sending fds (the underlying filesystem is lxcfs, a fuse
> > filesystem, not the traditional cgroup fs), but that's just a hunch.
> > 
> > Any thoughts would be appreciated.
> > 
> > Tycho

