[CRIU] race dumping fds in wily LXC containers

Pavel Emelyanov xemul at parallels.com
Wed Jul 15 03:01:15 PDT 2015


On 07/15/2015 01:08 AM, Tycho Andersen wrote:
> On Mon, Jul 13, 2015 at 03:07:10PM +0300, Pavel Emelyanov wrote:
>> On 07/01/2015 12:45 AM, Tycho Andersen wrote:
>>> Hi all,
>>>
>>> I'm trying to debug a strange race that happens sometimes when
>>> checkpointing wily ubuntu LXC containers. The symptom of the race is:
>>>
>>> (00.020270) Error (files-reg.c:527): Can't link remap to /sys/fs/cgroup/systemd/lxc/w1 (deleted): Operation not permitted
>>
>> Ouch... This is a file on sysfs :)
> 
> Well, it's not on a sysfs, since we have lxcfs [1] bind mounted there
> so that the containers get a "more accurate" picture of their cgroups.
> I suspect this is probably some strange fuse thing, but I'm not sure
> what.
> 
> [1]: https://github.com/lxc/lxcfs

Ah, I see.

>>> The problem here seems to be that the readlink on criu's
>>> /proc/self/fd/$the_fd_for_that_file gives a "(deleted)" result, which
>>> subsequently confuses things. (In fact, I'm a little confused about
>>> how dump_linked_remap() works at all, given that just before it is
>>> called the fstatat() fails; but let's ignore that for now.)
>>>
>>> The strangest part of all this is that after the dump fails, I can
>>> attach to the container and do a readlink on the /proc/pid/fd/$fd for
>>> the pid in question, and it gives me the right (i.e. non-"(deleted)")
>>> answer.
>>
>> The link remap idea is like this.
>>
>> fd = open("/foo/bar")
>> link("/foo/bar", "/foo/bar2")
>> unlink("/foo/bar")
>>
>> In this case you will have an inode with two names -- "/foo/bar"
>> and "/foo/bar2", but the former name is no longer visible, since you
>> have unlinked it.
>>
>> This situation will be detected by CRIU like this:
>>
>> First we will check the /proc/pid/fd/fd link for the name. It
>> will be "/foo/bar (deleted)". The "deleted" suffix is added by the
>> kernel when it sees that the dentry in question is not hashed, which
>> is the case for the "bar" dentry.
>>
>> Then CRIU will try to stat() this name to check whether the file can
>> still be accessed by it. For "/foo/bar" this will not be the case.
>>
>> Then CRIU will fstat() the descriptor and will see the st_nlink count
>> being 1 (the "/foo/bar2" name is alive).
> 
> Right, this fstat() fails because the previous readlink() returned a
> "(deleted)", i.e. rpath in the fstatat() call in check_path_remap()
> has this "(deleted)".

No, we don't check for "(deleted)" to make any decisions. Only the stat/fstat
results are compared. The only thing we do with the suffix is strip it from
the file path if it's there :)
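
As a rough illustration of that stripping step (a sketch in Python, not
CRIU's actual C code; the names DELETED_SUFFIX and strip_deleted are
made up for this example):

```python
# Hypothetical helper: the kernel appends " (deleted)" to the
# /proc/<pid>/fd/<fd> link target when the dentry is unhashed.
DELETED_SUFFIX = " (deleted)"


def strip_deleted(link_target: str) -> str:
    """Return the link target without a trailing ' (deleted)' marker."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return link_target
```

The suffix only changes which path string gets passed on; whether a
remap is needed is still decided from the stat results alone.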

>> After this the link remap will be called.
>>
>> In your case it seems to be the kernel spoofing the /sys/ file names
>> somehow so that criu is not able to stat() the name in the first place.
> 
> I think the initial stat succeeds somehow (since we don't get an error
> there and it continues on), but the subsequent readlink tacks on
> "(deleted)" and thus the fstat of that file fails, which doesn't make
> much sense to me. The file definitely exists, it's like there is some
> problem readlink()ing it (perhaps because it is sent over a unix
> socket or something? not sure).

The reason for going to link remap is that fstat() (on the file descriptor)
succeeded and reported a non-zero link count AND the subsequent fstatat() on
the file path reported ENOENT. (There is also an NFS special case, but I
don't think it applies here.)
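
To make the condition concrete, here is a minimal sketch that recreates
the open/link/unlink sequence from above on an ordinary Linux filesystem
and then applies the same two checks. It is Python rather than CRIU's C,
it leaves out the NFS case and the other comparisons CRIU does, and the
names needs_link_remap and demo are invented for the example:

```python
import errno
import os
import tempfile

DELETED_SUFFIX = " (deleted)"


def needs_link_remap(fd):
    """Simplified check: the inode still has a link, but the name
    recorded in /proc/self/fd/<fd> no longer resolves (ENOENT)."""
    name = os.readlink(f"/proc/self/fd/{fd}")
    if name.endswith(DELETED_SUFFIX):
        name = name[: -len(DELETED_SUFFIX)]
    nlink = os.fstat(fd).st_nlink  # stat through the descriptor
    try:
        os.stat(name)  # stat through the path
    except OSError as e:
        return nlink > 0 and e.errno == errno.ENOENT
    return False


def demo():
    """Open a file, give it a second name, unlink the first name."""
    with tempfile.TemporaryDirectory() as d:
        bar = os.path.join(d, "bar")
        bar2 = os.path.join(d, "bar2")
        fd = os.open(bar, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            os.link(bar, bar2)
            os.unlink(bar)
            target = os.readlink(f"/proc/self/fd/{fd}")
            return target, os.fstat(fd).st_nlink, needs_link_remap(fd)
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

On a regular local filesystem I would expect the link target to end in
"bar (deleted)", the link count to be 1 and the check to say a remap is
needed. The lxcfs case is interesting exactly because the file is not
really unlinked there.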

-- Pavel

