[CRIU] AUFS Support in CRIU
Pavel Emelyanov
xemul at parallels.com
Mon Aug 18 03:03:20 PDT 2014
On 08/16/2014 02:12 AM, Saied Kazemi wrote:
> > No, we cannot use mnt_id because it's uninitialized (-1). Even if it were initialized, I
> > am not sure mnt_id alone would suffice to identify AUFS pathnames that should be replaced.
>
> It's uninitialized, because CRIU hasn't read one in. I mean -- if we look at the fdinfo
> data of a file opened on AUFS, would the mnt_id value be the one from AUFS mount, or from
> the underlying FS mount?
>
>
> I get -1 because my system, a vanilla Ubuntu 14.04 (kernel 3.13.0), does not have mnt_id in fdinfo.
> So it's not safe to rely on mnt_id.
BTW, you might have problems dumping Docker containers without that patch -- this
field is required to properly resolve multiple mount namespaces if a container
has more than one.
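
For reference, a minimal sketch of checking whether a kernel exposes mnt_id in fdinfo. The helper names parse_fdinfo_mnt_id() and fd_mnt_id() are hypothetical, not CRIU functions; the sketch assumes the field appears on its own line as "mnt_id:" followed by a number, and it reports -1 when the field is missing, as on the 3.13 kernel mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan the text of a /proc/<pid>/fdinfo/<fd> file for the "mnt_id:" field.
 * Returns the mount ID, or -1 when the field is absent.
 * Hypothetical helper, not part of CRIU.
 */
int parse_fdinfo_mnt_id(const char *text)
{
	const char *line = text;

	assert(text != NULL);
	while (*line) {
		int id;

		/* The space in the format matches the tab after the colon. */
		if (sscanf(line, "mnt_id: %d", &id) == 1)
			return id;
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;		/* step past the newline to the next field */
	}
	return -1;
}

/* Read /proc/self/fdinfo/<fd> and return its mnt_id, or -1 if unavailable. */
int fd_mnt_id(int fd)
{
	char path[64], buf[1024];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_fdinfo_mnt_id(buf);
}
```

A caller that gets -1 from fd_mnt_id() cannot tell mount namespaces apart by this field and has to fall back to something else or refuse the dump.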
> You would see /etc/hosts, which is correct.
>
> # ls -l /proc/16958/fd
> total 0
> lr-x------ 1 root root 64 Aug 15 14:08 0 -> pipe:[96434]
> l-wx------ 1 root root 64 Aug 15 14:08 1 -> pipe:[96435]
> l-wx------ 1 root root 64 Aug 15 14:08 2 -> pipe:[96436]
> lr-x------ 1 root root 64 Aug 15 14:08 3 -> /etc/hosts
>
>
>
>
> And another thing -- if we do open("/proc/pid/map_files/<something>") what would we see in the
> newly appeared /proc/pid/fd/$fd ?
>
> If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one, while /proc/pid/map_files/ shows
> "spoiled" paths, i.e. the ones from branches, then this is a kernel bug -- both directories
> should behave the same way.
>
>
> You would see the correct path, /etc/hosts.
>
> # ls -l /proc/16958/map_files
> total 0
> lr-------- 1 root root 64 Aug 15 14:09 400000-4c0000 -> /var/lib/docker/aufs/diff/<ID>/a.out
> lr-------- 1 root root 64 Aug 15 14:09 6bf000-6c2000 -> /var/lib/docker/aufs/diff/<ID>/a.out
> lr-------- 1 root root 64 Aug 15 14:09 7f8f30eb7000-7f8f30eb8000 -> /etc/hosts
>
> It's important to note that while both /proc/<PID>/fd and /proc/<PID>/map_files show the same correct
> path, as you see above, the executable file (a.out) shows the path within the AUFS branch which is
> wrong. AUFS should not reveal its internals. This also contradicts what smaps shows (see below), hence
> the need to "fix up" branch paths that are obtained by readlink()ing map_files entries.
>
> # grep '^[0-9a-f]' /proc/16958/smaps
> 00400000-004c0000 r-xp 00000000 00:25 62 /a.out
> 006bf000-006c2000 rw-p 000bf000 00:25 62 /a.out
> 006c2000-006c5000 rw-p 00000000 00:00 0
> 02405000-02428000 rw-p 00000000 00:00 0 [heap]
> 7f8f30eb7000-7f8f30eb8000 r--p 00000000 08:01 311525 /etc/hosts
> 7f8f30eb8000-7f8f30eb9000 rw-p 00000000 00:00 0
> 7fff457f5000-7fff45816000 rw-p 00000000 00:00 0 [stack]
> 7fff45980000-7fff45982000 r-xp 00000000 00:00 0 [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
I'm confused, sorry :( So _some_ files from AUFS are shown with correct paths and _some_
with spoiled ones. And the "spoiled" paths you've seen so far belong only to executables
found in the map_files directory. Right?
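
To make the "fix up" under discussion concrete, here is a rough sketch of turning a branch path read from map_files back into a path from the root. strip_branch_prefix() is a hypothetical helper, not the patch's fixup_aufs_path(); it assumes the list of branch directories is already known and that a path matching no branch should be left alone rather than treated as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If `path` lies inside one of the `branches`, rewrite it in place as a
 * path from the root and return the new length. Return 0 when no branch
 * matches, leaving `path` untouched. Hypothetical helper for illustration.
 */
int strip_branch_prefix(char *path, const char *const *branches, size_t nr)
{
	size_t i;

	assert(path != NULL);
	for (i = 0; i < nr; i++) {
		size_t blen = strlen(branches[i]);

		/* Match whole path components only: "/b/ab" is not inside "/b/a". */
		if (strncmp(path, branches[i], blen) != 0 || path[blen] != '/')
			continue;

		/* Shift the remainder (including the NUL) over the prefix. */
		memmove(path, path + blen, strlen(path + blen) + 1);
		return (int)strlen(path);
	}
	return 0;
}
```

With the single branch /var/lib/docker/aufs/diff/<ID>, the a.out entry above becomes /a.out, which agrees with smaps, while /etc/hosts is returned unchanged.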
>
>
>
> >
> >
> > > The vma fd is dumped in dump_filemap() and in that place we call
> > > get_fd_mntid() for missing mount ID. Can we fixup the device there too?
> > >
> > >
> > > The way I handle it is like this: If the link file descriptor points to an AUFS branch name,
> > > we replace it with a pathname from root. Here is the call sequence from dump_filemap():
> > >
> > > dump_filemap()
> > >     dump_one_reg_file(lfd)
> > >         fill_fdlink(lfd)
> > >             read_fd_link(lfd)
> > >                 readlink(lfd)       // returns path in branch
> > >             fixup_aufs_branch()     // replaces path in branch with path from root
> > >         check_path_remap()
> > >
> > > In other words, when we see a pathname in a branch, we replace it with a pathname from root
> > > as if we never saw the branch pathname. When using a link file descriptor in fstat(), because
> > > the kernel returns the stat info of the pathname in branch, we use stat() with a pathname
> > > from the root instead of fstat(). If we didn't do this, we'd get different device/inode
> > > values and CRIU fails with an error message like "Unaccessible path opened 33:23, need 2049:53764".
> >
> > So if you stat() an fd opened on AUFS you get dev:inode pair from AUFS, not from the
> > underlying ext4?
> >
> >
> > Exactly. That's why we need to always stat the file either from the AUFS root or from the branch where
> > it really lives. We cannot stat it from one path and expect to see the same dev:ino from a different path,
> > although it's really the same file!
>
> So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device and inode, then we
> can use this to distinguish AUFS files from non AUFS ones.
>
> BTW, we have a similar problem with btrfs -- it reports a "wrong" device number in some
> places, so we have to use the phys_stat_dev_match() helper to compare devices; maybe
> something similar can be done with AUFS.
>
>
> The right solution is to fix map_files. Once map_files shows /foo/bar, instead of /path-to-branch/foo/bar,
> we won't need any "fix up" code.
If the answer to my question above is "yes", then this is a bug in AUFS, not
in map_files (since the links show correct paths for most of the files).
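
The dev:inode mismatch described above comes down to comparing what stat() reports for a path with what fstat() reports for a descriptor. A minimal sketch of that check follows; open_checked() is a hypothetical helper, not CRIU's check_path_remap(), and the error text only imitates the message quoted earlier.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `path` read-only and confirm that the descriptor refers to the same
 * dev:inode pair that stat() reports for the path. Returns the descriptor,
 * or -1 on any failure or mismatch. Hypothetical helper for illustration.
 */
int open_checked(const char *path)
{
	struct stat by_path, by_fd;
	int fd;

	assert(path != NULL);
	if (stat(path, &by_path) < 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &by_fd) < 0 ||
	    by_fd.st_dev != by_path.st_dev ||
	    by_fd.st_ino != by_path.st_ino) {
		fprintf(stderr, "Path %s opened %lu:%lu, need %lu:%lu\n", path,
			(unsigned long)by_fd.st_dev, (unsigned long)by_fd.st_ino,
			(unsigned long)by_path.st_dev, (unsigned long)by_path.st_ino);
		close(fd);
		return -1;
	}
	return fd;
}
```

Here both values come from the same path, so they agree. The failure in this thread arises when the expected pair was obtained through the branch path and the file is then looked up from the AUFS root, or the other way round.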
> OK, can we rework this part like this -- we add the .post_parse callback on the fstype,
> and once parse_mountinfo meets an fs with one set it puts the created mi into separate
> list and walks it, calling ->post_parse(), after all the mount-infos get collected.
>
>
> Yes, this would work but the good news is that we no longer need the "reference" stuff (see below).
Great! :)
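
For the record, the .post_parse idea quoted above could look roughly like this. It is only a sketch under simplifying assumptions: struct fstype_sketch, struct mount_sketch, run_post_parse() and count_post_parse() are hypothetical stand-ins, not CRIU's real structures, and the collected mounts are assumed to sit in an array.

```c
#include <assert.h>
#include <stddef.h>

struct mount_sketch;

/* Hypothetical filesystem descriptor with an optional deferred hook. */
struct fstype_sketch {
	const char *name;
	int (*post_parse)(struct mount_sketch *m);
};

/* Hypothetical mount entry; only the fields this sketch needs. */
struct mount_sketch {
	const struct fstype_sketch *fstype;
	struct mount_sketch *next_deferred;	/* link in the deferred list */
};

int post_parse_calls;	/* bumped by the example hook below */

/* Example hook: a real one would fix up the mount entry here. */
int count_post_parse(struct mount_sketch *m)
{
	(void)m;
	post_parse_calls++;
	return 0;
}

/*
 * After all `nr` mounts are collected, queue those whose fstype has a
 * post_parse hook on a separate list, then walk that list calling the hook.
 * Returns 0 on success, -1 as soon as a hook fails.
 */
int run_post_parse(struct mount_sketch *mounts, size_t nr)
{
	struct mount_sketch *deferred = NULL, **tail = &deferred, *m;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!mounts[i].fstype || !mounts[i].fstype->post_parse)
			continue;
		mounts[i].next_deferred = NULL;
		*tail = &mounts[i];
		tail = &mounts[i].next_deferred;
	}

	for (m = deferred; m; m = m->next_deferred)
		if (m->fstype->post_parse(m) < 0)
			return -1;
	return 0;
}
```

Running the hooks only after collection finishes means a hook can look at every other mount entry, which is the reason for deferring in the first place.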
> Yes, I confirmed that "mnt: Don't delay external mount points" fixed the issue so I ripped
> out all "reference" code (yay!).
Cool!
> Please review/test the attached patch set, which I rebased to head before sending. Once the
> final changes are made, I will send you a signed-off patch with a descriptive commit message.
So far I have only four comments; please find them inline.
> @@ -197,6 +199,20 @@ int fill_fdlink(int lfd, const struct fd_parms *p, struct fd_link *link)
> return -1;
> }
>
> + /*
> + * If the link file descriptor (lfd) points to a
> + * file in an AUFS branch, the pathname should be
> + * replaced with a pathname from root.
> + */
> + if (opts.aufs) {
> + int n = sizeof link->name - 1;
> + n = fixup_aufs_path(&link->name[1], n, true);
Since you say that we only see spoiled paths in map_files, do we need
the path fixup here? Isn't the one performed when parsing the mappings
enough?
> + if (n < 0)
> + return -1;
> + if (n > 0)
> + len = n;
> + }
> +
> link->len = len + 1;
> return 0;
> }
> +int fixup_aufs_path(char *path, int size, bool chop)
> +{
> + char rpath[PATH_MAX];
> + int n;
> + int blen;
> +
> + if (aufs_branches == NULL) {
> + pr_err("No aufs branches to search for %s\n", path);
> + return -1;
Presumably, if we don't have AUFS we shouldn't fail :)
> + }
> +
> +int parse_mountinfo_aufs_sbinfo(pid_t pid, char *sbinfo, int len)
> +{
> + char line[PATH_MAX];
> + char *fstype = NULL;
> + char *opt = NULL;
> + char *cp;
> + int n, ret;
> +
> + n = get_mntent_by_mountpoint(pid, "/", line, sizeof line);
> + if (n < 0)
> + return -1;
> +
Presumably, this would get fixed by the rework of the mountinfo parser, right?
> +int aufs_parse(struct mount_info *new)
> +{
> + char *cp;
> +
> + if (!opts.aufs_root || strcmp(new->mountpoint, "./"))
If we've met AUFS but no --aufs-root is specified, shouldn't we fail the dump?
> + return 0;
> +
> + cp = malloc(strlen(opts.aufs_root) + 1);
> + if (!cp) {
> + pr_err("Cannot allocate memory for %s\n", opts.aufs_root);
> + return -1;
> + }
> + strcpy(cp, opts.aufs_root);
> +
> + pr_debug("Replacing %s with %s\n", new->root, opts.aufs_root);
> + free(new->root);
> + new->root = cp;
What if we have more than one AUFS mountpoint with different roots? I'm
OK if we only support one, but then we should fail on the second mountpoint we meet.
> +
> + return 0;
> +}
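
A rough sketch of how the two comments on aufs_parse() could be addressed together: fail when no AUFS root was given, and fail on a second AUFS mountpoint. struct mnt_entry and replace_aufs_root() are hypothetical stand-ins for the real mount_info and aufs_parse(), and the error messages are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the mount entry; only the field used here. */
struct mnt_entry {
	char *root;	/* heap-allocated, may be NULL */
};

/* Set once the first AUFS root has been substituted. */
static bool aufs_root_replaced;

/*
 * Replace entry->root with a copy of aufs_root. Fails if no root was
 * given or if a root was already replaced, since this sketch supports
 * a single AUFS mountpoint only. Returns 0 on success, -1 on failure.
 */
int replace_aufs_root(struct mnt_entry *entry, const char *aufs_root)
{
	size_t len;
	char *cp;

	assert(entry != NULL);
	if (!aufs_root) {
		fprintf(stderr, "AUFS mount found but no AUFS root given\n");
		return -1;
	}
	if (aufs_root_replaced) {
		fprintf(stderr, "Only one AUFS mountpoint is supported\n");
		return -1;
	}

	len = strlen(aufs_root) + 1;
	cp = malloc(len);
	if (!cp) {
		fprintf(stderr, "Cannot allocate memory for %s\n", aufs_root);
		return -1;
	}
	memcpy(cp, aufs_root, len);

	free(entry->root);
	entry->root = cp;
	aufs_root_replaced = true;
	return 0;
}
```

On either failure the entry is left untouched, so the caller can abort the dump without further cleanup.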
Thanks,
Pavel