[CRIU] AUFS Support in CRIU

Mon Aug 18 03:03:20 PDT 2014

On 08/16/2014 02:12 AM, Saied Kazemi wrote:

>     > No, we cannot use mnt_id because it's uninitialized (-1).  Even if it were initialized, I
>     > am not sure mnt_id alone would suffice to identify AUFS pathnames that should be replaced.
> 
>     It's uninitialized, because CRIU hasn't read one in. I mean -- if we look at the fdinfo
>     data of a file opened on AUFS, would the mnt_id value be the one from AUFS mount, or from
>     the underlying FS mount?
> 
> 
> I get -1 because my system, a vanilla Ubuntu 14.04 (kernel 3.13.0), does not have mnt_id in fdinfo.
> So it's not safe to rely on mnt_id.

BTW, you might have problems dumping Docker containers without the kernel patch that
adds mnt_id to fdinfo. This field is required to properly resolve files when multiple
mount namespaces are met in a container.
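
To make the mnt_id point concrete, here is a minimal sketch of reading the field from
fdinfo and falling back to -1 when the kernel does not report it (as on the 3.13 system
mentioned above). The names parse_fdinfo_mnt_id() and read_fdinfo_mnt_id() are
hypothetical and only for illustration; this is not how CRIU itself does it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not CRIU code): scan the text of a
 * /proc/<pid>/fdinfo/<fd> file for a "mnt_id:" line.
 * Returns the mount ID, or -1 if no such line is present.
 */
int parse_fdinfo_mnt_id(const char *text)
{
	const char *line = text;

	while (line && *line) {
		int id;

		/* Whitespace in the format also matches the tab after the colon. */
		if (sscanf(line, "mnt_id: %d", &id) == 1)
			return id;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical wrapper: read fdinfo for one of our own descriptors.
 * Returns -1 if the file cannot be read or has no mnt_id field.
 */
int read_fdinfo_mnt_id(int fd)
{
	char path[64], buf[4096];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_fdinfo_mnt_id(buf);
}
```

A caller that gets -1 back cannot tell which mount the descriptor belongs to, which is
exactly the situation described above.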

> You would see /etc/hosts, which is correct.
> 
> # ls -l /proc/16958/fd
> total 0
> lr-x------ 1 root root 64 Aug 15 14:08 0 -> pipe:[96434]
> l-wx------ 1 root root 64 Aug 15 14:08 1 -> pipe:[96435]
> l-wx------ 1 root root 64 Aug 15 14:08 2 -> pipe:[96436]
> lr-x------ 1 root root 64 Aug 15 14:08 3 -> /etc/hosts
> 
>  
> 
> 
>     And another thing -- if we do open("/proc/pid/map_files/<something>") what would we see in the
>     newly appeared /proc/pid/fd/$fd ?
> 
>     If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one, while /proc/pid/map_files/ show
>     "spoiled" paths, i.e. -- the ones from branches, then this is a kernel bug -- both directories
>     should behave the same way.
> 
> 
> You would see the correct path, /etc/hosts.
> 
> # ls -l /proc/16958/map_files 
> total 0
> lr-------- 1 root root 64 Aug 15 14:09 400000-4c0000 -> /var/lib/docker/aufs/diff/<ID>/a.out
> lr-------- 1 root root 64 Aug 15 14:09 6bf000-6c2000 -> /var/lib/docker/aufs/diff/<ID>/a.out
> lr-------- 1 root root 64 Aug 15 14:09 7f8f30eb7000-7f8f30eb8000 -> /etc/hosts
> 
> It's important to note that while both /proc/<PID>/fd and /proc/<PID>/map_files show the same correct
> path, as you see above, the executable file (a.out) shows the path within the AUFS branch which is
> wrong.  AUFS should not reveal its internals.  This also contradicts what smaps shows (see below), hence
> the need to "fix up" branch paths that are obtained by readlink()ing map_files entries.
> 
> # grep '^[0-9a-f]' /proc/16958/smaps 
> 00400000-004c0000 r-xp 00000000 00:25 62                                 /a.out
> 006bf000-006c2000 rw-p 000bf000 00:25 62                                 /a.out
> 006c2000-006c5000 rw-p 00000000 00:00 0 
> 02405000-02428000 rw-p 00000000 00:00 0                                  [heap]
> 7f8f30eb7000-7f8f30eb8000 r--p 00000000 08:01 311525                     /etc/hosts
> 7f8f30eb8000-7f8f30eb9000 rw-p 00000000 00:00 0 
> 7fff457f5000-7fff45816000 rw-p 00000000 00:00 0                          [stack]
> 7fff45980000-7fff45982000 r-xp 00000000 00:00 0                          [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

I'm confused, sorry :( So _some_ files from AUFS are shown with correct paths and _some_
with spoiled ones. And the "spoiled" paths you've seen so far are only executables found in
the map_files directory. Right?
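
Just to be sure we mean the same thing by "fix up", this is roughly the operation as I
understand it: strip the branch directory from a path obtained by readlink() so that it
becomes relative to the AUFS root. This is a sketch under my own assumptions, with a
hypothetical name strip_branch_prefix() and a single branch passed in by the caller; it
is not the fixup_aufs_path() from the patch.

```c
#include <string.h>

/*
 * Hypothetical sketch: if path starts with branch (a branch directory
 * given without a trailing slash), rewrite path in place so that it is
 * relative to the AUFS root.
 *
 * Returns the new length if the path was rewritten, 0 if it was left
 * untouched. A NULL branch means "no AUFS here" and is not an error.
 */
int strip_branch_prefix(char *path, const char *branch)
{
	size_t blen;

	if (!branch)
		return 0;

	blen = strlen(branch);

	/* Require a full path component match, not just a common prefix. */
	if (blen == 0 || strncmp(path, branch, blen) != 0 || path[blen] != '/')
		return 0;

	/* Shift the tail (including the terminating NUL) to the front. */
	memmove(path, path + blen, strlen(path + blen) + 1);
	return (int)strlen(path);
}
```

With the a.out entry from map_files above, the branch part would be dropped and /a.out
would remain, while /etc/hosts would be left as it is.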

>   
> 
> 
>     >
>     >
>     >     >     The vma fd is dumped in dump_filemap() and in that place we call
>     >     >     get_fd_mntid() for missing mount ID. Can we fixup the device there too?
>     >     >
>     >     >
>     >     > The way I handle it is like this:  If the link file descriptor points to an AUFS branch name,
>     >     > we replace it with a pathname from root.  Here is the call sequence from dump_filemap():
>     >     >
>     >     > dump_filemap()
>     >     >    dump_one_reg_file(lfd)
>     >     >       fill_fdlink(lfd)
>     >     >          read_fd_link(lfd)
>     >     >             readlink(lfd)               // returns path in branch
>     >     >          fixup_aufs_branch()     // replaces path in branch with path from root
>     >     >       check_path_remap()
>     >     >
>     >     > In other words, when we see a pathname in a branch, we replace it with a pathname from root
>     >     > as if we never saw the branch pathname.  When using a link file descriptor in fstat(), because
>     >     > the kernel returns the stat info of the pathname in branch, we use stat() with a pathname
>     >     > from the root instead of fstat().  If we didn't do this, we'd get different device/inode
>     >     > values and CRIU fails with an error message like "Unaccessible path opened 33:23, need 2049:53764".
>     >
>     >     So if you stat() an fd opened on AUFS you get dev:inode pair from AUFS, not from the
>     >     underlying ext4?
>     >
>     >
>     > Exactly.  That's why we need to always stat the file either from the AUFS root or from the branch where
>     > it really lives.  We cannot stat it from one path and expect to see the same dev:ino from a different path,
>     > although it's really the same file!
> 
>     So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device and inode, then we
>     can use this to distinguish AUFS files from non AUFS ones.
> 
>     BTW, we have similar problem with btrfs -- it reports "wrong" device number in some
>     places, so we have to use the phys_stat_dev_match() helper to compare devices, maybe
>     something similar can be done with AUFS.
> 
> 
> The right solution is to fix map_files.  Once map_files shows /foo/bar, instead of /path-to-branch/foo/bar,
> we won't need any "fix up" code.

If the answer to my above question is "yes", then this is a bug in AUFS, not
in map_files (since the links show correct paths for most of the files).

>     OK, can we rework this part like this -- we add the .post_parse callback on the fstype,
>     and once parse_mountinfo meets an fs with one set it puts the created mi into separate
>     list and walks it, calling ->post_parse(), after all the mount-infos get collected.
> 
> 
> Yes, this would work but the good news is that we no longer need the "reference" stuff (see below).

Great! :)

> Yes, I confirmed that "mnt: Don't delay external mount points" fixed the issue so I ripped
> out all "reference" code (yay!).

Cool!

> Please review/test the attached patch set, which I rebased to head before sending.  Once the
> final changes are made, I will send you a signed-off patch with a descriptive commit message.

So far I have only four comments; please find them inline.

> @@ -197,6 +199,20 @@ int fill_fdlink(int lfd, const struct fd_parms *p, struct fd_link *link)
>  		return -1;
>  	}
>  
> +	/*
> +	 * If the link file descriptor (lfd) points to a
> +	 * file in an AUFS branch, the pathname should be
> +	 * replaced with a pathname from root.
> +	 */
> +	if (opts.aufs) {
> +		int n = sizeof link->name - 1;
> +		n = fixup_aufs_path(&link->name[1], n, true);

Since you say that we only see spoiled paths in map_files, do we need
the path fixup here? Isn't the one performed while parsing the mappings
enough?

> +		if (n < 0)
> +			return -1;
> +		if (n > 0)
> +			len = n;
> +	}
> +
>  	link->len = len + 1;
>  	return 0;
>  }

> +int fixup_aufs_path(char *path, int size, bool chop)
> +{
> +	char rpath[PATH_MAX];
> +	int n;
> +	int blen;
> +
> +	if (aufs_branches == NULL) {
> +		pr_err("No aufs branches to search for %s\n", path);
> +		return -1;

Presumably, if we don't have AUFS we shouldn't fail :)

> +	}
> +

> +int parse_mountinfo_aufs_sbinfo(pid_t pid, char *sbinfo, int len)
> +{
> +	char line[PATH_MAX];
> +	char *fstype = NULL;
> +	char *opt = NULL;
> +	char *cp;
> +	int n, ret;
> +
> +	n = get_mntent_by_mountpoint(pid, "/", line, sizeof line);
> +	if (n < 0)
> +		return -1;
> +

Presumably, this would get fixed by the rework of the mountinfo parser, right?

> +int aufs_parse(struct mount_info *new)
> +{
> +	char *cp;
> +
> +	if (!opts.aufs_root || strcmp(new->mountpoint, "./"))

If we've met AUFS but no --aufs-root is specified, shouldn't we fail the dump?

> +		return 0;
> +
> +	cp = malloc(strlen(opts.aufs_root) + 1);
> +	if (!cp) {
> +		pr_err("Cannot allocate memory for %s\n", opts.aufs_root);
> +		return -1;
> +	}
> +	strcpy(cp, opts.aufs_root);
> +
> +	pr_debug("Replacing %s with %s\n", new->root, opts.aufs_root);
> +	free(new->root);
> +	new->root = cp;

What if we have more than one AUFS mountpoint with different roots? I'm
OK if we only support one, but then we should fail when a 2nd one is met.

> +
> +	return 0;
> +}
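
To illustrate the two checks I'm asking about in aufs_parse(), something along these
lines would do. This is only a sketch: check_aufs_mount() and the aufs_mount_seen flag
are hypothetical names, and whether failing is the right behaviour is still up for
discussion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: set once the first AUFS mount has been accepted. */
static bool aufs_mount_seen;

/*
 * Hypothetical check to run when an AUFS mount is met.
 * aufs_root is the value given by the user (NULL if none was given).
 * Returns 0 if the mount can be handled, -1 if the dump should fail.
 */
int check_aufs_mount(const char *aufs_root)
{
	if (!aufs_root)
		return -1;	/* AUFS met, but no root was specified */

	if (aufs_mount_seen)
		return -1;	/* only a single AUFS mount is supported */

	aufs_mount_seen = true;
	return 0;
}
```

The caller would print an error and abort the dump on -1 instead of silently returning 0.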

Thanks,
Pavel