[CRIU] AUFS Support in CRIU

Mon Aug 18 09:43:52 PDT 2014

On Mon, Aug 18, 2014 at 3:03 AM, Pavel Emelyanov <xemul at parallels.com>
wrote:

> On 08/16/2014 02:12 AM, Saied Kazemi wrote:
>
> >     > No, we cannot use mnt_id because it's uninitialized (-1).  Even if
> it were initialized, I
> >     > am not sure mnt_id alone would suffice to identify AUFS pathnames
> that should be replaced.
> >
> >     It's uninitialized, because CRIU hasn't read one in. I mean -- if we
> look at the fdinfo
> >     data of a file opened on AUFS, would the mnt_id value be the one
> from AUFS mount, or from
> >     the underlying FS mount?
> >
> >
> > I get -1 because my system, a vanilla Ubuntu 14.04 (kernel 3.13.0), does
> not have mnt_id in fdinfo.
> > So it's not safe to rely on mnt_id.
>
> BTW, you might have problems dumping Docker containers w/o that patch --
> this
> field is required to properly resolve multiple mount namespaces if they
> are met
> in a container.
>

Yes, this can become an issue if/when we start using containers with
multiple mount namespaces.

>
>
> > You would see /etc/hosts, which is correct.
> >
> > # ls -l /proc/16958/fd
> > total 0
> > lr-x------ 1 root root 64 Aug 15 14:08 0 -> pipe:[96434]
> > l-wx------ 1 root root 64 Aug 15 14:08 1 -> pipe:[96435]
> > l-wx------ 1 root root 64 Aug 15 14:08 2 -> pipe:[96436]
> > lr-x------ 1 root root 64 Aug 15 14:08 3 -> /etc/hosts
> >
> >
> >
> >
> >     And another thing -- if we do
> open("/proc/pid/map_files/<something>") what would we see in the
> >     newly appeared /proc/pid/fd/$fd ?
> >
> >     If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one,
> while /proc/pid/map_files/ show
> >     "spoiled" paths, i.e. -- the ones from branches, then this is a
> kernel bug -- both directories
> >     should behave the same way.
> >
> >
> > You would see the correct path, /etc/hosts.
> >
> > # ls -l /proc/16958/map_files
> > total 0
> > lr-------- 1 root root 64 Aug 15 14:09 400000-4c0000 ->
> /var/lib/docker/aufs/diff/<ID>/a.out
> > lr-------- 1 root root 64 Aug 15 14:09 6bf000-6c2000 ->
> /var/lib/docker/aufs/diff/<ID>/a.out
> > lr-------- 1 root root 64 Aug 15 14:09 7f8f30eb7000-7f8f30eb8000 ->
> /etc/hosts
> >
> > It's important to note that while both /proc/<PID>/fd and
> /proc/<PID>/map_files show the same correct
> > path, as you see above, the executable file (a.out) shows the path
> within the AUFS branch which is
> > wrong.  AUFS should not reveal its internals.  This also contradicts
> what smaps shows (see below), hence
> > the need to "fix up" branch paths that are obtained by readlink()ing
> map_files entries.
> >
> > # grep '^[0-9a-f]' /proc/16958/smaps
> > 00400000-004c0000 r-xp 00000000 00:25 62
>  /a.out
> > 006bf000-006c2000 rw-p 000bf000 00:25 62
>  /a.out
> > 006c2000-006c5000 rw-p 00000000 00:00 0
> > 02405000-02428000 rw-p 00000000 00:00 0
> [heap]
> > 7f8f30eb7000-7f8f30eb8000 r--p 00000000 08:01 311525
>  /etc/hosts
> > 7f8f30eb8000-7f8f30eb9000 rw-p 00000000 00:00 0
> > 7fff457f5000-7fff45816000 rw-p 00000000 00:00 0
> [stack]
> > 7fff45980000-7fff45982000 r-xp 00000000 00:00 0
> [vdso]
> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
> [vsyscall]
>
> I'm confused, sorry :( So _some_ files from AUFS are shown with correct
> paths and _some_
> with spoiled ones. And the "spoiled" paths you've seen so far are only
> execulables met in
> the map_files directory. Right?
>

Yes, you are right.

>
> >
> >
> >
> >     >
> >     >
> >     >     >     The vma fd is dumped in dump_filemap() and in that place
> we call
> >     >     >     get_fd_mntid() for missing mount ID. Can we fixup the
> device there too?
> >     >     >
> >     >     >
> >     >     > The way I handle it is like this:  If the link file
> descriptor points to an AUFS branch name,
> >     >     > we replace it with a pathname from root.  Here is the call
> sequence from dump_filemap():
> >     >     >
> >     >     > dump_filemap()
> >     >     >    dump_one_reg_file(lfd)
> >     >     >       fill_fdlink(lfd)
> >     >     >          read_fd_link(lfd)
> >     >     >             readlink(lfd)               // returns path in
> branch
> >     >     >          fixup_aufs_brnach()     // replaces path in branch
> with path from root
> >     >     >       check_path_remap()
> >     >     >
> >     >     > In other words, when we see a pathname in a branch, we
> replace it with a pathname from root
> >     >     > as if we never saw the branch pathname.  When using a link
> file descriptor in fstat(), because
> >     >     > the kernel returns the stat info of the pathname in branch,
> we use stat() with a pathname
> >     >     > from the root instead of fstat().  If we didn't do this,
> we'd get different device/inode
> >     >     > values and CRIU fails with an error message like
> "Unaccessible path opened 33:23, need 2049:53764".
> >     >
> >     >     So if you stat() an fd opened on AUFS you get dev:inode pair
> from AUFS, not from the
> >     >     underlying ext4?
> >     >
> >     >
> >     > Exactly.  That's why we need to always stat the file either from
> the AUFS root or from the branch where
> >     > it really lives.  We cannot stat it from one path and expect to
> see the same dev:ino from a different path,
> >     > although it's really the same file!
> >
> >     So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device
> and inode, then we
> >     can use this to distinguish AUFS files from non AUFS ones.
> >
> >     BTW, we have similar problem with btrfs -- it reports "wrong" device
> number in some
> >     places, so we have to use the phys_stat_dev_match() helper to
> compare devices, maybe
> >     something similar can be done with AUFS.
> >
> >
> > The right solution is to fix map_files.  Once map_files shows /foo/bar,
> instead of /path-to-branch/foo/bar,
> > we won't need any "fix up" code.
>
> If the answer to my above question "yes", then this is a bug from AUFS, not
> from map_files (since links show correct paths on most of the files).
>

Yes, I agree that this seems to be an AUFS bug but I haven't looked into
map_files and AUFS implementation to pinpoint the faulty code.

>
> >     OK, can we rework this part like this -- we add the .post_parse
> callback on the fstype,
> >     and once parse_mountinfo meets an fs with one set it puts the
> created mi into separate
> >     list and walks it, calling ->post_parse(), after all the mount-infos
> get collected.
> >
> >
> > Yes, this would work but the good news is that we no longer need the
> "reference" stuff (see below).
>
> Great! :)
>
> > Yes, I confirmed that "mnt: Don't delay external mount points" fixed the
> issue so I ripped
> > out all "reference" code (yay!).
>
> Cool!
>
> > Please review/test the attached patch set, which I rebased to head
> before sending.  Once the
> > final changes are made, I will send you a signed-off patch with a
> descriptive commit message.
>
> So far I have only four comments, please, find the inline.
>
>
> > @@ -197,6 +199,20 @@ int fill_fdlink(int lfd, const struct fd_parms *p,
> struct fd_link *link)
> >               return -1;
> >       }
> >
> > +     /*
> > +      * If the link file descriptor (lfd) points to a
> > +      * file in an AUFS branch, the pathname should be
> > +      * replaced with a pathname from root.
> > +      */
> > +     if (opts.aufs) {
> > +             int n = sizeof link->name - 1;
> > +             n = fixup_aufs_path(&link->name[1], n, true);
>
> Since you say, that we only see spoiled paths in map_files, do we need
> the path fixup here? Isn't the one performed in parsing the mappings
> not enough?
>

We need this logic here because the link->name is in an AUFS branch.  For
example, these are the paths that are passed to fill_fdlink() during my
test:

/dev/null
/var/lib/docker/aufs/diff/<ID>/bin/busybox
/var/lib/docker/aufs/diff/<ID>/lib/libuClibc-0.9.33.2.so
/var/lib/docker/aufs/diff/<ID>/lib/ld64-uClibc-0.9.33.2.so
/

> > +             if (n < 0)
> > +                     return -1;
> > +             if (n > 0)
> > +                     len = n;
> > +     }
> > +
> >       link->len = len + 1;
> >       return 0;
> >  }
>
> > +int fixup_aufs_path(char *path, int size, bool chop)
> > +{
> > +     char rpath[PATH_MAX];
> > +     int n;
> > +     int blen;
> > +
> > +     if (aufs_branches == NULL) {
> > +             pr_err("No aufs branches to search for %s\n", path);
> > +             return -1;
>
> Presumably, if we don't have AUFS we shouldn't fail :)
>

If we don't have AUFS, we should not specify --aufs and, therefore, we
should not be calling fixup_aufs_path().  So, if we get to
fixup_aufs_path() but don't have branch information, it's an error.

Alternatively, we can decide not to specify --aufs and call AUFS code
regardless, returning without error if there's no AUFS.  But I added --aufs
for performance reasons.  It's used in 3 places:

1. cr-dump.c/dump_one_task()
   - to collect AUFS branch pathnames and free them when done
2. files.c/fill_fdlink()
   - to replace pathnames in branches with pathnames from root
3. proc_parse.c/parse_smaps()
   - to get the path for stat() instead of fstat()

> > +     }
> > +
>
> > +int parse_mountinfo_aufs_sbinfo(pid_t pid, char *sbinfo, int len)
> > +{
> > +     char line[PATH_MAX];
> > +     char *fstype = NULL;
> > +     char *opt = NULL;
> > +     char *cp;
> > +     int n, ret;
> > +
> > +     n = get_mntent_by_mountpoint(pid, "/", line, sizeof line);
> > +     if (n < 0)
> > +             return -1;
> > +
>
> Presumably, this would get fixed by rework of mountinfo's parser, right?
>

I already reworked the callback in mountinfo's parser which was in the
previous patch (see proc_parse.c/aufs_parse()).  This is different as it's
not changing anything in the mountinfo entry.  It's just getting the sbinfo
string from mountinfo root entry to collect AUFS branch information.

dump_one_task()
   if (opts.aufs) parse_aufs_branches()
      parse_mountinf_aufs_sbinfo()
         get_mntent_by_mountpoint()

It may be possible to add another call back, .post_parse, and rework
mountinfo parsing.  But .post_parse will only be used by used by AUFS and
this code is temporary anyway.  It will go away as soon as AUFS bug is
fixed.  So, if you're OK, let's keep it like this.  For non-AUFS
filesystems, this code isn't even called.

> > +int aufs_parse(struct mount_info *new)
> > +{
> > +     char *cp;
> > +
> > +     if (!opts.aufs_root || strcmp(new->mountpoint, "./"))
>
> If we've met AUFS, but no --aufs-root specified, shouldn't we fail the
> dump?
>

The reason it doesn't return an error is that as soon as the mountinfo
entry for root is fixed (i.e., shows the root pathname instead of just /),
we won't have to specify --aufs-root anymore and shouldn't fail in
aufs_parse().

>
> > +             return 0;
> > +
> > +     cp = malloc(strlen(opts.aufs_root) + 1);
> > +     if (!cp) {
> > +             pr_err("Cannot allocate memory for %s\n", opts.aufs_root);
> > +             return -1;
> > +     }
> > +     strcpy(cp, opts.aufs_root);
> > +
> > +     pr_debug("Replacing %s with %s\n", new->root, opts.aufs_root);
> > +     free(new->root);
> > +     new->root = cp;
>
> What if we have more than one AUFS mountpoint with different roots? I'm
> OK if we only support one, but then we should fail the 2nd mountpoint met.
>

Here, we're replacing the root of the container which by definition is only
1.  But if different processes in the process tree have different AUFS
mounts, their respective branch pathnames will be handed one by one in
parse_aufs_branches() when called from dump_one_task().

>
> > +
> > +     return 0;
> > +}
>
>
> Thanks,
> Pavel
>
>
Thanks again for your feedback.  I hope the answers are clear.  If it's
easier or more productive, I am available on video chat to discuss in
greater detail.  My schedule is flexible, so time difference shouldn't be a
problem.

Cheers!

--Saied
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140818/aaac76bd/attachment-0001.html>