[CRIU] AUFS Support in CRIU

Fri Aug 15 02:48:52 PDT 2014

On 08/15/2014 01:54 AM, Saied Kazemi wrote:
> 
> 
> 
> On Thu, Aug 14, 2014 at 1:01 PM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com>> wrote:
> 
>     On 08/14/2014 11:22 PM, Saied Kazemi wrote:
>     >
>     >
>     >
>     > On Wed, Aug 13, 2014 at 1:13 AM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com> <mailto:xemul at parallels.com <mailto:xemul at parallels.com>>> wrote:
>     >
>     >     On 08/13/2014 04:29 AM, Saied Kazemi wrote:
>     >
>     >     Hi, Saied.
>     >
>     >     > Here is a new patch.  I rebased to the head (commit ded04267f8) and cleaned up the code a
>     >     > bit more.  Please use the attached patch instead of the one I sent yesterday.
>     >     >
>     >     > Note that I had to delete 3 lines from cgroup.c for properties that I don't have on my
>     >     > Ubuntu 14.04.  Also, please note that you have to add --manage-cgroups to criu command
>     >     > line for both dump and restore.
>     >
>     >     Thank you for looking into this. I'm not familiar with AUFS at all, thus I
>     >     only have comments about the way AUFS parsing/fixing code is integrated into
>     >     the rest of the criu code.
>     >
>     >
>     > I am not familiar with AUFS internals either.  I just looked at how it's set up and what information it passes through /proc.
>     >
>     >
>     >
>     >
>     >     Please, see my comments inline.
>     >
>     >     > @@ -197,6 +199,19 @@ int fill_fdlink(int lfd, const struct fd_parms *p, struct fd_link *link)
>     >     >               return -1;
>     >     >       }
>     >     >
>     >     > +     /*
>     >     > +      * For AUFS support, we need to replace absolute
>     >     > +      * branch pathnames with relative pathnames from root.
>     >     > +      */
>     >     > +     if (opts.aufs) {
>     >
>     >     At this point we have the statfs() information. Can we rely on the
>     >     fs_type being aufs magic to check that the file is aufs one?
>     >
>     >
>     > No, because depending on how we end up in fill_fdlink(), p->fstype may be NULL.
>     > Besides, even when set, its fstyp isn't aufs.  It'd be something like ext4 because
>     > the branch is in ext4.
> 
>     Damn :( So it's not a "real" filesystem, it just redirects all calls to some
>     other, underlying one, right? And if I opened a file which is on AUFS mount,
>     I will end up in some other FS.
> 
> 
> Yes, it's not a "real" filesystem like ext4 but this can actually be a feature because 
> you can stack different layers from different filesystems to present a consolidated fs
> view to the user.
> 
> 
> 
>     BTW, can we re-use the mnt_id for that? If I call open() on some path that
>     is on AUFS mount point, would the mnt_id of a file be the AUFS's one?
> 
> 
> No, we cannot use mnt_id because it's uninitialized (-1).  Even if it were initialized, I
> am not sure mnt_id alone would suffice to identify AUFS pathnames that should be replaced.

It's uninitialized, because CRIU hasn't read one in. I mean -- if we look at the fdinfo
data of a file opened on AUFS, would the mnt_id value be the one from AUFS mount, or from 
the underlying FS mount?

> Also, please note that we're handling the files that the process is executing (e.g., 
> /bin/busybox, /lib/libuClibc-0.9.33.2.so <http://libuClibc-0.9.33.2.so>,
> /lib/ld64-uClibc-0.9.33.2.so <http://ld64-uClibc-0.9.33.2.so>), not the files that the
> process has opened.  When we get to fill_fdlink() for these files, mnt_id is -1.
> 
> 
> 
>     >     > +             int n = sizeof link->name - 1;
>     >     > +             n = fixup_aufs_path(&link->name[1], n, true);
>     >     > +             if (n < 0)
>     >     > +                     return -1;
>     >     > +             if (n > 0)
>     >     > +                     len = n;
>     >     > +     }
>     >     > +
>     >     >       link->len = len + 1;
>     >     >       return 0;
>     >
>     >     > @@ -450,12 +453,24 @@ int parse_smaps(pid_t pid, struct vm_area_list *vma_area_list, bool use_map_file
>     >     >                       vma_area->st = prev->st;
>     >     >               } else if (vma_area->vm_file_fd >= 0) {
>     >     >                       struct stat *st_buf;
>     >     > +                     char *f;
>     >     >
>     >     >                       st_buf = vma_area->st = xmalloc(sizeof(*st_buf));
>     >     >                       if (!st_buf)
>     >     >                               goto err;
>     >     >
>     >     > -                     if (fstat(vma_area->vm_file_fd, st_buf) < 0) {
>     >     > +                     /*
>     >     > +                      * For AUFS support, we cannot fstat() file a descriptor that
>     >     > +                      * is a symbolic link to a branch.  Instead, we obtain the
>     >     > +                      * pathname of the file from the root and use stat().
>     >     > +                      */
>     >     > +                     if (opts.aufs && (f = fixup_aufs_fd_path(vma_area->vm_file_fd))) {
>     >
>     >     Would this work if the vm_file_fd sits in another mount namespace? Or
>     >     the path reported by fixup_aufs_fd_path() is always in the criu's one?
>     >
>     >
>     > I think it should work, although I don't know how to force it in a different mount namespace in
>     > the tests that I am running.  Can you whip up an example?
> 
>     If you start a Docker container, all tasks in it will automatically happen
>     in a mount namespace. You can check this by comparing where the /proc/pid/ns/mnt
>     links point for a task in container and for e.g. your shell outside of it.
> 
> 
> Yes, when you do "docker run -i busybox:latest /bin/sh -i", /bin/sh comes from the container's mount 
> namespace.  I am not sure how to get /bin/sh from a mount namespace other than the container's.

I mean -- if we meet the /aufs_branch/foo/bar file in /proc/pid/map_files, is this path
we see visible from our mount namespace, or from the container's mount namespace?

> 
> 
>     >     And another question -- do we need the fixed stat buffer that early?
>     >
>     >
>     > I am doing this exactly at the same time that CRIU is doing it (i.e., the fstat() is original code).
>     > I'm just using stat() instead of fstat(), so effectively it's the same logic/time as before.
> 
>     OK. BTW, you say, that /proc/pid/map_files/ links point to "branches" paths.
>     How about the /proc/pid/fd/ ones? Do they show paths from branches too?
> 
> 
> No, symlinks in /proc/pid/fd/ point to files relative to the root.

I.e. they are seen as if they are already fixup-ed by your code? I mean -- if I open /etc/hosts
and map /etc/hosts from container, would I see /etc/hosts in /proc/pid/fd and /val/lib/.../etc/hosts
in /proc/pid/map_files?

And another thing -- if we do open("/proc/pid/map_files/<something>") what would we see in the
newly appeared /proc/pid/fd/$fd ?

If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one, while /proc/pid/map_files/ show
"spoiled" paths, i.e. -- the ones from branches, then this is a kernel bug -- both directories 
should behave the same way.

> 
> 
>     >     The vma fd is dumped in dump_filemap() and in that place we call
>     >     get_fd_mntid() for missing mount ID. Can we fixup the device there too?
>     >
>     >
>     > The way I handle it is like this:  If the link file descriptor points to an AUFS branch name,
>     > we replace it with a pathname from root.  Here is the call sequence from dump_filemap():
>     >
>     > dump_filemap()
>     >    dump_one_reg_file(lfd)
>     >       fill_fdlink(lfd)
>     >          read_fd_link(lfd)
>     >             readlink(lfd)               // returns path in branch
>     >          fixup_aufs_brnach()     // replaces path in branch with path from root
>     >       check_path_remap()
>     >
>     > In other words, when we see a pathname in a branch, we replace it with a pathname from root
>     > as if we never saw the branch pathname.  When using a link file descriptor in fstat(), because
>     > the kernel returns the stat info of the pathname in branch, we use stat() with a pathname
>     > from the root instead of fstat().  If we didn't do this, we'd get different device/inode
>     > values and CRIU fails with an error message like "Unaccessible path opened 33:23, need 2049:53764".
> 
>     So if you stat() an fd opened on AUFS you get dev:inode pair from AUFS, not from the
>     underlying ext4?
> 
> 
> Exactly.  That's why we need to always stat the file either from the AUFS root or from the branch where
> it really lives.  We cannot stat it from one path and expect to see the same dev:ino from a different path,
> although it's really the same file!

So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device and inode, then we
can use this to distinguish AUFS files from non AUFS ones.

BTW, we have similar problem with btrfs -- it reports "wrong" device number in some
places, so we have to use the phys_stat_dev_match() helper to compare devices, maybe
something similar can be done with AUFS.

>     >     > +                             if (stat(f, st_buf) < 0) {
>     >     > +                                     pr_perror("Failed fstat on %d's map %lu (%s)",
>     >     > +                                             pid, start, f);
>     >     > +                                     goto err;
>     >     > +                             }
>     >     > +                     } else if (fstat(vma_area->vm_file_fd, st_buf) < 0) {
>     >     >                               pr_perror("Failed fstat on %d's map %lu", pid, start);
>     >     >                               goto err;
>     >     >                       }
>     >
>     >     > @@ -921,6 +942,16 @@ static int parse_mountinfo_ent(char *str, struct mount_info *new)
>     >     >       if (ret != 3)
>     >     >               return -1;
>     >     >
>     >     > +     /* see comments in sysfs_parse.c */
>     >     > +     if (opts.aufs && !strcmp(new->mountpoint, "./")) {
>     >     > +             if (strcmp(fstype, "aufs")) {
>     >
>     >     Can we reuse the fstype->parse() callback for this? This one gets called
>     >     in parse_mountinfo() in a loop after parse_mountinfo_ent().
>     >
>     >
>     > Yes, we can and I changed the code accordingly (added aufs to fstypes[]).  Please use the new patch.
> 
>     OK.
> 
>     >     > +                     pr_err("Expected fstype aufs got %s\n", fstype);
>     >     > +                     return -1;
>     >     > +             }
>     >     > +             if (fixup_src_opt(&new->source, &opt) < 0)
>     >     > +                     return -1;
>     >     > +     }
>     >     > +
>     >     >       ret = -1;
>     >     >       new->fstype = find_fstype_by_name(fstype);
>     >     >
>     >
>     >     > +/*
>     >     > + * Copy the line in process's mountinfo file that corresponds
>     >     > + * to the mountpoint specified by the mntpoint argument.  Return
>     >     > + * the number of characters parsed in the line, or -1 on error.
>     >     > + */
>     >     > +int get_mountinfo_by_mountpoint(pid_t pid, char *mntpoint, char *line, int linelen)
>     >
>     >     I do not fully understand why this mountpoint->something-else parser
>     >     is required. Can we put all the code parsing aufs-specific stuff into
>     >     the fstype->parse() callback I mentioned above? When criu starts
>     >     parsing mount namespaces (it does this before doing anything else)
>     >     this info will be parsed and stored into mount_info->private. Later,
>     >     when we need to e.g. fixup file paths we can get file->mnt_id, get
>     >     mount_info by it, then check the fs being aufs and fix the path
>     >     according to the mount_info->private data we have. Would this work?
>     >
>     >
>     > I am now using the parse callback that you suggested but we need the "reference" values
>     > *before* parse_mountinfo() is called.  Otherwise, when we call the parse callback, we
>     > won't know what values to replace with.
> 
>     Ah, I see (I think). So by the time you get to parsing the AUFS's entry,
>     you need to know the mount infos from other mountpoints which may not yet
>     be parsed, am I right with that?
> 
> 
> You're right.  That's why we call parse_aufs_reference() in collect_mntinfo() before calling parse_mountinfo().

OK, can we rework this part like this -- we add the .post_parse callback on the fstype,
and once parse_mountinfo meets an fs with one set it puts the created mi into separate
list and walks it, calling ->post_parse(), after all the mount-infos get collected.

>  
> 
> 
>     > That said, I noticed that with my latest rebase there is apparently no longer a need to
>     > specify --aufs-ref.  This is great as we can get rid of the code fixing up the mountinfo
>     > root entry (note that we'd still need to replace / with the --aufs-root).
>     >
>     > Since I am not familiar with the internals of CRIU as to how it builds its bind mount
>     > list, could you please review this part of the code?  As mentioned before, it used to
>     > be that without --aufs-ref, mounts_equal() would return false, in my test cases, for
>     > /etc/hosts, /etc/hostname, and /etc/resolv.conf because their dev, fstype, source, and
>     > options were different.
> 
>     So, the /etc/hosts is a mountpoint, whose root (the original file) is somewhere
>     outside of the container, right? And the /etc/hosts file itself (i.e. -- before
>     someone bind-mounts the external file on it) resides on AUFS, am I correct?
> 
> 
> Yes to both questions.  You can see where /etc/hosts lives from the --ext-mount-map option.

OK, thanks for explanation.

>     > As a result, these files would not be added to the mnt_bind
>     > list in collect_shared() and restore would fail with the error message "A few mount
>     > points can't be mounted."  I am puzzled that restore is now successful without --aufs-ref.
>     > I need to look into this deeper myself but would really appreciate your thoughts and feedback.
> 
>     I'll try :) Can you shed more light on the comment where you say that you
>     replace device numbers, sources and options. AFAIU once you have AUFS's
>     device, source and opt, you replace it with the underlying "branch"'s
>     device, source and opt, do I get it correctly?
> 
> 
> When I investigated the cause of "A few mount points can't be mounted" during restore, I noticed
> that external bind mounts were not added to the mnt_bind list in collect_shared() because 
> mounts_equal() was never true for these files when their mountinfo entry was compared with the root mountinfo entry.
> 
>                 /* Search bind-mounts */
>                 if (list_empty(&m->mnt_bind)) {
>                         /*
>                          * A first mounted point will be set up as a source point
>                          * for others. Look at propagate_mount()
>                          */
>                         for (t = m->next; t; t = t->next) {
>                                 if (mounts_equal(m, t, true))
>                                         list_add(&t->mnt_bind, &m->mnt_bind);
>                         }
>                 }
> 
> So, I "fixed up" the root mountinfo entry to have the same values as external bind mounts, effectively
> forcing a match in mounts_equal(), and restore worked.
> 
> That said, this is not a good solution at all as it only works when all external bind mounts come from
> the same device and fiilesystem.  But you can potentially have multiple external bind mounts each coming
> from a different device and filesystem, with different options.  I need your help in addressing this issue.
> 
> Finally, to complicate (or simplify?) things, after my latest rebase, I noticed that I don't have to fix
> up the root mountinfo entry anymore to get the external bind mounts mounted!!  Somehow, all external bind
> mounts make it to the mnt_bind list.  I do not know how to explain this because obviously mounts_equal() 
> returns false if any of devices, filesystems, sources, or options are different.
> 
>         if (mi->s_dev != c->s_dev ||
>             c->fstype != mi->fstype ||
>             strcmp(c->source, mi->source) ||
>             strcmp(c->options, mi->options))
>                 return false;
>  
> Could it be related to some of the recent mount-related commits?
> For example, "mnt: Don't delay external mount points" 8b019e0bb4?

For external bind mounts -- yes, it can. Can you check this? But I have a feeling
that I don't understand the problem with mounts correctly.

Thanks,
Pavel