[CRIU] AUFS Support in CRIU

Saied Kazemi saied at google.com
Fri Aug 15 15:12:43 PDT 2014


On Fri, Aug 15, 2014 at 2:48 AM, Pavel Emelyanov <xemul at parallels.com>
wrote:

> On 08/15/2014 01:54 AM, Saied Kazemi wrote:
> >
> >
> >
> > On Thu, Aug 14, 2014 at 1:01 PM, Pavel Emelyanov <xemul at parallels.com
> <mailto:xemul at parallels.com>> wrote:
> >
> >     On 08/14/2014 11:22 PM, Saied Kazemi wrote:
> >     >
> >     >
> >     >
> >     > On Wed, Aug 13, 2014 at 1:13 AM, Pavel Emelyanov <
> xemul at parallels.com <mailto:xemul at parallels.com> <mailto:
> xemul at parallels.com <mailto:xemul at parallels.com>>> wrote:
> >     >
> >     >     On 08/13/2014 04:29 AM, Saied Kazemi wrote:
> >     >
> >     >     Hi, Saied.
> >     >
> >     >     > Here is a new patch.  I rebased to the head (commit
> ded04267f8) and cleaned up the code a
> >     >     > bit more.  Please use the attached patch instead of the one
> I sent yesterday.
> >     >     >
> >     >     > Note that I had to delete 3 lines from cgroup.c for
> properties that I don't have on my
> >     >     > Ubuntu 14.04.  Also, please note that you have to add
> --manage-cgroups to criu command
> >     >     > line for both dump and restore.
> >     >
> >     >     Thank you for looking into this. I'm not familiar with AUFS at
> all, thus I
> >     >     only have comments about the way AUFS parsing/fixing code is
> integrated into
> >     >     the rest of the criu code.
> >     >
> >     >
> >     > I am not familiar with AUFS internals either.  I just looked at
> how it's set up and what information it passes through /proc.
> >     >
> >     >
> >     >
> >     >
> >     >     Please, see my comments inline.
> >     >
> >     >     > @@ -197,6 +199,19 @@ int fill_fdlink(int lfd, const struct
> fd_parms *p, struct fd_link *link)
> >     >     >               return -1;
> >     >     >       }
> >     >     >
> >     >     > +     /*
> >     >     > +      * For AUFS support, we need to replace absolute
> >     >     > +      * branch pathnames with relative pathnames from root.
> >     >     > +      */
> >     >     > +     if (opts.aufs) {
> >     >
> >     >     At this point we have the statfs() information. Can we rely on
> the
> >     >     fs_type being aufs magic to check that the file is aufs one?
> >     >
> >     >
> >     > No, because depending on how we end up in fill_fdlink(), p->fstype
> may be NULL.
> >     > Besides, even when set, its fstyp isn't aufs.  It'd be something
> like ext4 because
> >     > the branch is in ext4.
> >
> >     Damn :( So it's not a "real" filesystem, it just redirects all calls
> to some
> >     other, underlying one, right? And if I opened a file which is on
> AUFS mount,
> >     I will end up in some other FS.
> >
> >
> > Yes, it's not a "real" filesystem like ext4 but this can actually be a
> feature because
> > you can stack different layers from different filesystems to present a
> consolidated fs
> > view to the user.
> >
> >
> >
> >     BTW, can we re-use the mnt_id for that? If I call open() on some
> path that
> >     is on AUFS mount point, would the mnt_id of a file be the AUFS's one?
> >
> >
> > No, we cannot use mnt_id because it's uninitialized (-1).  Even if it
> were initialized, I
> > am not sure mnt_id alone would suffice to identify AUFS pathnames that
> should be replaced.
>
> It's uninitialized, because CRIU hasn't read one in. I mean -- if we look
> at the fdinfo
> data of a file opened on AUFS, would the mnt_id value be the one from AUFS
> mount, or from
> the underlying FS mount?
>

I get -1 because my system, a vanilla Ubuntu 14.04 (kernel 3.13.0), does
not have mnt_id in fdinfo.  So it's not safe to rely on mnt_id.



>
> > Also, please note that we're handling the files that the process is
> executing (e.g.,
> > /bin/busybox, /lib/libuClibc-0.9.33.2.so <http://libuClibc-0.9.33.2.so>,
> > /lib/ld64-uClibc-0.9.33.2.so <http://ld64-uClibc-0.9.33.2.so>), not the
> files that the
> > process has opened.  When we get to fill_fdlink() for these files,
> mnt_id is -1.
> >
> >
> >
> >     >     > +             int n = sizeof link->name - 1;
> >     >     > +             n = fixup_aufs_path(&link->name[1], n, true);
> >     >     > +             if (n < 0)
> >     >     > +                     return -1;
> >     >     > +             if (n > 0)
> >     >     > +                     len = n;
> >     >     > +     }
> >     >     > +
> >     >     >       link->len = len + 1;
> >     >     >       return 0;
> >     >
> >     >     > @@ -450,12 +453,24 @@ int parse_smaps(pid_t pid, struct
> vm_area_list *vma_area_list, bool use_map_file
> >     >     >                       vma_area->st = prev->st;
> >     >     >               } else if (vma_area->vm_file_fd >= 0) {
> >     >     >                       struct stat *st_buf;
> >     >     > +                     char *f;
> >     >     >
> >     >     >                       st_buf = vma_area->st =
> xmalloc(sizeof(*st_buf));
> >     >     >                       if (!st_buf)
> >     >     >                               goto err;
> >     >     >
> >     >     > -                     if (fstat(vma_area->vm_file_fd,
> st_buf) < 0) {
> >     >     > +                     /*
> >     >     > +                      * For AUFS support, we cannot fstat()
> file a descriptor that
> >     >     > +                      * is a symbolic link to a branch.
> Instead, we obtain the
> >     >     > +                      * pathname of the file from the root
> and use stat().
> >     >     > +                      */
> >     >     > +                     if (opts.aufs && (f =
> fixup_aufs_fd_path(vma_area->vm_file_fd))) {
> >     >
> >     >     Would this work if the vm_file_fd sits in another mount
> namespace? Or
> >     >     the path reported by fixup_aufs_fd_path() is always in the
> criu's one?
> >     >
> >     >
> >     > I think it should work, although I don't know how to force it in a
> different mount namespace in
> >     > the tests that I am running.  Can you whip up an example?
> >
> >     If you start a Docker container, all tasks in it will automatically
> happen
> >     in a mount namespace. You can check this by comparing where the
> /proc/pid/ns/mnt
> >     links point for a task in container and for e.g. your shell outside
> of it.
> >
> >
> > Yes, when you do "docker run -i busybox:latest /bin/sh -i", /bin/sh
> comes from the container's mount
> > namespace.  I am not sure how to get /bin/sh from a mount namespace
> other than the container's.
>
> I mean -- if we meet the /aufs_branch/foo/bar file in /proc/pid/map_files,
> is this path
> we see visible from our mount namespace, or from the container's mount
> namespace?
>

The /aufs_branch is not visible from the container's mount namespace;
/foo/bar is.



>
> >
> >
> >     >     And another question -- do we need the fixed stat buffer that
> early?
> >     >
> >     >
> >     > I am doing this exactly at the same time that CRIU is doing it
> (i.e., the fstat() is original code).
> >     > I'm just using stat() instead of fstat(), so effectively it's the
> same logic/time as before.
> >
> >     OK. BTW, you say, that /proc/pid/map_files/ links point to
> "branches" paths.
> >     How about the /proc/pid/fd/ ones? Do they show paths from branches
> too?
> >
> >
> > No, symlinks in /proc/pid/fd/ point to files relative to the root.
>
> I.e. they are seen as if they are already fixup-ed by your code? I mean --
> if I open /etc/hosts
> and map /etc/hosts from container, would I see /etc/hosts in /proc/pid/fd
> and /val/lib/.../etc/hosts
> in /proc/pid/map_files?
>

You would see /etc/hosts, which is correct.

# ls -l /proc/16958/fd
total 0
lr-x------ 1 root root 64 Aug 15 14:08 0 -> pipe:[96434]
l-wx------ 1 root root 64 Aug 15 14:08 1 -> pipe:[96435]
l-wx------ 1 root root 64 Aug 15 14:08 2 -> pipe:[96436]
lr-x------ 1 root root 64 Aug 15 14:08 3 -> /etc/hosts



>
> And another thing -- if we do open("/proc/pid/map_files/<something>") what
> would we see in the
> newly appeared /proc/pid/fd/$fd ?
>
> If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one, while
> /proc/pid/map_files/ show
> "spoiled" paths, i.e. -- the ones from branches, then this is a kernel bug
> -- both directories
> should behave the same way.
>

You would see the correct path, /etc/hosts.

# ls -l /proc/16958/map_files
total 0
lr-------- 1 root root 64 Aug 15 14:09 400000-4c0000 ->
/var/lib/docker/aufs/diff/<ID>/a.out
lr-------- 1 root root 64 Aug 15 14:09 6bf000-6c2000 ->
/var/lib/docker/aufs/diff/<ID>/a.out
lr-------- 1 root root 64 Aug 15 14:09 7f8f30eb7000-7f8f30eb8000 ->
/etc/hosts

It's important to note that while both /proc/<PID>/fd and
/proc/<PID>/map_files show the same correct path, as you see above, the
executable file (a.out) shows the path within the AUFS branch which is
wrong.  AUFS should not reveal its internals.  This also contradicts what
smaps shows (see below), hence the need to "fix up" branch paths that are
obtained by readlink()ing map_files entries.

# grep '^[0-9a-f]' /proc/16958/smaps
00400000-004c0000 r-xp 00000000 00:25 62
/a.out
006bf000-006c2000 rw-p 000bf000 00:25 62
/a.out
006c2000-006c5000 rw-p 00000000 00:00 0
02405000-02428000 rw-p 00000000 00:00 0
 [heap]
7f8f30eb7000-7f8f30eb8000 r--p 00000000 08:01 311525
/etc/hosts
7f8f30eb8000-7f8f30eb9000 rw-p 00000000 00:00 0
7fff457f5000-7fff45816000 rw-p 00000000 00:00 0
 [stack]
7fff45980000-7fff45982000 r-xp 00000000 00:00 0
 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
 [vsyscall]



>
> >
> >
> >     >     The vma fd is dumped in dump_filemap() and in that place we
> call
> >     >     get_fd_mntid() for missing mount ID. Can we fixup the device
> there too?
> >     >
> >     >
> >     > The way I handle it is like this:  If the link file descriptor
> points to an AUFS branch name,
> >     > we replace it with a pathname from root.  Here is the call
> sequence from dump_filemap():
> >     >
> >     > dump_filemap()
> >     >    dump_one_reg_file(lfd)
> >     >       fill_fdlink(lfd)
> >     >          read_fd_link(lfd)
> >     >             readlink(lfd)               // returns path in branch
> >     >          fixup_aufs_brnach()     // replaces path in branch with
> path from root
> >     >       check_path_remap()
> >     >
> >     > In other words, when we see a pathname in a branch, we replace it
> with a pathname from root
> >     > as if we never saw the branch pathname.  When using a link file
> descriptor in fstat(), because
> >     > the kernel returns the stat info of the pathname in branch, we use
> stat() with a pathname
> >     > from the root instead of fstat().  If we didn't do this, we'd get
> different device/inode
> >     > values and CRIU fails with an error message like "Unaccessible
> path opened 33:23, need 2049:53764".
> >
> >     So if you stat() an fd opened on AUFS you get dev:inode pair from
> AUFS, not from the
> >     underlying ext4?
> >
> >
> > Exactly.  That's why we need to always stat the file either from the
> AUFS root or from the branch where
> > it really lives.  We cannot stat it from one path and expect to see the
> same dev:ino from a different path,
> > although it's really the same file!
>
> So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device and
> inode, then we
> can use this to distinguish AUFS files from non AUFS ones.
>
> BTW, we have similar problem with btrfs -- it reports "wrong" device
> number in some
> places, so we have to use the phys_stat_dev_match() helper to compare
> devices, maybe
> something similar can be done with AUFS.
>

The right solution is to fix map_files.  Once map_files shows /foo/bar,
instead of /path-to-branch/foo/bar, we won't need any "fix up" code.



>
> >     >     > +                             if (stat(f, st_buf) < 0) {
> >     >     > +                                     pr_perror("Failed
> fstat on %d's map %lu (%s)",
> >     >     > +                                             pid, start, f);
> >     >     > +                                     goto err;
> >     >     > +                             }
> >     >     > +                     } else if (fstat(vma_area->vm_file_fd,
> st_buf) < 0) {
> >     >     >                               pr_perror("Failed fstat on
> %d's map %lu", pid, start);
> >     >     >                               goto err;
> >     >     >                       }
> >     >
> >     >     > @@ -921,6 +942,16 @@ static int parse_mountinfo_ent(char
> *str, struct mount_info *new)
> >     >     >       if (ret != 3)
> >     >     >               return -1;
> >     >     >
> >     >     > +     /* see comments in sysfs_parse.c */
> >     >     > +     if (opts.aufs && !strcmp(new->mountpoint, "./")) {
> >     >     > +             if (strcmp(fstype, "aufs")) {
> >     >
> >     >     Can we reuse the fstype->parse() callback for this? This one
> gets called
> >     >     in parse_mountinfo() in a loop after parse_mountinfo_ent().
> >     >
> >     >
> >     > Yes, we can and I changed the code accordingly (added aufs to
> fstypes[]).  Please use the new patch.
> >
> >     OK.
> >
> >     >     > +                     pr_err("Expected fstype aufs got
> %s\n", fstype);
> >     >     > +                     return -1;
> >     >     > +             }
> >     >     > +             if (fixup_src_opt(&new->source, &opt) < 0)
> >     >     > +                     return -1;
> >     >     > +     }
> >     >     > +
> >     >     >       ret = -1;
> >     >     >       new->fstype = find_fstype_by_name(fstype);
> >     >     >
> >     >
> >     >     > +/*
> >     >     > + * Copy the line in process's mountinfo file that
> corresponds
> >     >     > + * to the mountpoint specified by the mntpoint argument.
> Return
> >     >     > + * the number of characters parsed in the line, or -1 on
> error.
> >     >     > + */
> >     >     > +int get_mountinfo_by_mountpoint(pid_t pid, char *mntpoint,
> char *line, int linelen)
> >     >
> >     >     I do not fully understand why this mountpoint->something-else
> parser
> >     >     is required. Can we put all the code parsing aufs-specific
> stuff into
> >     >     the fstype->parse() callback I mentioned above? When criu
> starts
> >     >     parsing mount namespaces (it does this before doing anything
> else)
> >     >     this info will be parsed and stored into mount_info->private.
> Later,
> >     >     when we need to e.g. fixup file paths we can get file->mnt_id,
> get
> >     >     mount_info by it, then check the fs being aufs and fix the path
> >     >     according to the mount_info->private data we have. Would this
> work?
> >     >
> >     >
> >     > I am now using the parse callback that you suggested but we need
> the "reference" values
> >     > *before* parse_mountinfo() is called.  Otherwise, when we call the
> parse callback, we
> >     > won't know what values to replace with.
> >
> >     Ah, I see (I think). So by the time you get to parsing the AUFS's
> entry,
> >     you need to know the mount infos from other mountpoints which may
> not yet
> >     be parsed, am I right with that?
> >
> >
> > You're right.  That's why we call parse_aufs_reference() in
> collect_mntinfo() before calling parse_mountinfo().
>
> OK, can we rework this part like this -- we add the .post_parse callback
> on the fstype,
> and once parse_mountinfo meets an fs with one set it puts the created mi
> into separate
> list and walks it, calling ->post_parse(), after all the mount-infos get
> collected.
>

Yes, this would work but the good news is that we no longer need the
"reference" stuff (see below).



> >
> >
> >
> >     > That said, I noticed that with my latest rebase there is
> apparently no longer a need to
> >     > specify --aufs-ref.  This is great as we can get rid of the code
> fixing up the mountinfo
> >     > root entry (note that we'd still need to replace / with the
> --aufs-root).
> >     >
> >     > Since I am not familiar with the internals of CRIU as to how it
> builds its bind mount
> >     > list, could you please review this part of the code?  As mentioned
> before, it used to
> >     > be that without --aufs-ref, mounts_equal() would return false, in
> my test cases, for
> >     > /etc/hosts, /etc/hostname, and /etc/resolv.conf because their dev,
> fstype, source, and
> >     > options were different.
> >
> >     So, the /etc/hosts is a mountpoint, whose root (the original file)
> is somewhere
> >     outside of the container, right? And the /etc/hosts file itself
> (i.e. -- before
> >     someone bind-mounts the external file on it) resides on AUFS, am I
> correct?
> >
> >
> > Yes to both questions.  You can see where /etc/hosts lives from the
> --ext-mount-map option.
>
> OK, thanks for explanation.
>
> >     > As a result, these files would not be added to the mnt_bind
> >     > list in collect_shared() and restore would fail with the error
> message "A few mount
> >     > points can't be mounted."  I am puzzled that restore is now
> successful without --aufs-ref.
> >     > I need to look into this deeper myself but would really appreciate
> your thoughts and feedback.
> >
> >     I'll try :) Can you shed more light on the comment where you say
> that you
> >     replace device numbers, sources and options. AFAIU once you have
> AUFS's
> >     device, source and opt, you replace it with the underlying "branch"'s
> >     device, source and opt, do I get it correctly?
> >
> >
> > When I investigated the cause of "A few mount points can't be mounted"
> during restore, I noticed
> > that external bind mounts were not added to the mnt_bind list in
> collect_shared() because
> > mounts_equal() was never true for these files when their mountinfo entry
> was compared with the root mountinfo entry.
> >
> >                 /* Search bind-mounts */
> >                 if (list_empty(&m->mnt_bind)) {
> >                         /*
> >                          * A first mounted point will be set up as a
> source point
> >                          * for others. Look at propagate_mount()
> >                          */
> >                         for (t = m->next; t; t = t->next) {
> >                                 if (mounts_equal(m, t, true))
> >                                         list_add(&t->mnt_bind,
> &m->mnt_bind);
> >                         }
> >                 }
> >
> > So, I "fixed up" the root mountinfo entry to have the same values as
> external bind mounts, effectively
> > forcing a match in mounts_equal(), and restore worked.
> >
> > That said, this is not a good solution at all as it only works when all
> external bind mounts come from
> > the same device and fiilesystem.  But you can potentially have multiple
> external bind mounts each coming
> > from a different device and filesystem, with different options.  I need
> your help in addressing this issue.
> >
> > Finally, to complicate (or simplify?) things, after my latest rebase, I
> noticed that I don't have to fix
> > up the root mountinfo entry anymore to get the external bind mounts
> mounted!!  Somehow, all external bind
> > mounts make it to the mnt_bind list.  I do not know how to explain this
> because obviously mounts_equal()
> > returns false if any of devices, filesystems, sources, or options are
> different.
> >
> >         if (mi->s_dev != c->s_dev ||
> >             c->fstype != mi->fstype ||
> >             strcmp(c->source, mi->source) ||
> >             strcmp(c->options, mi->options))
> >                 return false;
> >
> > Could it be related to some of the recent mount-related commits?
> > For example, "mnt: Don't delay external mount points" 8b019e0bb4?
>
> For external bind mounts -- yes, it can. Can you check this? But I have a
> feeling
> that I don't understand the problem with mounts correctly.
>

Yes, I confirmed that "mnt: Don't delay external mount points" fixed the
issue so I ripped out all "reference" code (yay!).

Please review/test the attached patch set, which I rebased to head before
sending.  Once the final changes are made, I will send you a signed-off
patch with a descriptive commit message.

Cheers!

--Saied
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140815/34a2a41f/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: aufs-4.patch
Type: text/x-patch
Size: 16436 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140815/34a2a41f/attachment-0001.bin>


More information about the CRIU mailing list