[CRIU] AUFS Support in CRIU
Pavel Emelyanov
xemul at parallels.com
Fri Aug 15 02:48:52 PDT 2014
On 08/15/2014 01:54 AM, Saied Kazemi wrote:
>
>
>
> On Thu, Aug 14, 2014 at 1:01 PM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com>> wrote:
>
> On 08/14/2014 11:22 PM, Saied Kazemi wrote:
> >
> >
> >
> > On Wed, Aug 13, 2014 at 1:13 AM, Pavel Emelyanov <xemul at parallels.com <mailto:xemul at parallels.com> <mailto:xemul at parallels.com <mailto:xemul at parallels.com>>> wrote:
> >
> > On 08/13/2014 04:29 AM, Saied Kazemi wrote:
> >
> > Hi, Saied.
> >
> > > Here is a new patch. I rebased to the head (commit ded04267f8) and cleaned up the code a
> > > bit more. Please use the attached patch instead of the one I sent yesterday.
> > >
> > > Note that I had to delete 3 lines from cgroup.c for properties that I don't have on my
> > > Ubuntu 14.04. Also, please note that you have to add --manage-cgroups to criu command
> > > line for both dump and restore.
> >
> > Thank you for looking into this. I'm not familiar with AUFS at all, thus I
> > only have comments about the way AUFS parsing/fixing code is integrated into
> > the rest of the criu code.
> >
> >
> > I am not familiar with AUFS internals either. I just looked at how it's set up and what information it passes through /proc.
> >
> >
> >
> >
> > Please, see my comments inline.
> >
> > > @@ -197,6 +199,19 @@ int fill_fdlink(int lfd, const struct fd_parms *p, struct fd_link *link)
> > > return -1;
> > > }
> > >
> > > + /*
> > > + * For AUFS support, we need to replace absolute
> > > + * branch pathnames with relative pathnames from root.
> > > + */
> > > + if (opts.aufs) {
> >
> > At this point we have the statfs() information. Can we rely on the
> > fs_type being aufs magic to check that the file is aufs one?
> >
> >
> > No, because depending on how we end up in fill_fdlink(), p->fstype may be NULL.
> > Besides, even when set, its fstyp isn't aufs. It'd be something like ext4 because
> > the branch is in ext4.
>
> Damn :( So it's not a "real" filesystem, it just redirects all calls to some
> other, underlying one, right? And if I opened a file which is on AUFS mount,
> I will end up in some other FS.
>
>
> Yes, it's not a "real" filesystem like ext4 but this can actually be a feature because
> you can stack different layers from different filesystems to present a consolidated fs
> view to the user.
>
>
>
> BTW, can we re-use the mnt_id for that? If I call open() on some path that
> is on AUFS mount point, would the mnt_id of a file be the AUFS's one?
>
>
> No, we cannot use mnt_id because it's uninitialized (-1). Even if it were initialized, I
> am not sure mnt_id alone would suffice to identify AUFS pathnames that should be replaced.
It's uninitialized, because CRIU hasn't read one in. I mean -- if we look at the fdinfo
data of a file opened on AUFS, would the mnt_id value be the one from AUFS mount, or from
the underlying FS mount?
> Also, please note that we're handling the files that the process is executing (e.g.,
> /bin/busybox, /lib/libuClibc-0.9.33.2.so <http://libuClibc-0.9.33.2.so>,
> /lib/ld64-uClibc-0.9.33.2.so <http://ld64-uClibc-0.9.33.2.so>), not the files that the
> process has opened. When we get to fill_fdlink() for these files, mnt_id is -1.
>
>
>
> > > + int n = sizeof link->name - 1;
> > > + n = fixup_aufs_path(&link->name[1], n, true);
> > > + if (n < 0)
> > > + return -1;
> > > + if (n > 0)
> > > + len = n;
> > > + }
> > > +
> > > link->len = len + 1;
> > > return 0;
> >
> > > @@ -450,12 +453,24 @@ int parse_smaps(pid_t pid, struct vm_area_list *vma_area_list, bool use_map_file
> > > vma_area->st = prev->st;
> > > } else if (vma_area->vm_file_fd >= 0) {
> > > struct stat *st_buf;
> > > + char *f;
> > >
> > > st_buf = vma_area->st = xmalloc(sizeof(*st_buf));
> > > if (!st_buf)
> > > goto err;
> > >
> > > - if (fstat(vma_area->vm_file_fd, st_buf) < 0) {
> > > + /*
> > > + * For AUFS support, we cannot fstat() file a descriptor that
> > > + * is a symbolic link to a branch. Instead, we obtain the
> > > + * pathname of the file from the root and use stat().
> > > + */
> > > + if (opts.aufs && (f = fixup_aufs_fd_path(vma_area->vm_file_fd))) {
> >
> > Would this work if the vm_file_fd sits in another mount namespace? Or
> > the path reported by fixup_aufs_fd_path() is always in the criu's one?
> >
> >
> > I think it should work, although I don't know how to force it in a different mount namespace in
> > the tests that I am running. Can you whip up an example?
>
> If you start a Docker container, all tasks in it will automatically happen
> in a mount namespace. You can check this by comparing where the /proc/pid/ns/mnt
> links point for a task in container and for e.g. your shell outside of it.
>
>
> Yes, when you do "docker run -i busybox:latest /bin/sh -i", /bin/sh comes from the container's mount
> namespace. I am not sure how to get /bin/sh from a mount namespace other than the container's.
I mean -- if we meet the /aufs_branch/foo/bar file in /proc/pid/map_files, is this path
we see visible from our mount namespace, or from the container's mount namespace?
>
>
> > And another question -- do we need the fixed stat buffer that early?
> >
> >
> > I am doing this exactly at the same time that CRIU is doing it (i.e., the fstat() is original code).
> > I'm just using stat() instead of fstat(), so effectively it's the same logic/time as before.
>
> OK. BTW, you say, that /proc/pid/map_files/ links point to "branches" paths.
> How about the /proc/pid/fd/ ones? Do they show paths from branches too?
>
>
> No, symlinks in /proc/pid/fd/ point to files relative to the root.
I.e. they are seen as if they are already fixup-ed by your code? I mean -- if I open /etc/hosts
and map /etc/hosts from container, would I see /etc/hosts in /proc/pid/fd and /val/lib/.../etc/hosts
in /proc/pid/map_files?
And another thing -- if we do open("/proc/pid/map_files/<something>") what would we see in the
newly appeared /proc/pid/fd/$fd ?
If /proc/pid/fd reveals "correct" paths, i.e. the /etc/hosts one, while /proc/pid/map_files/ show
"spoiled" paths, i.e. -- the ones from branches, then this is a kernel bug -- both directories
should behave the same way.
>
>
> > The vma fd is dumped in dump_filemap() and in that place we call
> > get_fd_mntid() for missing mount ID. Can we fixup the device there too?
> >
> >
> > The way I handle it is like this: If the link file descriptor points to an AUFS branch name,
> > we replace it with a pathname from root. Here is the call sequence from dump_filemap():
> >
> > dump_filemap()
> > dump_one_reg_file(lfd)
> > fill_fdlink(lfd)
> > read_fd_link(lfd)
> > readlink(lfd) // returns path in branch
> > fixup_aufs_brnach() // replaces path in branch with path from root
> > check_path_remap()
> >
> > In other words, when we see a pathname in a branch, we replace it with a pathname from root
> > as if we never saw the branch pathname. When using a link file descriptor in fstat(), because
> > the kernel returns the stat info of the pathname in branch, we use stat() with a pathname
> > from the root instead of fstat(). If we didn't do this, we'd get different device/inode
> > values and CRIU fails with an error message like "Unaccessible path opened 33:23, need 2049:53764".
>
> So if you stat() an fd opened on AUFS you get dev:inode pair from AUFS, not from the
> underlying ext4?
>
>
> Exactly. That's why we need to always stat the file either from the AUFS root or from the branch where
> it really lives. We cannot stat it from one path and expect to see the same dev:ino from a different path,
> although it's really the same file!
So, if by stat()-ing a file we see "virtual", i.e. the AUFS's device and inode, then we
can use this to distinguish AUFS files from non AUFS ones.
BTW, we have similar problem with btrfs -- it reports "wrong" device number in some
places, so we have to use the phys_stat_dev_match() helper to compare devices, maybe
something similar can be done with AUFS.
> > > + if (stat(f, st_buf) < 0) {
> > > + pr_perror("Failed fstat on %d's map %lu (%s)",
> > > + pid, start, f);
> > > + goto err;
> > > + }
> > > + } else if (fstat(vma_area->vm_file_fd, st_buf) < 0) {
> > > pr_perror("Failed fstat on %d's map %lu", pid, start);
> > > goto err;
> > > }
> >
> > > @@ -921,6 +942,16 @@ static int parse_mountinfo_ent(char *str, struct mount_info *new)
> > > if (ret != 3)
> > > return -1;
> > >
> > > + /* see comments in sysfs_parse.c */
> > > + if (opts.aufs && !strcmp(new->mountpoint, "./")) {
> > > + if (strcmp(fstype, "aufs")) {
> >
> > Can we reuse the fstype->parse() callback for this? This one gets called
> > in parse_mountinfo() in a loop after parse_mountinfo_ent().
> >
> >
> > Yes, we can and I changed the code accordingly (added aufs to fstypes[]). Please use the new patch.
>
> OK.
>
> > > + pr_err("Expected fstype aufs got %s\n", fstype);
> > > + return -1;
> > > + }
> > > + if (fixup_src_opt(&new->source, &opt) < 0)
> > > + return -1;
> > > + }
> > > +
> > > ret = -1;
> > > new->fstype = find_fstype_by_name(fstype);
> > >
> >
> > > +/*
> > > + * Copy the line in process's mountinfo file that corresponds
> > > + * to the mountpoint specified by the mntpoint argument. Return
> > > + * the number of characters parsed in the line, or -1 on error.
> > > + */
> > > +int get_mountinfo_by_mountpoint(pid_t pid, char *mntpoint, char *line, int linelen)
> >
> > I do not fully understand why this mountpoint->something-else parser
> > is required. Can we put all the code parsing aufs-specific stuff into
> > the fstype->parse() callback I mentioned above? When criu starts
> > parsing mount namespaces (it does this before doing anything else)
> > this info will be parsed and stored into mount_info->private. Later,
> > when we need to e.g. fixup file paths we can get file->mnt_id, get
> > mount_info by it, then check the fs being aufs and fix the path
> > according to the mount_info->private data we have. Would this work?
> >
> >
> > I am now using the parse callback that you suggested but we need the "reference" values
> > *before* parse_mountinfo() is called. Otherwise, when we call the parse callback, we
> > won't know what values to replace with.
>
> Ah, I see (I think). So by the time you get to parsing the AUFS's entry,
> you need to know the mount infos from other mountpoints which may not yet
> be parsed, am I right with that?
>
>
> You're right. That's why we call parse_aufs_reference() in collect_mntinfo() before calling parse_mountinfo().
OK, can we rework this part like this -- we add the .post_parse callback on the fstype,
and once parse_mountinfo meets an fs with one set it puts the created mi into separate
list and walks it, calling ->post_parse(), after all the mount-infos get collected.
>
>
>
> > That said, I noticed that with my latest rebase there is apparently no longer a need to
> > specify --aufs-ref. This is great as we can get rid of the code fixing up the mountinfo
> > root entry (note that we'd still need to replace / with the --aufs-root).
> >
> > Since I am not familiar with the internals of CRIU as to how it builds its bind mount
> > list, could you please review this part of the code? As mentioned before, it used to
> > be that without --aufs-ref, mounts_equal() would return false, in my test cases, for
> > /etc/hosts, /etc/hostname, and /etc/resolv.conf because their dev, fstype, source, and
> > options were different.
>
> So, the /etc/hosts is a mountpoint, whose root (the original file) is somewhere
> outside of the container, right? And the /etc/hosts file itself (i.e. -- before
> someone bind-mounts the external file on it) resides on AUFS, am I correct?
>
>
> Yes to both questions. You can see where /etc/hosts lives from the --ext-mount-map option.
OK, thanks for explanation.
> > As a result, these files would not be added to the mnt_bind
> > list in collect_shared() and restore would fail with the error message "A few mount
> > points can't be mounted." I am puzzled that restore is now successful without --aufs-ref.
> > I need to look into this deeper myself but would really appreciate your thoughts and feedback.
>
> I'll try :) Can you shed more light on the comment where you say that you
> replace device numbers, sources and options. AFAIU once you have AUFS's
> device, source and opt, you replace it with the underlying "branch"'s
> device, source and opt, do I get it correctly?
>
>
> When I investigated the cause of "A few mount points can't be mounted" during restore, I noticed
> that external bind mounts were not added to the mnt_bind list in collect_shared() because
> mounts_equal() was never true for these files when their mountinfo entry was compared with the root mountinfo entry.
>
> /* Search bind-mounts */
> if (list_empty(&m->mnt_bind)) {
> /*
> * A first mounted point will be set up as a source point
> * for others. Look at propagate_mount()
> */
> for (t = m->next; t; t = t->next) {
> if (mounts_equal(m, t, true))
> list_add(&t->mnt_bind, &m->mnt_bind);
> }
> }
>
> So, I "fixed up" the root mountinfo entry to have the same values as external bind mounts, effectively
> forcing a match in mounts_equal(), and restore worked.
>
> That said, this is not a good solution at all as it only works when all external bind mounts come from
> the same device and fiilesystem. But you can potentially have multiple external bind mounts each coming
> from a different device and filesystem, with different options. I need your help in addressing this issue.
>
> Finally, to complicate (or simplify?) things, after my latest rebase, I noticed that I don't have to fix
> up the root mountinfo entry anymore to get the external bind mounts mounted!! Somehow, all external bind
> mounts make it to the mnt_bind list. I do not know how to explain this because obviously mounts_equal()
> returns false if any of devices, filesystems, sources, or options are different.
>
> if (mi->s_dev != c->s_dev ||
> c->fstype != mi->fstype ||
> strcmp(c->source, mi->source) ||
> strcmp(c->options, mi->options))
> return false;
>
> Could it be related to some of the recent mount-related commits?
> For example, "mnt: Don't delay external mount points" 8b019e0bb4?
For external bind mounts -- yes, it can. Can you check this? But I have a feeling
that I don't understand the problem with mounts correctly.
Thanks,
Pavel
More information about the CRIU
mailing list