[CRIU] AUFS Support in CRIU
Saied Kazemi
saied at google.com
Thu Aug 14 12:22:36 PDT 2014
On Wed, Aug 13, 2014 at 1:13 AM, Pavel Emelyanov <xemul at parallels.com>
wrote:
> On 08/13/2014 04:29 AM, Saied Kazemi wrote:
>
> Hi, Saied.
>
> > Here is a new patch. I rebased to the head (commit ded04267f8) and
> cleaned up the code a
> > bit more. Please use the attached patch instead of the one I sent
> yesterday.
> >
> > Note that I had to delete 3 lines from cgroup.c for properties that I
> don't have on my
> > Ubuntu 14.04. Also, please note that you have to add --manage-cgroups
> to criu command
> > line for both dump and restore.
>
> Thank you for looking into this. I'm not familiar with AUFS at all, thus I
> only have comments about the way AUFS parsing/fixing code is integrated
> into
> the rest of the criu code.
>
I am not familiar with AUFS internals either. I just looked at how it's
set up and what information it passes through /proc.
>
> Please, see my comments inline.
>
> > @@ -197,6 +199,19 @@ int fill_fdlink(int lfd, const struct fd_parms *p,
> struct fd_link *link)
> > return -1;
> > }
> >
> > + /*
> > + * For AUFS support, we need to replace absolute
> > + * branch pathnames with relative pathnames from root.
> > + */
> > + if (opts.aufs) {
>
> At this point we have the statfs() information. Can we rely on the
> fs_type being aufs magic to check that the file is aufs one?
>
No, because depending on how we end up in fill_fdlink(), p->fstype may be
NULL. Besides, even when set, its fstyp isn't aufs. It'd be something
like ext4 because the branch is in ext4.
> > + int n = sizeof link->name - 1;
> > + n = fixup_aufs_path(&link->name[1], n, true);
> > + if (n < 0)
> > + return -1;
> > + if (n > 0)
> > + len = n;
> > + }
> > +
> > link->len = len + 1;
> > return 0;
>
> > @@ -450,12 +453,24 @@ int parse_smaps(pid_t pid, struct vm_area_list
> *vma_area_list, bool use_map_file
> > vma_area->st = prev->st;
> > } else if (vma_area->vm_file_fd >= 0) {
> > struct stat *st_buf;
> > + char *f;
> >
> > st_buf = vma_area->st = xmalloc(sizeof(*st_buf));
> > if (!st_buf)
> > goto err;
> >
> > - if (fstat(vma_area->vm_file_fd, st_buf) < 0) {
> > + /*
> > + * For AUFS support, we cannot fstat() file a
> descriptor that
> > + * is a symbolic link to a branch. Instead, we
> obtain the
> > + * pathname of the file from the root and use
> stat().
> > + */
> > + if (opts.aufs && (f =
> fixup_aufs_fd_path(vma_area->vm_file_fd))) {
>
> Would this work if the vm_file_fd sits in another mount namespace? Or
> the path reported by fixup_aufs_fd_path() is always in the criu's one?
>
I think it should work, although I don't know how to force it in a
different mount namespace in the tests that I am running. Can you whip up
an example?
> And another question -- do we need the fixed stat buffer that early?
>
I am doing this exactly at the same time that CRIU is doing it (i.e., the
fstat() is original code). I'm just using stat() instead of fstat(), so
effectively it's the same logic/time as before.
> The vma fd is dumped in dump_filemap() and in that place we call
> get_fd_mntid() for missing mount ID. Can we fixup the device there too?
>
The way I handle it is like this: If the link file descriptor points to an
AUFS branch name, we replace it with a pathname from root. Here is the
call sequence from dump_filemap():
dump_filemap()
dump_one_reg_file(lfd)
fill_fdlink(lfd)
read_fd_link(lfd)
readlink(lfd) // returns path in branch
fixup_aufs_brnach() // replaces path in branch with path from
root
check_path_remap()
In other words, when we see a pathname in a branch, we replace it with a
pathname from root as if we never saw the branch pathname. When using a
link file descriptor in fstat(), because the kernel returns the stat info
of the pathname in branch, we use stat() with a pathname from the root
instead of fstat(). If we didn't do this, we'd get different device/inode
values and CRIU fails with an error message like "Unaccessible path opened
33:23, need 2049:53764".
> > + if (stat(f, st_buf) < 0) {
> > + pr_perror("Failed fstat on %d's
> map %lu (%s)",
> > + pid, start, f);
> > + goto err;
> > + }
> > + } else if (fstat(vma_area->vm_file_fd, st_buf) <
> 0) {
> > pr_perror("Failed fstat on %d's map %lu",
> pid, start);
> > goto err;
> > }
>
> > @@ -921,6 +942,16 @@ static int parse_mountinfo_ent(char *str, struct
> mount_info *new)
> > if (ret != 3)
> > return -1;
> >
> > + /* see comments in sysfs_parse.c */
> > + if (opts.aufs && !strcmp(new->mountpoint, "./")) {
> > + if (strcmp(fstype, "aufs")) {
>
> Can we reuse the fstype->parse() callback for this? This one gets called
> in parse_mountinfo() in a loop after parse_mountinfo_ent().
>
Yes, we can and I changed the code accordingly (added aufs to fstypes[]).
Please use the new patch.
>
> > + pr_err("Expected fstype aufs got %s\n", fstype);
> > + return -1;
> > + }
> > + if (fixup_src_opt(&new->source, &opt) < 0)
> > + return -1;
> > + }
> > +
> > ret = -1;
> > new->fstype = find_fstype_by_name(fstype);
> >
>
> > +/*
> > + * Copy the line in process's mountinfo file that corresponds
> > + * to the mountpoint specified by the mntpoint argument. Return
> > + * the number of characters parsed in the line, or -1 on error.
> > + */
> > +int get_mountinfo_by_mountpoint(pid_t pid, char *mntpoint, char *line,
> int linelen)
>
> I do not fully understand why this mountpoint->something-else parser
> is required. Can we put all the code parsing aufs-specific stuff into
> the fstype->parse() callback I mentioned above? When criu starts
> parsing mount namespaces (it does this before doing anything else)
> this info will be parsed and stored into mount_info->private. Later,
> when we need to e.g. fixup file paths we can get file->mnt_id, get
> mount_info by it, then check the fs being aufs and fix the path
> according to the mount_info->private data we have. Would this work?
>
I am now using the parse callback that you suggested but we need the
"reference" values *before* parse_mountinfo() is called. Otherwise, when
we call the parse callback, we won't know what values to replace with.
That said, I noticed that with my latest rebase there is apparently no
longer a need to specify --aufs-ref. This is great as we can get rid of
the code fixing up the mountinfo root entry (note that we'd still need to
replace / with the --aufs-root).
Since I am not familiar with the internals of CRIU as to how it builds its
bind mount list, could you please review this part of the code? As
mentioned before, it used to be that without --aufs-ref, mounts_equal()
would return false, in my test cases, for /etc/hosts, /etc/hostname, and
/etc/resolv.conf because their dev, fstype, source, and options were
different. As a result, these files would not be added to the mnt_bind
list in collect_shared() and restore would fail with the error message "A
few mount points can't be mounted." I am puzzled that restore is now
successful without --aufs-ref. I need to look into this deeper myself but
would really appreciate your thoughts and feedback.
> > +{
> > + char buf[PATH_MAX];
> > + int n, ret;
> > + FILE *f;
>
> > +
> > + snprintf(buf, sizeof buf, "/proc/%d/mountinfo", pid);
> > + f = fopen(buf, "r");
> > + if (!f) {
> > + pr_perror("Cannot fopen %s", buf);
> > + return -1;
> > + }
> > +
> > + do {
> > + n = linelen - 2;
> > + line[n] = '\0'; /* detect long input */
> > + if (fgets(line, linelen, f) == NULL) {
> > + ret = 0;
> > + break;
> > + }
> > + if (line[n] && line[n] != '\n') {
> > + pr_err("Line in mountinfo too long\n");
> > + ret = -1;
> > + goto out;
> > + }
> > +
> > + ret = sscanf(line, "%*i %*i %*u:%*u %*s %s %*s - %n", buf,
> &n);
>
> Wow. Doesn't gcc prints a warning about "too few arguments for format"?
>
No, because * means discard what's scanned.
>
> > + if (ret != 1) {
> > + pr_err("Cannot parse mountpoint (%s)\n", line);
> > + ret = -1;
> > + goto out;
> > + }
> > + } while (strcmp(buf, mntpoint));
> > +
> > + if (!ret) {
> > + pr_err("Did not find %s in mountinfo\n", mntpoint);
> > + ret = -1;
> > + } else
> > + ret = n;
> > +
> > +out:
> > + fclose(f);
> > + return ret;
> > +}
>
> Thanks,
> Pavel
>
>
Cheers!
--Saied
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140814/8d41cf7c/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: aufs-3.patch
Type: text/x-patch
Size: 21654 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/criu/attachments/20140814/8d41cf7c/attachment-0001.bin>
More information about the CRIU
mailing list