[Devel] Re: [PATCH 5/6] [RFC] Checkpoint/restart unlinked files
Sukadev Bhattiprolu
sukadev at linux.vnet.ibm.com
Fri Oct 22 16:43:44 PDT 2010
Matt Helsley [matthltc at us.ibm.com] wrote:
<snip>
| To understand why relinking is extremely useful for checkpoint/restart
| consider this simple pseudocode program and a specific example checkpoint
| of it:
I can see how relinking the file simplifies C/R :-) But patch 2 indicates
not all filesystems can support relink. Hope there aren't too many of those.
|
| a_fd = open("a"); /* example: size of the file at "a" is 1GB */
| link("a", "b");
| unlink("a");
| creat("a");
| <---- example: checkpoint happens here
| write(a_fd, "bar");
|
| The file "a" is unlinked and a different file has been placed at that
| path. a_fd still refers to the inode shared with "b".
|
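For anyone who wants to see this state from userspace, here is a small
sketch of the sequence above. It is my own illustration, not part of the
patch: the function name, the temp directory under /tmp, and the file
modes are all assumptions. It creates "a" itself (the pseudocode assumes
it already exists) and then checks that a_fd shares an inode with "b" and
not with the new "a".

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Replays open/link/unlink/creat in a scratch directory.
 * Returns 1 if a_fd aliases "b" but not the new "a", 0 if not,
 * and -1 if any setup step fails.
 */
int unlinked_fd_aliases_b(void)
{
	char dir[] = "/tmp/relink-demo-XXXXXX";	/* hypothetical scratch dir */
	char a[PATH_MAX], b[PATH_MAX];
	struct stat st_fd, st_a, st_b;
	int a_fd, new_fd, result = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(a, sizeof(a), "%s/a", dir);
	snprintf(b, sizeof(b), "%s/b", dir);

	a_fd = open(a, O_CREAT | O_RDWR, 0600);
	if (a_fd < 0)
		goto out_rmdir;
	if (link(a, b) < 0 || unlink(a) < 0)
		goto out_close;
	new_fd = creat(a, 0600);	/* a different file now lives at "a" */
	if (new_fd < 0)
		goto out_close;
	close(new_fd);

	/* The checkpoint in the example would happen here. */
	if (fstat(a_fd, &st_fd) || stat(a, &st_a) || stat(b, &st_b))
		goto out_close;
	assert(st_fd.st_nlink >= 1);	/* "b" still holds a link */
	result = st_fd.st_dev == st_b.st_dev &&
		 st_fd.st_ino == st_b.st_ino &&
		 st_fd.st_ino != st_a.st_ino;
out_close:
	close(a_fd);
	unlink(a);
	unlink(b);
out_rmdir:
	rmdir(dir);
	return result;
}
```

Nothing in a_fd itself tells us that "b" exists, which is why a
checkpoint that only saves pathnames would have to search for it.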
| Without relinking we would need to walk the entire filesystem to find out
| that "b" is a path to the same inode
You may want to mention here that to checkpoint/restart a file, we
save/restore its pathname. So finding a path for the unlinked file 'a'
would require walking the entire filesystem to find any alias.
| (another variation on this case: "b"
| would also have been unlinked). We'd need to do this for every
| unlinked file that remains open in every task to checkpoint. Even then
| there is no guarantee such a "b" exists for every unlinked file -- the
| inodes could be "orphans" -- and we'd need to preserve their contents
| some other way.
|
| I considered a couple alternatives to preserving unlinked file contents:
s/couple/couple of/
| copying and file handles. Each has significant drawbacks.
|
| First I attempted to copy the file contents into the image and then
| recreate and unlink the file during restart. Using a simple version of
| that method the write above would not reach "b". One fix would be to search
| the filesystem for a file with the same inode number (inode of "b") and
| either open it or hardlink it to "a". Another would be to record the inode
| number. This either shifts the search from checkpoint time to restart time
| or has all the drawbacks of the second method I considered: file handles.
|
| Instead of copying contents or recording inodes I also considered using
| file handles. We'd need to ensure that the filehandles persist in storage,
| can be snapshotted/backed up, and can be migrated. Can handlefs or any
| generic file handle system do this? My _guess_ is "no" but folks are
| welcome to tell me I'm wrong.
|
| In contrast, linking the file from a_fd back into its filesystem can avoid
| these complexities. Relinking avoids the search for matching inodes and
| copying large quantities of data from storage only to write it back (in
| fact the data would be read-and-written twice -- once for checkpoint and
| once for restart). Like file handles it does require changes to the
| filesystem code. Unlike file handles, enabling relinking does not require
| every filesystem to support a new kind of filesystem "object" -- only
| an operation that is quite similar to one that already exists: link.
|
| Signed-off-by: Matt Helsley <matthltc at us.ibm.com>
| Cc: Eric Sandeen <sandeen at redhat.com>
| Cc: Theodore Ts'o <tytso at mit.edu>
| Cc: Andreas Dilger <adilger.kernel at dilger.ca>
| Cc: linux-ext4 at vger.kernel.org
| Cc: Jan Kara <jack at suse.cz>
| Cc: containers at lists.linux-foundation.org
| Cc: Oren Laadan <orenl at cs.columbia.edu>
| Cc: linux-fsdevel at vger.kernel.org
| Cc: Al Viro <viro at zeniv.linux.org.uk>
| Cc: Christoph Hellwig <hch at infradead.org>
| Cc: Jamie Lokier <jamie at shareable.org>
| Cc: Amir Goldstein <amir73il at users.sf.net>
| Cc: Aneesh Kumar <aneesh.kumar at linux.vnet.ibm.com>
| Cc: Miklos Szeredi <miklos at szeredi.hu>
| ---
| fs/checkpoint.c | 51 ++++++++++++++-----
| fs/namei.c | 102 ++++++++++++++++++++++++++++++++++++++
| fs/pipe.c | 2 +-
| include/linux/checkpoint.h | 3 +-
| include/linux/checkpoint_hdr.h | 3 +
| include/linux/checkpoint_types.h | 3 +
| 6 files changed, 149 insertions(+), 15 deletions(-)
|
| diff --git a/fs/checkpoint.c b/fs/checkpoint.c
| index 87d7c6e..9c7caec 100644
| --- a/fs/checkpoint.c
| +++ b/fs/checkpoint.c
| @@ -16,6 +16,7 @@
| #include <linux/sched.h>
| #include <linux/file.h>
| #include <linux/namei.h>
| +#include <linux/mount.h>
| #include <linux/fs_struct.h>
| #include <linux/fs.h>
| #include <linux/fdtable.h>
| @@ -26,6 +27,7 @@
| #include <linux/checkpoint.h>
| #include <linux/eventpoll.h>
| #include <linux/eventfd.h>
| +#include <linux/sys-wrapper.h>
| #include <net/sock.h>
|
| /**************************************************************************
| @@ -174,6 +176,9 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
| h->f_pos = file->f_pos;
| h->f_version = file->f_version;
|
| + if (d_unlinked(file->f_dentry))
| + /* Perform post-checkpoint and post-restart unlink() */
| + h->f_restart_flags |= RESTART_FILE_F_UNLINK;
| h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
| if (h->f_credref < 0)
| return h->f_credref;
| @@ -197,16 +202,6 @@ int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
| struct ckpt_hdr_file_generic *h;
| int ret;
|
| - /*
| - * FIXME: when we'll add support for unlinked files/dirs, we'll
| - * need to distinguish between unlinked filed and unlinked dirs.
| - */
| - if (d_unlinked(file->f_dentry)) {
| - ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
| - file);
| - return -EBADF;
| - }
| -
| h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
| if (!h)
| return -ENOMEM;
| @@ -220,6 +215,9 @@ int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
| if (ret < 0)
| goto out;
| ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
Hmm, what file name will be checkpointed here, if the file is unlinked ?
| + if (ret < 0)
| + goto out;
| + ret = checkpoint_file_links(ctx, file);
| out:
| ckpt_hdr_put(ctx, h);
| return ret;
| @@ -570,9 +568,11 @@ static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname)
| /**
| * restore_open_fname - read a file name and open a file
| * @ctx: checkpoint context
| + * @restore_unlinked: unlink the opened file
| * @flags: file flags
| */
| -struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
| +struct file *restore_open_fname(struct ckpt_ctx *ctx,
| + int restore_unlinked, int flags)
nit: s/restore_unlinked/unlinked/ ?
| {
| struct file *file;
| char *fname;
| @@ -586,8 +586,33 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
| if (len < 0)
| return ERR_PTR(len);
| ckpt_debug("fname '%s' flags %#x\n", fname, flags);
| -
| + if (restore_unlinked) {
| + kfree(fname);
| + fname = NULL;
| + len = ckpt_read_payload(ctx, (void **)&fname, PATH_MAX,
| + CKPT_HDR_BUFFER);
Hmm, is there a reason we need a special way to read the file name for
unlinked files ? After re-linking the file during checkpoint, can we
not treat it like any other open file (except for the flag) ?
| + if (len < 0)
| + return ERR_PTR(len);
| + fname[len] = '\0';
| + }
| file = filp_open(fname, flags, 0);
| + if (IS_ERR(file)) {
| + ckpt_err(ctx, PTR_ERR(file), "Could not open file \"%s\"\n", fname);
| +
| + goto out;
| + }
| + if (!restore_unlinked)
| + goto out;
| + if (S_ISDIR(file->f_mapping->host->i_mode))
| + len = kernel_sys_rmdir(fname);
| + else
| + len = kernel_sys_unlink(fname);
| + if (len < 0) {
| + ckpt_err(ctx, len, "Could not unlink \"%s\"\n", fname);
| + fput(file);
| + file = ERR_PTR(len);
| + }
nit: how about moving this unlink block to a smaller function ?
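Roughly what I have in mind, written as a userspace stand-in so it can be
compiled on its own. The names remove_restored_name() and
demo_remove_restored_name() are made up for illustration, and rmdir()/
unlink() stand in for kernel_sys_rmdir()/kernel_sys_unlink(). The helper
only picks the right call and returns 0 or a negative errno; the caller
would keep the fput() and ERR_PTR() handling.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remove the name behind an already-open fd: rmdir() for dirs, else unlink(). */
int remove_restored_name(int fd, const char *fname)
{
	struct stat st;

	assert(fname != NULL);
	if (fstat(fd, &st) < 0)
		return -errno;
	if (S_ISDIR(st.st_mode))
		return rmdir(fname) < 0 ? -errno : 0;
	return unlink(fname) < 0 ? -errno : 0;
}

/* Exercise the helper on one regular file and one directory. */
int demo_remove_restored_name(void)
{
	char dir[] = "/tmp/unlink-demo-XXXXXX";	/* hypothetical scratch dir */
	char path[PATH_MAX];
	int fd, dfd, ret;

	if (!mkdtemp(dir))
		return -errno;
	snprintf(path, sizeof(path), "%s/f", dir);

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		ret = -errno;
		rmdir(dir);
		return ret;
	}
	ret = remove_restored_name(fd, path);
	if (ret == 0 && access(path, F_OK) == 0)
		ret = -EEXIST;		/* the name should be gone */
	close(fd);
	if (ret < 0) {
		unlink(path);
		rmdir(dir);
		return ret;
	}

	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd < 0) {
		ret = -errno;
		rmdir(dir);
		return ret;
	}
	ret = remove_restored_name(dfd, dir);
	if (ret == 0 && access(dir, F_OK) == 0)
		ret = -EEXIST;
	close(dfd);
	if (ret < 0)
		rmdir(dir);		/* best-effort cleanup */
	return ret;
}
```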
| +out:
| kfree(fname);
|
| return file;
| @@ -692,7 +717,7 @@ static struct file *generic_file_restore(struct ckpt_ctx *ctx,
| ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
| return ERR_PTR(-EINVAL);
|
| - file = restore_open_fname(ctx, ptr->f_flags);
| + file = restore_open_fname(ctx, !!(ptr->f_restart_flags & RESTART_FILE_F_UNLINK), ptr->f_flags);
nit: long line
| if (IS_ERR(file))
| return file;
|
| diff --git a/fs/namei.c b/fs/namei.c
| index 8c9663d..69c4f4e 100644
| --- a/fs/namei.c
| +++ b/fs/namei.c
| @@ -32,6 +32,9 @@
| #include <linux/fcntl.h>
| #include <linux/device_cgroup.h>
| #include <linux/fs_struct.h>
| +#ifdef CONFIG_CHECKPOINT
| +#include <linux/checkpoint.h>
| +#endif
| #include <asm/uaccess.h>
|
| #include "internal.h"
| @@ -2543,6 +2546,105 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
| return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
| }
|
| +#ifdef CONFIG_CHECKPOINT
| +
| +/* Path relative to the mounted filesystem's root -- not a "global" root or even a namespace root. The unique_name_count is unique for the entire checkpoint. */
| +#define CKPT_RELINKAT_FMT "lost+found/checkpoint/unlink-at-restart-%08llx-%u"
| +
| +static int checkpoint_fill_relink_fname(struct ckpt_ctx *ctx,
nit. since it is a static function, we could probably drop the 'checkpoint_'
prefix in the name ?
| + struct file *for_file,
| + char relink_dir_pathname[PATH_MAX],
| + int *lenp)
| +{
| + struct path relink_dir_path;
nit. since the function name has "relink", maybe the variable names can
drop that prefix (code is easier to read with shorter variable names).
| + char *tmp;
| + int len;
| +
| + /* Find path to mount */
| + relink_dir_path.mnt = for_file->f_path.mnt;
| + relink_dir_path.dentry = relink_dir_path.mnt->mnt_root;
| + tmp = d_path(&relink_dir_path, relink_dir_pathname, PATH_MAX);
| + if (IS_ERR(tmp))
| + return PTR_ERR(tmp);
| +
| + /* Append path to relinked file. */
| + len = strlen(tmp);
| + if (len <= 0)
| + return -ENOENT;
| + memmove(relink_dir_pathname, tmp, len);
| + tmp = relink_dir_pathname + len - 1;
| + /* Ensure we've got a single dir separator */
| + if (*tmp == '/')
| + tmp++;
| + else {
| + tmp++;
we could simplify the 'if-else' by making the tmp++ unconditional (or by
removing the -1 above).
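To make that concrete, here is a standalone sketch of the separator
handling with a single branch. ensure_trailing_slash() is a made-up name
and the explicit size check is my addition; like the patch, it only looks
at the last character and does not collapse repeated slashes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Make sure the NUL-terminated path in buf ends with '/'.
 * Returns the resulting length, or -1 if the path is empty or
 * there is no room for the extra separator.
 */
int ensure_trailing_slash(char *buf, size_t size)
{
	size_t len;

	assert(buf != NULL);
	len = strlen(buf);
	if (len == 0)
		return -1;
	if (buf[len - 1] == '/')
		return (int)len;	/* already terminated by a separator */
	if (len + 2 > size)
		return -1;		/* need room for '/' and the NUL */
	buf[len++] = '/';
	buf[len] = '\0';
	return (int)len;
}
```

After this, the caller can append the relink name at buf + len without
any further pointer adjustment.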
| + *tmp = '/';
| + tmp++;
| + len++;
| + }
| + len += snprintf(tmp, PATH_MAX - len, CKPT_RELINKAT_FMT,
| + ctx->ktime_begin.tv64,
| + ++ctx->unique_name_count);
Since the format depends on additional parameters (tv64, unique_name_count),
any change to the format will require updates in multiple places in the
future, right? That would make the CKPT_RELINKAT_FMT macro less useful.
Instead, how about a function like this that could be used during both
checkpoint and restart:

	static inline int generate_relinked_path(ctx, buf, len)
	{
		return sprintf(...);
	}
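A slightly fuller userspace sketch of such a helper, under my own
assumptions: the timestamp and counter are passed in explicitly instead
of being read from ctx, DEMO_RELINKAT_FMT is just a copy of the format
string from the patch, and the truncation check is my addition. The check
is there because snprintf() returns the length the output would have had,
so adding its return value to a running length can point past the end of
the buffer when the name does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Copy of the format string from the patch, kept next to its only user. */
#define DEMO_RELINKAT_FMT "lost+found/checkpoint/unlink-at-restart-%08llx-%u"

/*
 * Format the relink name into buf.
 * ktime_begin stands in for ctx->ktime_begin.tv64 and count for
 * ctx->unique_name_count. Returns the string length, or -1 if the
 * name did not fit in len bytes.
 */
int generate_relinked_path(char *buf, size_t len,
			   long long ktime_begin, unsigned int count)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, DEMO_RELINKAT_FMT,
		     (unsigned long long)ktime_begin, count);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* error or truncated output */
	return n;
}
```

With the format and its arguments in one place, a later change to the
naming scheme only touches this function.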
| + relink_dir_pathname[len] = '\0';
| + *lenp = len;
| + return 0;
| +}
| +
| +static int checkpoint_file_relink(struct ckpt_ctx *ctx,
| + struct file *file,
| + char new_path[PATH_MAX])
| +{
| + int ret, len;
| +
| + /*
| + * Relinking arbitrary files without searching a path
| + * (which non-existent if the file is unlinked) requires
s/which/which is/
s/file is/file was/
| + * special privileges.
| + */
| + if (!capable(CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH)) {
| + ckpt_err(ctx, -EPERM, "%(T)Relinking unlinked files requires CAP_DAC_{OVERRIDE,READ_SEARCH}\n");
nit: long line
| + return -EPERM;
| + }
nit: a blank line here might help
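One more thing to double-check in the capable() call above. If I am
reading include/linux/capability.h correctly, the CAP_* constants are
capability numbers, not bit flags, so OR-ing two of them produces a third
capability number instead of "both". The sketch below uses local copies of
the three values (the DEMO_ names and the callback type are mine, the
numbers are what I believe the header defines) and assumes both
capabilities are meant to be required, as the error message suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the values in include/linux/capability.h (assumed). */
#define DEMO_CAP_DAC_OVERRIDE		1
#define DEMO_CAP_DAC_READ_SEARCH	2
#define DEMO_CAP_FOWNER			3

/* Hypothetical capable()-like callback: nonzero if the capability is held. */
typedef int (*cap_check_fn)(int cap);

/* Check each capability separately instead of OR-ing the numbers. */
int has_relink_caps(cap_check_fn check)
{
	assert(check != NULL);
	return check(DEMO_CAP_DAC_OVERRIDE) &&
	       check(DEMO_CAP_DAC_READ_SEARCH);
}

/* Example task that holds only CAP_FOWNER. */
int only_fowner(int cap)
{
	return cap == DEMO_CAP_FOWNER;
}
```

With these values, 1 | 2 == 3, so the OR-ed form would be satisfied by a
task holding only CAP_FOWNER, while the separate checks reject it.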
| + ret = checkpoint_fill_relink_fname(ctx, file, new_path, &len);
| + if (ret)
| + return ret;
| + ret = do_kern_linkat(&file->f_path, file->f_dentry,
| + AT_FDCWD, new_path, 0);
| + if (ret)
| + ckpt_err(ctx, ret, "%(T)%(P)%(V)Failed to relink unlinked file.\n", file, file->f_op);
nit: long line
| + return ret;
| +}
| +
| +int checkpoint_file_links(struct ckpt_ctx *ctx, struct file *file)
| +{
| + char *new_link_path;
| + int ret, len;
| +
| + if (!d_unlinked(file->f_dentry))
| + return 0;
| +
| + /*
| + * Unlinked files need at least one hardlink for the post-sys_checkpoint
| + * filesystem backup/snapshot.
| + */
| + new_link_path = kmalloc(PATH_MAX, GFP_KERNEL);
| + if (!new_link_path)
| + return -ENOMEM;
| + ret = checkpoint_file_relink(ctx, file, new_link_path);
| + if (ret < 0)
| + goto out_free;
| + len = strlen(new_link_path);
| + ret = ckpt_write_obj_type(ctx, NULL, len + 1, CKPT_HDR_BUFFER);
| + if (ret < 0)
| + goto out_free;
| + ret = ckpt_kwrite(ctx, new_link_path, len + 1);
| +out_free:
| + kfree(new_link_path);
| +
| + return ret;
| +}
nit: some blank lines separating the different sections of the function will
help readability
Sukadev