[Devel] Re: [RFC v14-rc2][PATCH 09/29] Dump open file descriptors

Sukadev Bhattiprolu sukadev at linux.vnet.ibm.com
Mon Apr 6 20:28:23 PDT 2009


A few minor comments.

Oren Laadan [orenl at cs.columbia.edu] wrote:
| From 6f0a1dc1db8fdac766b00f90e04e06a5827af459 Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl at cs.columbia.edu>
| Date: Mon, 30 Mar 2009 13:59:34 -0400
| Subject: [PATCH 09/29] Dump open file descriptors
| 
| Dump the files_struct of a task with 'struct cr_hdr_files', followed by
| all open file descriptors. Because the 'struct file' corresponding to an
| FD can be shared, each they are assigned an objref and registered in the
| object hash. A reference to the 'file *' is kept for as long as it lives
| in the hash (the hash is only cleaned up at the end of the checkpoint).
| 
| For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
| close-on-exec property, and the objref of the corresponding 'file *'.
| If the FD is to be saved (first time) then this is followed by a
| 'struct cr_hdr_fd_data' with the FD state. Then will come the next FD
| and so on.
| 
| Recall that it is assumed that all tasks possibly sharing the file table
| are frozen. If this assumption breaks, then the behavior is *undefined*:
| checkpoint may fail, or restart from the resulting image file will fail.
| 
| This patch only handles basic FDs - regular files, directories.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Use only unsigned fields in checkpoint headers
|   - Rename:  cr_write_files() => cr_write_fd_table()
|   - Rename:  cr_write_fd_data() => cr_write_file()
|   - Discard field 'h->parent'
|   - Check whether calls to cr_hbuf_get() fail
|   - Use one CR_FD_GENERIC for both regular files and dirs
|   - Put code for generic file descriptors in a separate function
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Changelog[v11]:
|   - Discard handling of opened symlinks (there is no such thing)
|   - cr_scan_fds() retries from scratch if hits size limits
| 
| Changelog[v9]:
|   - Fix a couple of leaks in cr_write_files()
|   - Drop useless kfree from cr_scan_fds()
| 
| Changelog[v8]:
|   - initialize 'coe' to workaround gcc false warning
| 
| Changelog[v6]:
|   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
|     (even though it's not really needed)
| 
| Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
| Acked-by: Serge Hallyn <serue at us.ibm.com>
| Signed-off-by: Dave Hansen <dave at linux.vnet.ibm.com>
| ---
|  arch/x86/include/asm/checkpoint_hdr.h |    2 +-
|  checkpoint/Makefile                   |    2 +-
|  checkpoint/checkpoint.c               |    4 +
|  checkpoint/checkpoint_file.h          |   17 +++
|  checkpoint/ckpt_file.c                |  247 +++++++++++++++++++++++++++++++++
|  include/linux/checkpoint.h            |    3 +-
|  include/linux/checkpoint_hdr.h        |   30 ++++-
|  7 files changed, 301 insertions(+), 4 deletions(-)
|  create mode 100644 checkpoint/checkpoint_file.h
|  create mode 100644 checkpoint/ckpt_file.c
| 
| diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
| index e9eb40c..1efdf24 100644
| --- a/arch/x86/include/asm/checkpoint_hdr.h
| +++ b/arch/x86/include/asm/checkpoint_hdr.h
| @@ -15,7 +15,7 @@
|  /*
|   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
|   * keep data 64-bit aligned: use padding for structure members, and use
| - * __attribute__ ((aligned (8))) for the entire structure.
| + * __attribute__((aligned (8))) for the entire structure.
|   *
|   * Quoting Arnd Bergmann:
|   *   "This structure has an odd multiple of 32-bit members, which means
| diff --git a/checkpoint/Makefile b/checkpoint/Makefile
| index 8368a03..1d92ed2 100644
| --- a/checkpoint/Makefile
| +++ b/checkpoint/Makefile
| @@ -3,4 +3,4 @@
|  #
|  
|  obj-$(CONFIG_CHECKPOINT) += sys.o checkpoint.o restart.o objhash.o \
| -		ckpt_mem.o rstr_mem.o
| +		ckpt_mem.o rstr_mem.o ckpt_file.o
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 422e1a3..d4e0007 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -250,6 +250,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
|  	cr_debug("memory: ret %d\n", ret);
|  	if (ret < 0)
|  		goto out;
| +	ret = cr_write_fd_table(ctx, t);
| +	cr_debug("files: ret %d\n", ret);
| +	if (ret < 0)
| +		goto out;
|  	ret = cr_write_thread(ctx, t);
|  	cr_debug("thread: ret %d\n", ret);
|  	if (ret < 0)
| diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
| new file mode 100644
| index 0000000..9dc3eba
| --- /dev/null
| +++ b/checkpoint/checkpoint_file.h
| @@ -0,0 +1,17 @@
| +#ifndef _CHECKPOINT_CKPT_FILE_H_
| +#define _CHECKPOINT_CKPT_FILE_H_
| +/*
| + *  Checkpoint file descriptors
| + *
| + *  Copyright (C) 2008 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/fdtable.h>
| +
| +int cr_scan_fds(struct files_struct *files, int **fdtable);
| +
| +#endif /* _CHECKPOINT_CKPT_FILE_H_ */
| diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
| new file mode 100644
| index 0000000..9c344c7
| --- /dev/null
| +++ b/checkpoint/ckpt_file.c
| @@ -0,0 +1,247 @@
| +/*
| + *  Checkpoint file descriptors
| + *
| + *  Copyright (C) 2008-2009 Oren Laadan
| + *
| + *  This file is subject to the terms and conditions of the GNU General Public
| + *  License.  See the file COPYING in the main directory of the Linux
| + *  distribution for more details.
| + */
| +
| +#include <linux/kernel.h>
| +#include <linux/sched.h>
| +#include <linux/file.h>
| +#include <linux/fdtable.h>
| +#include <linux/checkpoint.h>
| +#include <linux/checkpoint_hdr.h>
| +
| +#include "checkpoint_file.h"
| +
| +#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
| +
| +/**
| + * cr_scan_fds - scan file table and construct array of open fds
| + * @files: files_struct pointer
| + * @fdtable: (output) array of open fds
| + *
| + * Returns the number of open fds found, and also the file table
| + * array via *fdtable. The caller should free the array.
| + *
| + * The caller must validate the file descriptors collected in the
| + * array before using them, e.g. by using fcheck_files(), in case
| + * the task's fdtable changes in the meantime.
| + */
| +int cr_scan_fds(struct files_struct *files, int **fdtable)
| +{
| +	struct fdtable *fdt;
| +	int *fds = NULL;
| +	int i, n;
| +	int tot = CR_DEFAULT_FDTABLE;
| +
| +	/*
| +	 * We assume that all tasks possibly sharing the file table are
| +	 * frozen (or we our a single process and we checkpoint ourselves).

Nit: s/our/are/

| +	 * Therefore, we can safely proceed after krealloc() from where we
| +	 * left off. Otherwise the file table may be modified by another
| +	 * task after we scan it. The behavior is this case is undefined,
| +	 * and either and either checkpoint or restart will likely fail.

Nit: s/and either//

| +	 */
| + retry:
| +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
| +	if (!fds)
| +		return -ENOMEM;
| +
| +	spin_lock(&files->file_lock);
| +	rcu_read_lock();
| +	fdt = files_fdtable(files);
| +	for (n = 0, i = 0; i < fdt->max_fds; i++) {

Hmm, if we want to start where we left-off before the krealloc, shouldn't
we initialize 'i' and 'n' to 0 before the 'retry:' ? Or maybe I misunderstand
what you mean by "where we left-off" comment above.



| +		if (!fcheck_files(files, i))
| +			continue;
| +		if (n == tot) {
| +			spin_unlock(&files->file_lock);
| +			rcu_read_unlock();
| +			tot *= 2;	/* won't overflow: kmalloc will fail */
| +			goto retry;
| +		}
| +		fds[n++] = i;
| +	}
| +	rcu_read_unlock();
| +	spin_unlock(&files->file_lock);
| +
| +	*fdtable = fds;
| +	return n;
| +}
| +
| +static int cr_write_file_generic(struct cr_ctx *ctx, struct file *file,
| +				 struct cr_hdr_file *hh)
| +{
| +	struct cr_hdr h;
| +	int ret;
| +
| +	/*
| +	 * FIX: check if the file/dir/link is unlinked
| +	 *
| +	 * Or, pass up somthing like in hh->flags to tell
| +	 * the higher-level code that it needs to bring
| +	 * along the file contents too.
| +	 */
| +
| +	h.type = CR_HDR_FILE;
| +	h.len = sizeof(*hh);
| +
| +	hh->fd_type = CR_FD_GENERIC;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	if (ret < 0)
| +		return ret;
| +
| +	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
| +}
| +
| +/* cr_write_file - dump the state of a given file pointer */
| +static int cr_write_file(struct cr_ctx *ctx, struct file *file)
| +{
| +	struct cr_hdr_file *hh;
| +	struct dentry *dent = file->f_dentry;
| +	struct inode *inode = dent->d_inode;
| +	int ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	hh->f_flags = file->f_flags;
| +	hh->f_mode = file->f_mode;
| +	hh->f_pos = file->f_pos;
| +	hh->f_version = file->f_version;
| +	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
| +
| +	/*
| +	 * FIXME: when we'll add support for unlinked files/dirs, we'll
| +	 * need to distinguish between unlinked filed and unlinked dirs.
| +	 */
| +	switch (inode->i_mode & S_IFMT) {
| +	case S_IFREG:
| +	case S_IFDIR:
| +		ret = cr_write_file_generic(ctx, file, hh);
| +		break;
| +	default:
| +		ret = -EBADF;
| +		break;
| +	}
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +
| +	return ret;
| +}
| +
| +/**
| + * cr_write_fd_ent - dump the state of a given file descriptor
| + * @ctx: checkpoint context
| + * @files: files_struct pointer
| + * @fd: file descriptor
| + *
| + * Saves the state of the file descriptor; looks up the actual file
| + * pointer in the hash table, and if found saves the matching objref,
| + * otherwise calls cr_write_file to dump the file pointer too.
| + */
| +static int
| +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_fd_ent *hh;
| +	struct file *file;
| +	struct fdtable *fdt;
| +	int objref, new, ret;
| +	int coe = 0;	/* avoid gcc warning */
| +
| +	rcu_read_lock();
| +	fdt = files_fdtable(files);
| +	file = fcheck_files(files, fd);
| +	if (file) {
| +		coe = FD_ISSET(fd, fdt->close_on_exec);
| +		get_file(file);
| +	}
| +	rcu_read_unlock();
| +
| +	/* sanity check (although this shouldn't happen) */
| +	if (!file)
| +		return -EBADF;
| +
| +	/* adding 'file' to the hash will keep a reference to it */
| +	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
| +	cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
| +
| +	if (new < 0)
| +		return new;
| +
| +	h.type = CR_HDR_FD_ENT;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh) {
| +		fput(file);
| +		return -ENOMEM;
| +	}
| +
| +	hh->objref = objref;
| +	hh->fd = fd;
| +	hh->close_on_exec = coe;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	if (ret < 0)
| +		goto out;
| +
| +	/* new==1 if-and-only-if file was newly added to hash */
| +	if (new)
| +		ret = cr_write_file(ctx, file);
| +
| +out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (file)
| +		fput(file);
| +	return ret;
| +}
| +
| +int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t)
| +{
| +	struct cr_hdr h;
| +	struct cr_hdr_fd_table *hh;
| +	struct files_struct *files;
| +	int *fdtable = NULL;
| +	int nfds, n, ret;
| +
| +	h.type = CR_HDR_FD_TABLE;
| +	h.len = sizeof(*hh);
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	files = get_files_struct(t);
| +
| +	nfds = cr_scan_fds(files, &fdtable);
| +	if (nfds < 0) {
| +		ret = nfds;
| +		goto out;
| +	}
| +
| +	hh->objref = 0;	/* will be meaningful with multiple processes */
| +	hh->nfds = nfds;
| +
| +	ret = cr_write_obj(ctx, &h, hh);
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	if (ret < 0)
| +		goto out;
| +
| +	cr_debug("nfds %d\n", nfds);
| +	for (n = 0; n < nfds; n++) {
| +		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
| +		if (ret < 0)
| +			break;
| +	}
| +
| + out:
| +	kfree(fdtable);
| +	put_files_struct(files);
| +	return ret;
| +}
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index 88854a9..9489ea5 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -13,7 +13,7 @@
|  #include <linux/path.h>
|  #include <linux/fs.h>
|  
| -#define CR_VERSION  1
| +#define CR_VERSION  2
|  
|  struct cr_ctx {
|  	int crid;		/* unique checkpoint id */
| @@ -85,6 +85,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
|  
|  extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
|  extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
| +extern int cr_write_fd_table(struct cr_ctx *ctx, struct task_struct *t);
|  
|  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
|  extern int cr_read_mm(struct cr_ctx *ctx);
| diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
| index 2a06a2f..a6b6dce 100644
| --- a/include/linux/checkpoint_hdr.h
| +++ b/include/linux/checkpoint_hdr.h
| @@ -17,7 +17,7 @@
|  /*
|   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
|   * keep data 64-bit aligned: use padding for structure members, and use
| - * __attribute__ ((aligned (8))) for the entire structure.
| + * __attribute__((aligned (8))) for the entire structure.

Nit: Fix in earlier patch ?

Sukadev
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers




More information about the Devel mailing list