[Devel] Re: [RFC v14-rc2][PATCH 15/29] Restart multiple processes
Sukadev Bhattiprolu
sukadev at linux.vnet.ibm.com
Mon Apr 6 20:33:15 PDT 2009
Couple of nits and couple of not-so minor comments
Oren Laadan [orenl at cs.columbia.edu] wrote:
| From 7162fef93ee3d9fd30a457dd7b0c7ad0200d5bcb Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl at cs.columbia.edu>
| Date: Mon, 30 Mar 2009 15:06:13 -0400
| Subject: [PATCH 15/29] Restart multiple processes
|
| Restarting of multiple processes expects all restarting tasks to call
| sys_restart(). Once inside the system call, each task will restart
| itself at the same order that they were saved. The internals of the
| syscall will take care of in-kernel synchronization bewteen tasks.
|
| This patch does _not_ create the task tree in the kernel. Instead it
| assumes that all tasks are created in some way and then invoke the
| restart syscall. You can use the userspace mktree.c program to do
| that.
|
| The init task (*) has a special role: it allocates the restart context
| (ctx), and coordinates the operation. In particular, it first waits
| until all participating tasks enter the kernel, and provides them the
| common restart context. Once everyone in ready, it begins to restart
| itself.
|
| In contrast, the other tasks enter the kernel, locate the init task (*)
| and grab its restart context, and then wait for their turn to restore.
|
| When a task (init or not) completes its restart, it hands the control
| over to the next in line, by waking that task.
|
| An array of pids (the one saved during the checkpoint) is used to
| synchronize the operation. The first task in the array is the init
| task (*). The restart context (ctx) maintain a "current position" in
| the array, which indicates which task is currently active. Once the
| currently active task completes its own restart, it increments that
| position and wakes up the next task.
|
| Restart assumes that userspace provides meaningful data, otherwise
| it's garbage-in-garbage-out. In this case, the syscall may block
| indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
| otherwise kill the stray restarting tasks.
|
| In terms of security, restart runs as the user the invokes it, so it
| will not allow a user to do more than is otherwise permitted by the
| usual system semantics and policy.
|
| Currently we ignore threads and zombies, as well as session ids.
| Add support for multiple processes
|
| (*) For containers, restart should be called inside a fresh container
| by the init task of that container. However, it is also possible to
| restart applications not necessarily inside a container, and without
| restoring the original pids of the processes (that is, provided that
| the application can tolerate such behavior). This is useful to allow
| multi-process restart of tasks not isolated inside a container, and
| also for debugging.
|
| Changelog[v14]:
| - Revert change to pr_debug(), back to cr_debug()
| - Discard field 'h.parent'
| - Check whether calls to cr_hbuf_get() fail
|
| Changelog[v13]:
| - Clear root_task->checkpoint_ctx regardless of error condition
| - Remove unused argument 'ctx' from do_restart_task() prototype
| - Remove unused member 'pids_err' from 'struct cr_ctx'
|
| Changelog[v12]:
| - Replace obsolete cr_debug() with pr_debug()
|
| Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
| Acked-by: Serge Hallyn <serue at us.ibm.com>
| ---
| checkpoint/restart.c | 224 +++++++++++++++++++++++++++++++++++++++++++-
| checkpoint/sys.c | 34 ++++++--
| include/linux/checkpoint.h | 24 ++++-
| include/linux/sched.h | 4 +
| 4 files changed, 272 insertions(+), 14 deletions(-)
|
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 96d4d45..adebc1c 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -10,6 +10,7 @@
|
| #include <linux/version.h>
| #include <linux/sched.h>
| +#include <linux/wait.h>
| #include <linux/file.h>
| #include <linux/magic.h>
| #include <linux/checkpoint.h>
| @@ -301,30 +302,245 @@ static int cr_read_task(struct cr_ctx *ctx)
| return ret;
| }
|
| +/* cr_read_tree - read the tasks tree into the checkpoint context */
| +static int cr_read_tree(struct cr_ctx *ctx)
| +{
| + struct cr_hdr_tree *hh;
| + int size, ret;
| +
| + hh = cr_hbuf_get(ctx, sizeof(*hh));
| + if (!hh)
| + return -ENOMEM;
| +
| + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
| + if (ret < 0)
| + goto out;
| +
| + ret = -EINVAL;
| + if (hh->tasks_nr < 0)
| + goto out;
| +
| + ctx->pids_nr = hh->tasks_nr;
| + size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
| + if (size < 0) /* overflow ? */
| + goto out;
| +
| + ctx->pids_arr = kmalloc(size, GFP_KERNEL);
| + if (!ctx->pids_arr) {
| + ret = -ENOMEM;
| + goto out;
| + }
| + ret = cr_kread(ctx, ctx->pids_arr, size);
| + out:
| + cr_hbuf_put(ctx, sizeof(*hh));
| + return ret;
| +}
| +
| +static int cr_wait_task(struct cr_ctx *ctx)
| +{
| + pid_t pid = task_pid_vnr(current);
| +
| + cr_debug("pid %d waiting\n", pid);
| + return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
| +}
| +
| +static int cr_next_task(struct cr_ctx *ctx)
| +{
| + struct task_struct *tsk;
| +
| + ctx->pids_pos++;
| +
| + cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
| + if (ctx->pids_pos == ctx->pids_nr) {
| + complete(&ctx->complete);
| + return 0;
| + }
| +
| + ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
| +
| + cr_debug("pids_next %d\n", ctx->pids_active);
| +
| + rcu_read_lock();
| + tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
| + if (tsk)
| + wake_up_process(tsk);
| + rcu_read_unlock();
| +
| + if (!tsk) {
| + complete(&ctx->complete);
| + return -ESRCH;
| + }
| +
| + return 0;
| +}
| +
| +/* FIXME: this should be per container */
| +DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
| +
| +static int do_restart_task(pid_t pid)
| +{
| + struct task_struct *root_task;
| + struct cr_ctx *ctx = NULL;
| + int ret;
| +
| + rcu_read_lock();
| + root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
| + if (root_task)
| + get_task_struct(root_task);
| + rcu_read_unlock();
| +
| + if (!root_task)
| + return -EINVAL;
| +
| + /*
| + * wait for container init to initialize the restart context, then
| + * grab a reference to that context, and if we're the last task to
| + * do it, notify the container init.
| + */
| + ret = wait_event_interruptible(cr_restart_waitq,
| + root_task->checkpoint_ctx);
| + if (ret < 0)
| + goto out;
| +
| + task_lock(root_task);
| + ctx = root_task->checkpoint_ctx;
| + if (ctx)
| + cr_ctx_get(ctx);
| + task_unlock(root_task);
| +
| + if (!ctx) {
| + ret = -EAGAIN;
| + goto out;
| + }
| +
| + if (atomic_dec_and_test(&ctx->tasks_count))
| + complete(&ctx->complete);
| +
| + /* wait for our turn, do the restore, and tell next task in line */
| + ret = cr_wait_task(ctx);
| + if (ret < 0)
| + goto out;
| + ret = cr_read_task(ctx);
| + if (ret < 0)
| + goto out;
| + ret = cr_next_task(ctx);
| +
| + out:
| + cr_ctx_put(ctx);
| + put_task_struct(root_task);
| + return ret;
| +}
| +
| +/**
| + * cr_wait_all_tasks_start - wait for all tasks to enter sys_restart()
| + * @ctx: checkpoint context
| + *
| + * Called by the container root to wait until all restarting tasks
| + * are ready to restore their state. Temporarily advertises the 'ctx'
| + * on 'current->checkpoint_ctx' so that others can grab a reference
| + * to it, and clears it once synchronization completes. See also the
| + * related code in do_restart_task().
| + */
| +static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
| +{
| + int ret;
| +
| + if (ctx->pids_nr == 1)
| + return 0;
| +
| + init_completion(&ctx->complete);
| + current->checkpoint_ctx = ctx;
| +
| + wake_up_all(&cr_restart_waitq);
| +
| + ret = wait_for_completion_interruptible(&ctx->complete);
| +
| + task_lock(current);
| + current->checkpoint_ctx = NULL;
| + task_unlock(current);
| +
| + return ret;
| +}
| +
| +static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
| +{
| + int ret;
| +
| + if (ctx->pids_nr == 1)
| + return 0;
| +
| + init_completion(&ctx->complete);
| +
| + ret = cr_next_task(ctx);
| + if (ret < 0)
| + return ret;
| +
| + ret = wait_for_completion_interruptible(&ctx->complete);
| + if (ret < 0)
| + return ret;
| +
| + return 0;
| +}
| +
| /* setup restart-specific parts of ctx */
| static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
| {
| + ctx->root_pid = pid;
| + ctx->root_task = current;
| + ctx->root_nsproxy = current->nsproxy;
| +
| + get_task_struct(ctx->root_task);
| + get_nsproxy(ctx->root_nsproxy);
| +
| + atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
| +
| return 0;
| }
|
| -int do_restart(struct cr_ctx *ctx, pid_t pid)
| +static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
| {
| int ret;
|
| + ret = cr_read_head(ctx);
| + if (ret < 0)
| + goto out;
| + ret = cr_read_tree(ctx);
| + if (ret < 0)
| + goto out;
| +
| ret = cr_ctx_restart(ctx, pid);
| if (ret < 0)
| goto out;
| - ret = cr_read_head(ctx);
| +
| + /* wait for all other tasks to enter do_restart_task() */
| + ret = cr_wait_all_tasks_start(ctx);
| if (ret < 0)
| goto out;
| +
| ret = cr_read_task(ctx);
| if (ret < 0)
| goto out;
| - ret = cr_read_tail(ctx);
| +
| + /* wait for all other tasks to complete do_restart_task() */
| + ret = cr_wait_all_tasks_finish(ctx);
| if (ret < 0)
| goto out;
|
| - /* on success, adjust the return value if needed [TODO] */
| + ret = cr_read_tail(ctx);
| +
| out:
| return ret;
| }
| +
| +int do_restart(struct cr_ctx *ctx, pid_t pid)
| +{
| + int ret;
| +
| + if (ctx)
| + ret = do_restart_root(ctx, pid);
| + else
| + ret = do_restart_task(pid);
| +
| + /* on success, adjust the return value if needed [TODO] */
| + return ret;
| +}
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index 8630144..3a925ae 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -167,6 +167,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
|
| static void cr_ctx_free(struct cr_ctx *ctx)
| {
| + BUG_ON(atomic_read(&ctx->refcount));
| +
| if (ctx->file)
| fput(ctx->file);
|
| @@ -185,6 +187,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
| if (ctx->root_task)
| put_task_struct(ctx->root_task);
|
| + kfree(ctx->pids_arr);
| +
| kfree(ctx);
| }
|
| @@ -199,8 +203,10 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|
| ctx->flags = flags;
|
| + atomic_set(&ctx->refcount, 0);
| INIT_LIST_HEAD(&ctx->pgarr_list);
| INIT_LIST_HEAD(&ctx->pgarr_pool);
| + init_waitqueue_head(&ctx->waitq);
|
| err = -EBADF;
| ctx->file = fget(fd);
| @@ -215,6 +221,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
| if (cr_objhash_alloc(ctx) < 0)
| goto err;
|
| + atomic_inc(&ctx->refcount);
| return ctx;
|
| err:
| @@ -222,6 +229,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
| return ERR_PTR(err);
| }
|
| +void cr_ctx_get(struct cr_ctx *ctx)
| +{
| + atomic_inc(&ctx->refcount);
| +}
| +
| +void cr_ctx_put(struct cr_ctx *ctx)
| +{
| + if (ctx && atomic_dec_and_test(&ctx->refcount))
| + cr_ctx_free(ctx);
| +}
| +
| /**
| * sys_checkpoint - checkpoint a container
| * @pid: pid of the container init(1) process
| @@ -251,7 +269,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
| if (!ret)
| ret = ctx->crid;
|
| - cr_ctx_free(ctx);
| + cr_ctx_put(ctx);
| return ret;
| }
|
| @@ -266,7 +284,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
| */
| asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
| {
| - struct cr_ctx *ctx;
| + struct cr_ctx *ctx = NULL;
| pid_t pid;
| int ret;
|
| @@ -274,15 +292,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
| if (flags)
| return -EINVAL;
|
| - ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| - if (IS_ERR(ctx))
| - return PTR_ERR(ctx);
| -
| /* FIXME: for now, we use 'crid' as a pid */
| pid = (pid_t) crid;
|
| + if (pid == task_pid_vnr(current))
| + ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| +
| + if (IS_ERR(ctx))
| + return PTR_ERR(ctx);
| +
| ret = do_restart(ctx, pid);
|
| - cr_ctx_free(ctx);
| + cr_ctx_put(ctx);
| return ret;
| }
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index c946320..cede30e 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -12,8 +12,11 @@
|
| #include <linux/path.h>
| #include <linux/fs.h>
| +#include <linux/path.h>
| +#include <linux/sched.h>
| +#include <asm/atomic.h>
|
| -#define CR_VERSION 2
| +#define CR_VERSION 3
|
| struct cr_ctx {
| int crid; /* unique checkpoint id */
| @@ -31,8 +34,7 @@ struct cr_ctx {
| void *hbuf; /* temporary buffer for headers */
| int hpos; /* position in headers buffer */
|
| - struct task_struct **tasks_arr; /* array of all tasks in container */
| - int tasks_nr; /* size of tasks array */
| + atomic_t refcount;
|
| struct cr_objhash *objhash; /* hash for shared objects */
|
| @@ -40,6 +42,19 @@ struct cr_ctx {
| struct list_head pgarr_pool; /* pool of empty page arrays chain */
|
| struct path fs_mnt; /* container root (FIXME) */
| +
| + /* [multi-process checkpoint] */
| + struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
| + int tasks_nr; /* size of tasks array */
| +
| + /* [multi-process restart] */
| + struct cr_hdr_pids *pids_arr; /* array of all pids [restart] */
| + int pids_nr; /* size of pids array */
Nit: Since we already have a pid_nr() that refers to something different,
can we call this 'nr_pids' (and nr_tasks above) like mm_context->nr_threads ?
Of course, there is no convention, so its easy to argue the other way.
Secondly, isn't pids_nr same as tasks_nr ? If so do we need both ?
Or is this intended to address the issue of multiple pid_nr values that a
task in a nested container can have ? If so, pids_nr is > tasks_nr and that
brings up two comments :-)
First, mktree.c and cr_next_task() are using 'ctx->pids_nr' to determine how
many tasks to start. If we are talking about nested containers, pids_nr
will be greater than tasks_nr so, mktree and cr_next_task() should be
use 'ctx->tasks_nr' to determine how many tasks to create. Also if
checkpointing a nested container we should view the multiple nested pid
values a process as an attribute of the task and maybe save them in
cr_write_task() rather than in cr_write_tree().
My second comment is more an orthogonal question. Suppose init_pid_ns = level
0 and we have a container that is nested at level 3. If we checkpoint just
this container, we would want to be able to restore this container at any level
> 0 right ?
| + int pids_pos; /* position pids array */
| + pid_t pids_active; /* pid of (next) active task */
Do we need both pids_pos and pids_active in the ctx ? Can pids_active
just be a local variable in cr_next_task() and cr_wait_task() ?
IOW, isn't this always true
pids_arr[pids_pos] == pids_active
| + atomic_t tasks_count; /* sync of tasks: used to coordinate */
Name is a bit confusing with 'tasks_nr', but the comment helps and I can't
think of a better name.
| + struct completion complete; /* container root and other tasks on */
| + wait_queue_head_t waitq; /* start, end, and restart ordering */
| };
Sukadev
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
More information about the Devel
mailing list