[Devel] Re: [PATCH 4/5] c/r: checkpoint and restart pids objects
Oren Laadan
orenl at cs.columbia.edu
Sat Feb 5 14:21:43 PST 2011
Suka,
Thanks for the review.
On 02/05/2011 04:43 PM, Sukadev Bhattiprolu wrote:
> Oren:
>
> I am still reviewing this patchset, but have a few questions/comments
> below on this patch.
>
> | From: Oren Laadan <orenl at cs.columbia.edu>
> | Subject: [PATCH 4/5] c/r: checkpoint and restart pids objects
> |
> | Make use of (shared) pids objects instead of simply saving the pid_t
> | numbers in both checkpoint and restart.
> |
> | The motivation for this change is twofold. First, since pid-ns came to
> | life pid's in the kenrel _are_ shared objects and should be treated as
> | such. This is useful e.g. for tty handling and also file-ownership
> | (the latter waiting for this feature). Second, to properly support
> | nested namesapces we need to report with each pid the entire list of
> | pid numbers, not only a single pid. While current we do that for all
> | "live" pids (those that belong to live tasks), we didn't do it for
> | "dead" pids (to be assigned to ghost restarting tasks).
> |
> | Note, that ideally the list of vpids of a pid object should also
> | include the pid-ns to which each level belongs; however, in this patch
> | we don't yet hanlde that. So only linear pid-nesting works well and
> | not arbitrary tree.
> |
> | DICLAIMER: this patch is big and intrusive! Here is a summary of the
> | changes that it makes:
[...]
> | diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> | index 922eff0..c0a548a 100644
> | --- a/include/linux/checkpoint_hdr.h
> | +++ b/include/linux/checkpoint_hdr.h
> | @@ -107,7 +107,9 @@ enum {
> | CKPT_HDR_SECURITY,
> | #define CKPT_HDR_SECURITY CKPT_HDR_SECURITY
> |
> | - CKPT_HDR_TREE = 101,
> | + CKPT_HDR_PIDS = 101,
> | +#define CKPT_HDR_PIDS CKPT_HDR_PIDS
> | + CKPT_HDR_TREE,
> | #define CKPT_HDR_TREE CKPT_HDR_TREE
> | CKPT_HDR_TASK,
> | #define CKPT_HDR_TASK CKPT_HDR_TASK
> | @@ -358,20 +360,32 @@ struct ckpt_hdr_container {
> | */
> | } __attribute__((aligned(8)));;
> |
> | +/* pids array */
> | +struct ckpt_hdr_pids {
> | + struct ckpt_hdr h;
> | + __u32 nr_pids;
> | + __u32 nr_vpids;
> | +} __attribute__((aligned(8)));
>
> For consistency can we call this ckpt_hdr_pids_tree ?
'struct ckpt_hdr_pids' and 'struct ckpt_pids' are related, and
do not provide information about the process tree. See also,
for example, 'struct ckpt_eventpoll_item' and the .._hdr_.. one.
>
> | +
> | +struct ckpt_pids {
> | + __u32 depth;
> | + __s32 numbers[1];
> | +} __attribute__((aligned(8)));
> | +
>
> This actually corresponds to _one_ 'struct pid' right ? Can we rename to
> 'struct ckpt_pid' or ckpt_struct_pid ?
Yes, it corresponds to a single pid object, which has _multiple_
(rather than a single) pids numbers --> hence the name.
>
> | /* task tree */
> | struct ckpt_hdr_tree {
> | struct ckpt_hdr h;
> | - __s32 nr_tasks;
> | + __u32 nr_tasks;
> | } __attribute__((aligned(8)));
>
> And this to, ckpt_hdr_task_tree ?
Ok.
[...]
> | + for (n = 0; n < ctx->nr_tasks; n++) {
> | + task = ctx->tasks_arr[n];
> | +
> | + rcu_read_lock();
> | + pid = get_pid(task_pid(task));
> | + tgid = get_pid(task_tgid(task));
> | + pgrp = get_pid(task_pgrp(task));
> | + session = get_pid(task_session(task));
> | + rcu_read_unlock();
> | +
> | + /*
> | + * How to handle references to pids outside our pid-ns ?
> | + * In container checkpoint, such pids are prohibited, so
> | + * we report an error.
> | + * In subtree checkpoint it is valid, however, we don't
> | + * collect them here to not leak data (it is irrelevant
> | + * to userspace anyway), Instead, in checkpoint_tree() we
> | + * substitute 0 for the such pgrp/session entries.
> | + */
> | +
> | + /* pid */
> | + ret = ckpt_obj_lookup_add(ctx, pid,
> | + CKPT_OBJ_PID, &new);
> | + if (ret >= 0 && new) {
> | + depth += pid->level - root_pidns->level;
>
> 'depth' here was a bit confusing to me. We are really counting of the
> number of vpids. So, can you rename 'depth' to nr_pids ?
So either 'vpids', or 'levels'. The problem with 'nr_pids' is that
it's ambiguous: could be number of pid-objects, or pid-numbers.
>
> (i.e if you find a process with pid and tgid two levels deep, it initially
> appeared that the depth would be 4. But the depth is still 2 and the number
> of vpids is 4 right ?)
Yes, it is summing the depths.
> | + ret = flex_array_put(pids_arr, i++, pid, GFP_KERNEL);
> | + new = 0;
> | + }
> | +
> | + /* tgid: if tgid != pid) */
> | + if (ret >= 0 && tgid != pid)
> | + ret = ckpt_obj_lookup_add(ctx, tgid,
> | + CKPT_OBJ_PID, &new);
> | + if (ret >= 0 && new) {
> | + depth += tgid->level - root_pidns->level;
> | + ret = flex_array_put(pids_arr, i++, tgid, GFP_KERNEL);
> | + new = 0;
> | + }
> | +
> | + /*
> | + * pgrp: if in our pid-namespace, and
> | + * if pgrp != tgid, and if pgrp != root_session
> | + */
> | + if (pid_nr_ns(pgrp, root_pidns) == 0) {
> | + /* pgrp must be ours in container checkpoint */
> | + if (!(ctx->uflags & CHECKPOINT_SUBTREE))
> | + ret = -EBUSY;
> | + } else if (ret >= 0 && pgrp != tgid && pgrp != root_session)
> | + ret = ckpt_obj_lookup_add(ctx, pgrp,
> | + CKPT_OBJ_PID, &new);
> | + if (ret >= 0 && new) {
> | + depth += pgrp->level - root_pidns->level;
> | + ret = flex_array_put(pids_arr, i++, pgrp, GFP_KERNEL);
> | + new = 0;
> | + }
> | +
> | + /*
> | + * session: if in our pid-namespace, and
> | + * if session != tgid, and if session != root_session
> | + */
> | + if (pid_nr_ns(session, root_pidns) == 0) {
> | + /* session must be ours in container checkpoint */
> | + if (!(ctx->uflags & CHECKPOINT_SUBTREE))
> | + ret = -EBUSY;
> | + } else if (ret >= 0 && session != tgid && session != root_session)
> | + ret = ckpt_obj_lookup_add(ctx, session,
> | + CKPT_OBJ_PID, &new);
> | + if (ret >= 0 && new) {
> | + depth += session->level - root_pidns->level;
> | + ret = flex_array_put(pids_arr, i++, session, GFP_KERNEL);
> | + }
> | +
> | + put_pid(pid);
> | + put_pid(tgid);
> | + put_pid(pgrp);
> | + put_pid(session);
>
> We save the pid pointers in the flex_array right ? If we put the references
> here, the pointers in flex_array don't have a reference, so the pid pointer
> access in checkpoint_pids_dump() is unsafe ?
>
> Or is it that the process tree is frozen so the pid won't go away ? If
> so do we need the get_pid() and put_pid() in this function ?
We get a reference inside the rcu_read_lock() so that we could safely
access them after we drop the lock. Then we add each (new) pid to the
objhash - which will take another reference to it. Finally we drop
the local reference no longer needed.
I'll a comment to make it clear.
> | +
> | + if (ret < 0)
> | + break;
> | + }
> | +
> | + *nr_pids = i;
> | + *nr_vpids = depth;
> | +
> | + ckpt_debug("nr_pids = %d, nr_vpids = %d\n", i, depth);
> | + return ret;
> | +}
> | +
> | +static int checkpoint_pids_dump(struct ckpt_ctx *ctx,
> | + struct flex_array *pids_arr,
> | + int nr_pids, int nr_vpids)
> | +{
> | + struct ckpt_hdr_pids *hh;
> | + struct ckpt_pids *h;
> | + struct pid *pid;
> | + char *buf;
> | + int root_level;
> | + int len, pos;
> | + int depth = 0;
>
> Here too, using 'depth' to count nr_vpids is a bit confusing :-)
Ok - will change as above.
>
> | + int i, n = 0;
> | + int ret;
> | +
> | + hh = ckpt_hdr_get_type(ctx, sizeof(*hh), CKPT_HDR_PIDS);
> | + if (!hh)
> | + return -ENOMEM;
> | +
> | + hh->nr_pids = nr_pids;
> | + hh->nr_vpids = nr_vpids;
> | +
> | + ret = ckpt_write_obj(ctx, &hh->h);
> | + ckpt_hdr_put(ctx, hh);
> | + if (ret < 0)
> | + return ret;
> | +
> | + pos = (nr_pids * sizeof(*h)) + (nr_vpids * sizeof(__s32));
> | + ret = ckpt_write_obj_type(ctx, NULL, pos, CKPT_HDR_BUFFER);
> | + if (ret < 0)
> | + return ret;
> | +
> | + buf = ckpt_hdr_get(ctx, PAGE_SIZE);
> | + if (!buf)
> | + return -ENOMEM;
> | +
> | + root_level = ctx->root_nsproxy->pid_ns->level;
> | +
> | + while (n < nr_pids) {
> | + pos = 0;
> | +
> | + rcu_read_lock();
> | + while (1) {
> | + pid = flex_array_get(pids_arr, n);
> | + len = sizeof(*h) + pid->level * sizeof(__s32);
>
> Hmm. pid->level is the global level here right ? So if we checkpoint a
> container 2 levels deep, we don't need to save the vpids for levels 0,1.
> do we ? Or should we s/pid->level/(pid->level - root->level)/ (like
> we do for h->depth below ?
The latter. Good catch !
>
> | +
> | + /* need to flush current buffer ? */
> | + if (pos + len > PAGE_SIZE || n == nr_pids)
> | + break;
> | +
> | + h = (struct ckpt_pids *) &buf[pos];
> | + h->depth = pid->level - root_level;
> | + for (i = 0; i <= h->depth; i++)
> | + h->numbers[i] = pid->numbers[pid->level + i].nr;
> | + depth += h->depth;
> | + pos += len;
> | + n++;
> | + }
> | + rcu_read_unlock();
> | +
> | + /* something must have changed since last count... */
> | + if (depth > nr_vpids) {
> | + ret = -EBUSY;
> | + break;
> | + }
> | +
> | + ret = ckpt_kwrite(ctx, buf, pos);
> | + if (ret < 0)
> | + break;
>
> Do we need to memset(buf, 0, sizeof(buf)) here ? Specially if we expect
> to fill 0s in ancestor pid namespaces (in the above example of
> checkpointing a container 2 levels deep, do we want to write zeros for
> the pid in levels 0,1) ?
We shouldn't need it - assuming we fix the above as noted.
Thanks,
Oren.
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
More information about the Devel
mailing list