[Devel] Re: [PATCH linux-cr] nsproxy: record ambient namespaces
Oren Laadan
orenl at cs.columbia.edu
Tue Mar 2 13:20:02 PST 2010
Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl at cs.columbia.edu):
>> Applied.
>>
>> Serge E. Hallyn wrote:
>>> The nsproxy restore path recognizes that an objref of 0 for
>>> ipc or uts ns means don't unshare it. But the checkpoint side
>>> forgot to write down 0 when the ipc or uts ns isn't unshared!
>>>
>>> Fix that.
>>>
>>> To test, run a program with a private pidns but shared utsns
>>> which does
>>>
>>> sleep(5);
>>> sethostname("serge", 6);
>>>
>>> checkpoint it, reset your hostname (if you let the program
>>> complete), then restart the program: without this patch, it
>>> will not reset your hostname. It should, and with this patch
>>> it will.
>>>
>>> Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
>>> ---
>>> kernel/nsproxy.c | 19 +++++++++++++------
>>> 1 files changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>>> index 0da0d83..dcb502c 100644
>>> --- a/kernel/nsproxy.c
>>> +++ b/kernel/nsproxy.c
>>> @@ -280,13 +280,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
>>> if (!h)
>>> return -ENOMEM;
>>> - ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
>>> - if (ret <= 0)
>>> - goto out;
>>> + ret = 0;
>>> + if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns) {
>>> + ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
>>> + if (ret <= 0)
>>> + goto out;
>>> + }
>>> h->uts_objref = ret;
>>> - ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
>>> - if (ret < 0)
>>> - goto out;
>>> +
>>> + ret = 0;
>>> + if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns) {
>>> + ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
>>> + if (ret < 0)
>>> + goto out;
>>> + }
>>> h->ipc_objref = ret;
>>> /* FIXME: for now, only marked visited to pacify leaks */
>
> All right, this patch was not right. What we should be checking
> is whether nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns.
> But I don't want to just send the patch to do that until we discuss
> whether that is the right thing to do.
>
> Let me give a precise definition: I call an 'ambient namespace' a
> namespace which was not unshared when the container was created.
> Unfortunately there isn't really a reliable way to tell whether that
> was the case. Checking container_init->parent may depend upon the
> container init not having been reparented.
Hmm... yeah, I should have looked at it more carefully -
My original idea is that someone (e.g. userspace) could zero out,
e.g., the h->uts_objref, and that way allow a restart to "inherit"
the uts-ns of the parent.
I didn't do it at checkpoint because (a) I wanted to allow
flexibility by letting the user choose later, and (b) as you pointed
out already, it's hard to figure out this property at checkpoint
anyway.
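
A minimal sketch of the userspace step described above, assuming the restore path treats an objref of 0 as "don't unshare", as the patch description says. The struct below is a hypothetical stand-in for the namespace header in the image, not the real on-disk layout, and the flag names are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the ns header record; only the two
 * objref fields discussed in this thread are modelled. */
struct ns_hdr_sketch {
	int uts_objref;
	int ipc_objref;
};

/* Illustrative flags (not part of any real interface). */
enum {
	INHERIT_UTS = 1 << 0,
	INHERIT_IPC = 1 << 1,
};

/* Zero the selected objrefs so that restart would inherit the
 * corresponding namespaces from the parent instead of unsharing. */
void request_inherit(struct ns_hdr_sketch *h, unsigned int flags)
{
	assert(h != NULL);
	if (flags & INHERIT_UTS)
		h->uts_objref = 0;
	if (flags & INHERIT_IPC)
		h->ipc_objref = 0;
}
```

A filter of this kind would run over the checkpoint image between checkpoint and restart, which is what leaves the choice to the user.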
Using leak detection is tricky, because if we are doing full
container checkpoint, we disallow leaks anyway, and if we are
doing a subtree, then leaks are allowed.
>
> So as I see it we can do three things:
>
> 1. always unshare any namespace which was not empty at checkpoint.
> So if the container was not unshared from host, and we checkpoint
> members of that namespace, then at restart we will restart in an
> unshared namespace and recreate the objects. That basically means
> undo the patch I originally sent.
>
> That means that if the restarted task does 'hostname' it may end
> up not affecting the host's hostname, even if it was originally
> started on the host without separate utsns. Maybe that's what we
> want?
>
This is the default we have used so far, and I'm quite happy with
it.
> 2. use the simple 'nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns'
> test. I think that would be pretty reliable.
>
> 3. for each namespace in ctx->root_nsproxy, check whether there are
> any leaks, and, if so, mark it in the checkpoint image header so that
> we can give restart a hint that it might not want to unshare those.
If we leave some work to userspace anyway, an alternative to doing
the accounting in the kernel (remember: this scenario only makes
sense for non-container checkpoint), is to simply also save the ref
count of the nsproxy with the nsproxy data (not even each namespace).
Then user space can figure out if there is a "leak".
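
The userspace check suggested above could look roughly like the following sketch. It assumes the image records one reference count per nsproxy, that the count equals the number of tasks using that nsproxy at checkpoint time (no extra references held by the checkpoint itself), and that userspace can list the nsproxy objref of every task in the image. The struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of what userspace reads back for one nsproxy. */
struct saved_nsproxy {
	int objref;        /* objref of this nsproxy in the image */
	int saved_refcnt;  /* reference count recorded at checkpoint (assumed field) */
};

/* Count the tasks in the image whose nsproxy objref matches. */
int count_image_users(const int *task_ns_objrefs, size_t ntasks, int objref)
{
	int users = 0;

	for (size_t i = 0; i < ntasks; i++)
		if (task_ns_objrefs[i] == objref)
			users++;
	return users;
}

/* A "leak" here means some task outside the checkpointed set also
 * held the nsproxy: the saved count exceeds the users in the image. */
int nsproxy_has_leak(const struct saved_nsproxy *ns,
		     const int *task_ns_objrefs, size_t ntasks)
{
	assert(ns != NULL);
	return ns->saved_refcnt >
	       count_image_users(task_ns_objrefs, ntasks, ns->objref);
}
```

Since the count is per nsproxy rather than per namespace, a positive result only says that something outside the image shared the nsproxy; it does not say which namespace is the shared one.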
Finally, if we do want to allow such a leak (e.g. only the uts-ns
of the root) in a full container checkpoint, then we will need some
way (flag ?) to request that when doing the checkpoint.
So for now, I simply revert the patch (unless you object).
Oren.
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers