[Devel] Re: [PATCH linux-cr] nsproxy: record ambient namespaces

Oren Laadan orenl at cs.columbia.edu
Tue Mar 2 13:20:02 PST 2010



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl at cs.columbia.edu):
>> Applied.
>>
>> Serge E. Hallyn wrote:
>>> The nsproxy restore path recognizes that an objref of 0 for
>>> ipc or uts ns means don't unshare it.  But the checkpoint side
>>> forgot to write down 0 when the ipc or uts ns isn't unshared!
>>>
>>> Fix that.
>>>
>>> To test, run a program with a private pidns but shared utsns
>>> which does
>>>
>>> 	sleep(5);
>>> 	sethostname("serge", 6);
>>>
>>> checkpoint it, reset your hostname (if you let the program
>>> complete), then restart the program: without this patch, it
>>> will not reset your hostname.  It should, and with this patch
>>> it will.
>>>
>>> Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
>>> ---
>>> kernel/nsproxy.c |   19 +++++++++++++------
>>> 1 files changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>>> index 0da0d83..dcb502c 100644
>>> --- a/kernel/nsproxy.c
>>> +++ b/kernel/nsproxy.c
>>> @@ -280,13 +280,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
>>> 	if (!h)
>>> 		return -ENOMEM;
>>> -	ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
>>> -	if (ret <= 0)
>>> -		goto out;
>>> +	ret = 0;
>>> +	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns) {
>>> +		ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
>>> +		if (ret <= 0)
>>> +			goto out;
>>> +	}
>>> 	h->uts_objref = ret;
>>> -	ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
>>> -	if (ret < 0)
>>> -		goto out;
>>> +
>>> +	ret = 0;
>>> +	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns) {
>>> +		ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +	}
>>> 	h->ipc_objref = ret;
>>> 	/* FIXME: for now, only marked visited to pacify leaks */
> 
> All right, tihs patch was not right.  What we should be checking
> is whether nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns.
> But I don't want to just send the patch to do that until we discuss
> whether that is the right thing to do.
> 
> Let me give a precise definition:  I call an 'ambient namespace' a
> namespace which was not unshared when the container was created.
> Unfortunately there isn't really a reliable way to tell whether that
> was the case.  Checking container_init->parent may depend upon the
> container init not having been reparented.

Hmm... yeah, I should have looked at it more carefully -

My original idea is that someone (e.g. userspace) could zero out,
e.g., the h->uts_objref, and that way allow a restart to "inherit"
the uts-ns of the parent.

I didn't not do it at checkpoint because (a) I wanted to allow
flexibility by letting the user choose later, and (b) as you pointed
out already, it's hard to figure out this property at checkpoint
anyway.

Using a leak detection is tricky, because if we are doing full
container checkpoint, we disallow leaks anyway, and if we are
doing a subtree, then leaks are allowed.

> 
> So as I see it we can do three things:
> 
> 1. always unshare any namespace which was not empty at checkpoint.
> So if the container was not unshared from host, and we checkpoint
> members of that namespace, then at restart we will restart in an
> unshared namespace and recreate the objects.  That basically means
> undo the patch I originally sent.
> 
> That means that if the restarted task does 'hostname' it may end
> up not affecting the hosts's hostname, even if it was originally
> started on the host without separate utsns.  Maybe that's what we
> want?
> 

This is the default we have used so far, and I'm quite happy with
it.

> 2. use the simple 'nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns'
> test.  I think that would be pretty reliable.
> 
> 3. for each namespace in ctx->root_nsproxy, check whether there are
> any leaks, and, if so, mark it in the checkpoing image header so that
> we can give restart a hint that it might not want to unshare those.

If we leave some work to userspace anyway, an alternative to doing
the accounting in the kernel (remember: this scenario only makes
sense for non-container checkpoint), is to simply also save the ref
count of the nsproxy with the nsproxy data (not even each namespace).
Then user space can figure out if there is a "leak".

Finally, if we do want to allow such a leak (e.g. only the uts-ns
of the root) in a full container checkpoint, then we will need some
way (flag ?) to request that when doing the checkpoint.

So for now, I simply revert the patch (unless you object).

Oren.
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers




More information about the Devel mailing list