[Devel] Re: [PATCH] [RFC] c/r: Add UTS support
Serge E. Hallyn
serge at hallyn.com
Sat Mar 21 07:51:00 PDT 2009
Quoting Eric W. Biederman (ebiederm at xmission.com):
> "Serge E. Hallyn" <serge at hallyn.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm at xmission.com):
> >> > What is wrong with Alexey's patch, which simply passes in the values
> >> > themselves? Do you have another use in mind for the min/max pid
> >> > values?
> >>
> >> At an implementation level (and I need to look at Alexey's specific patch)
> >> every patch I have seen to date creates their own version of alloc_pidmap.
> >
> > You're right, Alexey's patch creates a new one.
> >
> >> alloc_pidmap already implicitly takes min/max and first value to try
> >> as parameters. RESERVED_PIDS, pid_max, and pid_ns->last_pid. So
> >> instead of rewriting alloc_pidmap we should just be able to refactor
> >> alloc_pidmap to take the requisite values. That should be less code
> >> and easier to maintain.
> >
> > Yeah, that sounds good actually. Thanks.
> >
> >> Looking at the current implementation we also have the issue that
> >> pid_max is not per pid namespace. Where it seems to belong.
> >
> > Eh. It does seem to, but otoh why give userspace knobs it has no use
> > for... Or, can you think of a case where it'd be useful?
>
> In general the number of usable pid numbers should be larger in the outer
> pid namespace than in the child pid namespace. Otherwise it is possible
> for the child to eat all of the possible pid numbers.
>
> So I think it would be advantageous to make containers designed to migrate
> have a small pid_max by default so we know we won't overwhelm others.
>
> Furthermore, since pid_max is a limit on the identifiers allocated, not on the
> number of processes, it is very much a pid namespace property.
Right, I don't argue that it doesn't seem to belong there. Well if
you think people would use it, it does seem simple enough to do.
Untested (well compile-tested) patch below just for grins.
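
For anyone following along, Eric's suggestion quoted above (have
alloc_pidmap take the min, max and starting point as parameters rather
than reading RESERVED_PIDS, pid_max and pid_ns->last_pid implicitly) can
be sketched in user space roughly as follows. This is only a model under
stated assumptions, not kernel code: struct id_map, MODEL_MAX_IDS,
id_map_init(), release_id() and alloc_id_range() are hypothetical names
made up for illustration, and a flat byte-array bitmap stands in for the
per-page pidmap.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MODEL_MAX_IDS 1024	/* hypothetical cap, for this model only */

/* Hypothetical user-space stand-in for one namespace's pid bitmap. */
struct id_map {
	unsigned char used[MODEL_MAX_IDS / CHAR_BIT];
	int last;		/* last id handed out, like pid_ns->last_pid */
};

/* Start with every id free and nothing handed out yet. */
static void id_map_init(struct id_map *m)
{
	memset(m, 0, sizeof(*m));
}

/* Mark id as used; return nonzero if it was already taken. */
static int id_test_and_set(struct id_map *m, int id)
{
	unsigned char bit = (unsigned char)(1u << (id % CHAR_BIT));
	int was_set = (m->used[id / CHAR_BIT] & bit) != 0;

	m->used[id / CHAR_BIT] |= bit;
	return was_set;
}

/* Give an id back to the map. */
static void release_id(struct id_map *m, int id)
{
	assert(id >= 0 && id < MODEL_MAX_IDS);
	m->used[id / CHAR_BIT] &= (unsigned char)~(1u << (id % CHAR_BIT));
}

/*
 * Return the next free id after m->last, or -1 if none is free.
 * The scan starts at last + 1 and wraps to min once it reaches max,
 * so ids below min are only handed out before the first wrap.
 */
static int alloc_id_range(struct id_map *m, int min, int max)
{
	int id, tries;

	assert(min >= 0 && min < max && max <= MODEL_MAX_IDS);
	assert(m->last >= 0);

	id = m->last + 1;
	if (id >= max)
		id = min;

	/* At most max probes are needed to visit every candidate once. */
	for (tries = 0; tries < max; tries++) {
		if (!id_test_and_set(m, id)) {
			m->last = id;
			return id;
		}
		if (++id >= max)
			id = min;
	}
	return -1;
}
```

With min = 3 and max = 6 on a fresh map, the model hands out 1 through 5
in order and then fails until an id is released. A checkpoint/restart
caller could pass a narrow [min, max) window to request one specific id,
which is the reuse Eric is hinting at.
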
> >> > I think that's a good guideline, bad rule. Certainly possible
> >> > that you're right that this is just pointing to in-kernel
> >> > recreation of process tree as the way to go. I was getting
> >> > that feeling myself, but then there are still very good reasons
> >> > not to do that, as there are things which each task should do
> >> > before completing sys_restart() which are best done in userspace.
> >> > These include for instance creating virtual nics, and calling
> >> > Oren's suggested 'cr_advise()' system calls.
> >>
> >> You might be right. I am behind on that part of the conversation.
> >>
> >> My general concern is that dividing up the responsibilities between user space
> >> and kernel space seems harder to maintain, and refactor if we don't get something
> >> right the first time.
> >
> > So far we're actually still at the point where the code (Oren's set)
> > could go either way. A small patch from Alexey can make it swing toward
> > kernel, while Oren's mktree.c userspace restart program swings the other
> > way.
> >
> > And since we're punting on any nested namespaces it actually may stay that way
> > for awhile.
>
> Interesting. That sounds fairly fundamental. If I have some free time I will
> have to take a look. I'm in favor of a kernel/user space cooperation but I don't
> currently see the benefit of forking processes in user space.
All right I'll wait for you to take a look, rather than repeat
myself :) The biggest concern IMO is how to create complicated
resources (like a veth tunnel pair) in the kernel case.
thanks,
-serge
From 47303d729ec494add03fbddb47fac9a020d65f00 Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serue at us.ibm.com>
Date: Sat, 21 Mar 2009 09:22:26 -0500
Subject: [PATCH 1/1] pid_ns: make pid_max a pid_ns property
Remove the pid_max global, and make it a property of the
pid_namespace. When a pid_ns is created, it inherits
the parent's pid_max.
Fixing up sysctl (trivial, akin to the ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.
Signed-off-by: Serge Hallyn <serue at us.ibm.com>
---
include/linux/pid_namespace.h | 1 +
kernel/pid.c | 14 +++++++-------
kernel/pid_namespace.c | 6 ++++--
kernel/sysctl.c | 4 ++--
4 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..fd7f497 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -30,6 +30,7 @@ struct pid_namespace {
#ifdef CONFIG_BSD_PROCESS_ACCT
struct bsd_acct_struct *bacct;
#endif
+ int pid_max;
};
extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index 1b3586f..898fa8b 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
static int pidhash_shift;
struct pid init_struct_pid = INIT_STRUCT_PID;
-int pid_max = PID_MAX_DEFAULT;
-
#define RESERVED_PIDS 300
int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
.last_pid = 0,
.level = 0,
.child_reaper = &init_task,
+ .pid_max = PID_MAX_DEFAULT,
};
EXPORT_SYMBOL_GPL(init_pid_ns);
@@ -128,11 +127,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
struct pidmap *map;
pid = last + 1;
- if (pid >= pid_max)
+ if (pid >= pid_ns->pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (pid_ns->pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE
+ - !offset;
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page)) {
void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -164,11 +164,11 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
- (i != max_scan || pid < last ||
+ } while (offset < BITS_PER_PAGE && pid < pid_ns->pid_max
+ && (i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pid_ns->pidmap[(pid_ns->pid_max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fab8ea8..1ba3970 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -67,15 +67,17 @@ err_alloc:
return NULL;
}
-static struct pid_namespace *create_pid_namespace(unsigned int level)
+static struct pid_namespace *create_pid_namespace(struct pid_namespace *old)
{
struct pid_namespace *ns;
+ unsigned int level = old->level + 1;
int i;
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
if (ns == NULL)
goto out;
+ ns->pid_max = old->pid_max;
ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!ns->pidmap[0].page)
goto out_free;
@@ -125,7 +127,7 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
if (flags & CLONE_THREAD)
goto out_put;
- new_ns = create_pid_namespace(old_ns->level + 1);
+ new_ns = create_pid_namespace(old_ns);
if (!IS_ERR(new_ns))
new_ns->parent = get_pid_ns(old_ns);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c5ef44f..8af16bd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -48,6 +48,7 @@
#include <linux/acpi.h>
#include <linux/reboot.h>
#include <linux/ftrace.h>
+#include <linux/pid_namespace.h>
#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -74,7 +75,6 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
-extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
@@ -643,7 +643,7 @@ static struct ctl_table kern_table[] = {
{
.ctl_name = KERN_PIDMAX,
.procname = "pid_max",
- .data = &pid_max,
+ .data = &init_pid_ns.pid_max,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
--
1.5.6.3
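
The inheritance the patch sets up in create_pid_namespace() can be
modelled in user space as below. This is a sketch under assumptions, not
part of the patch: struct model_pid_ns and both functions are
hypothetical, MODEL_PID_MAX_DEFAULT is assumed to be 32768, and the
setter's policy (a child may not raise its limit above its parent's) is
only one possible reading of Eric's point that the outer namespace
should have more usable pids than the child. The patch itself adds no
per-namespace setter; the sysctl still points at init_pid_ns.pid_max.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PID_MAX_DEFAULT 32768	/* assumed default, for this model */
#define MODEL_PID_MAX_MIN     301	/* RESERVED_PIDS + 1, as in kernel/pid.c */

/* Hypothetical, trimmed-down stand-in for struct pid_namespace. */
struct model_pid_ns {
	struct model_pid_ns *parent;
	unsigned int level;
	int pid_max;
};

/* Mirror of the patch: a new namespace copies its parent's pid_max. */
static void model_ns_init_child(struct model_pid_ns *ns,
				struct model_pid_ns *old)
{
	assert(ns != NULL && old != NULL);
	ns->parent = old;
	ns->level = old->level + 1;
	ns->pid_max = old->pid_max;
}

/*
 * Hypothetical policy, not in the patch: accept new_max only if it is
 * at least MODEL_PID_MAX_MIN and no larger than the parent's limit.
 * Returns 0 on success, -1 if the value is rejected.
 */
static int model_ns_set_pid_max(struct model_pid_ns *ns, int new_max)
{
	assert(ns != NULL);
	if (new_max < MODEL_PID_MAX_MIN)
		return -1;
	if (ns->parent != NULL && new_max > ns->parent->pid_max)
		return -1;
	ns->pid_max = new_max;
	return 0;
}
```

Lowering the limit in a child leaves the parent's value untouched, which
is the property that makes a small default for migratable containers
possible.
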