[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
KAMEZAWA Hiroyuki
kamezawa.hiroyu at jp.fujitsu.com
Mon Dec 27 01:18:01 PST 2010
On Mon, 27 Dec 2010 03:42:57 -0500
Ben Blum <bblum at andrew.cmu.edu> wrote:
> On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> > Okay, then, problem is cpuset, which allocates memory.
> > (But I feel that limitation -never_sleep- is not very good.)
> >
> > BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem().
> > Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to
> > be the same. I wonder we can see problem with some stupid userland.
>
> I'm not sure I understand the point of looking at mm->owner? My code
> does thread_group_leader() because if a task that's not the leader calls
> exec(), the thread_group list will be in an indeterminate state from the
> perspective of the task we hold on to (who was originally the leader).
>
Hm. By following, only 'thread-group-leader' can go ahead.
+ threadgroup_fork_write_lock(leader);
+ read_lock(&tasklist_lock);
+ /* sanity check - if we raced with de_thread, we must abort */
+ if (!thread_group_leader(leader)) {
+ retval = -EAGAIN;
+ read_unlock(&tasklist_lock);
+ threadgroup_fork_write_unlock(leader);
+ goto list_teardown;
+ }
This moves each tasks under tasklist_lock().
+ list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+ /* leave current thread as it is if it's already there */
+ oldcgrp = task_cgroup_from_root(tsk, root);
+ if (cgrp == oldcgrp)
+ continue;
+ /* attach each task to each subsystem */
+ for_each_subsys(root, ss) {
+ if (ss->attach_task)
+ ss->attach_task(cgrp, tsk);
+ }
+ /* we don't care whether these threads are exiting */
+ retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+ BUG_ON(retval != 0 && retval != -ESRCH);
+ }
Hm. I'm not sure there can be a task, tsk == mm->owner here.
If tsk == mm->owner, attach task will take mmap_sem() and do something more.
In usual case, mm->owner = thread_group_leader, but it's not in special cases.
> > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > thread-group list while iterating over it - so that a race with exec()
> > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > basically never safe to iterate over leader->thread_group unless you're
> > > in a no-sleep section.
> > >
> >
> > Thank you for explanation.
> >
> > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > ....is there any reason to allow all threads can affect the 'mm' ?
>
> I don't follow... is this any different from the cgroup_attach_task()
> case?
>
cpuset_attach() takes mmap_sem() to modify mm's settings. Because all threads
share mm, doing page migration whenever a thread moves seems to be wrong.
I think only a thread, thread-group-leader or mm->owner, in process should be
allowed to migrate pages. Then, cgroup_attach_proc() can avoid taking mmap_sem
under tasklist_lock.
> > About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> > before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> > working area.
>
> To save on global memory footprint, we could pre-allocate two nodemasks,
> but I feel like it's not worth the increase in code complexity. This
> would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> too... too much to keep track of for little gain.
>
But NODEMASK_ALLOC cannot be on stack when we consider 4096node systems.
> > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> >
> > I may not understand anything important but I feel taking tasklist_lock() is overkill.
>
> Would rcu_read_lock() be any better? Everywhere else in the kernel that
> iterates over all threads in a group uses either rcu_read_lock or
> tasklist_lock.
>
Because it means that can_attach()/attach() cannot sleep, it seems to make no
difference.
I wonder.... if you stops all clone()/fork() of the proc in moving, you can use
find_ge_pid(). Please see next_tgid() or next_tid() in fs/proc/base.c which
implements scanning tasklist with sleep. Can't you use next_tid() ?
Thanks,
-Kame
P.S. I'll be offlined until 2010/01/04.
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
More information about the Devel
mailing list