[Devel] Re: 2009 kernel summit preparation for 'containers end-game' discussion

Ying Han yinghan at google.com
Tue Oct 6 09:53:09 PDT 2009


On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue at us.ibm.com> wrote:
> Hi,
>
> the kernel summit is rapidly approaching. One of the agenda
> items is 'the containers end-game and how do we get there.'
> As of now I don't yet know who will be there to represent the
> containers community in that discussion.  I hope there is
> someone planning on that?  In the hopes that there is, here is
> a summary of the info I gathered in June, in case that is
> helpful.  If it doesn't look like anyone will be attending
> ksummit representing containers, then I'll send the final
> version of this info to the ksummit mailing list so that someone
> can stand in.
>
> 1. There will be an IO controller minisummit before KS.  I
> trust someone (Balbir?) will be sending meeting notes to
> the cgroup list, so that highlights can be mentioned at KS?
>
> 2. There was a checkpoint/restart BOF plus talk at plumber's.
> Notes on the BOF are here:
>
> https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
>
> 3. There was an OOM notification talk or BOF at plumber's.
> Dave or Balbir, are there any notes about that meeting?
Serge:
Here are some notes I took from Dave's OOM talk:

Change the OOM killer's policy.

The current goal of the OOM killer is to kill a rogue memory-hogging task,
freeing memory and allowing the system or container to resume normal
operation. Under OOM conditions, the kernel scans the task list of the
system or container and scores each task using a heuristic. The task with
the highest score is killed. The kernel also provides the /proc/<pid>/oom_adj
interface for adding user policy on top of the score; it allows an admin to
tune the "badness" on a per-task basis.

Linux theory: a free page is a wasted page of RAM, and Linux will always fill
up memory with disk caches. When we time the running of an application, we
normally follow the sequence "flush cache - time - run app - time - flush
cache". So being OOM is normal; it is not a bug.

Linux-mm has a list describing the possible OOM conditions.
http://linux-mm.org/OOM

User Perspectives:
High Performance Computing: I will take as much memory as can be given;
please tell me how much memory that is. In these systems, swapping is the devil.

Enterprise: Applications do their own memory management. If the system gets
low on memory, I want the kernel to tell me, and I will give some of mine
back. Memory notification systems drew a lot of attention. A couple of
proposals have been posted on linux-mm, but none of them seems to fulfill
all the requirements.

Desktop: This is what the OOM killer was designed for. When
OpenOffice/Firefox blows up, please just kill it quickly; I will reopen it
in a minute. But please don't kill sshd.

Memory Reclaim
If there is no free memory, we scan the LRU and try to free pages. Recent
issues in page reclaim focus on scalability. In 1991, with 4 MB of DRAM, we
had 1,024 pages to scan. In 2009, with 4 GB of DRAM, we have 1,048,576 pages
to scan. The increase in memory size makes reclaim's job harder and harder.

Beat the LRU into shape
* Never run out of memory, never reclaim, and never look at the LRU.
* Use a larger page size. IBM uses 64K pages instead of 4K pages ("more of a
kernel change than a userspace change if applications use libc").
* Keep troublesome pages off the LRU lists, including unreclaimable pages
(anon, mlock, shm, slab, dirty pages) and hugetlbfs pages, which are not
counted in RSS.
* Split up the LRU lists. This includes the per-NUMA-node lists as well as
the unevictable-pages patch from Rik (~2.6.28).

What is next:

Having the OOM killer always pick the "right" application to kill is a tough
problem, and it has been a hot topic upstream, with several patches posted.
Notification systems received a lot of attention during the talk; here is a
summary of the patches posted so far:

Linux killed Kenny, bastard!
Evgeniy Polyakov posted this patch early this year. It provides an API by
which the admin can specify the OOM victim by process name.
No one on linux-mm liked the patch. The argument is that the current
mechanism for calculating the "badness" score is far too complex for an
admin to determine which task will be killed. Alan Cox answered the question
simply: "its always heuristic", and he also pointed out, "What you actually
need is notifiers to work on /proc. In fact containers are probably the
right way to do it".

Cgroup based OOM killer controller
Nikanth Karthikesan re-posted the patch that adds cgroup support. The patch
adds an adjustable value, "oom.victim", for each OOM cgroup. The OOM killer
kills all the processes in a cgroup with a higher oom.victim value before
killing a process in a cgroup with a lower oom.victim value. Among tasks
with the same oom.victim value, the usual "badness" heuristics apply.
This goes one step further by making use of the cgroup hierarchy for the OOM
killer subsystem. However, the same question was raised: "What is the
difference between oom_adj and this oom.victim to the user?" Nikanth
answered: "Using this oom.victim users can specify the exact order to kill
processes." In other words, oom_adj works as a hint to the kernel, while
oom.victim gives a strict order.

Per-cgroup OOM handler
Ying Han posted the Google in-house patch to linux-mm, which defers OOM kill
decisions to userspace. It allows userspace to respond to OOM by adding
nodes, dropping caches, raising the memcg limit, or sending a signal. An
alternative is /dev/mem_notify, which David Rientjes proposed on linux-mm.
The idea is similar: instead of waiting on oom_await, userspace can poll for
the information under low-memory conditions and respond accordingly.

Vladislav Buzov posted a patch which extends the memcg subsystem with a
notification system for system low-memory conditions. The feedback looks
promising this time, although lots of changes still need to be made.
Discussion focused on the implementation of the notification mechanism.
Balbir Singh mentioned cgroupstats, a genetlink-based mechanism for event
delivery and request/response applications. Paul Menage proposed a couple of
options, including a new ioctl on cgroup files, a new syscall, and a new
per-cgroup file.

--Ying Han

>
> 4. The actual title of the KS discussion is 'containers end-game'.
> The containers-specific info I gathered in June was mainly about
> additional resources which we might containerize.  I expect that
> will be useful in helping the KS community decide how far down
> the containerization path they are willing to go - i.e. whether
> we want to call what we have good enough and say you must use kvm
> for anything more, whether we want to be able to provide all the
> features of a full VM with containers, or something in between,
> say targeting specific uses (perhaps only expand on cooperative
> resource management containers).  With that in mind, here are
> some items that were mentioned in June as candidates for
> more containerization work
>
>        1. Cpu hard limits, memory soft limits (Balbir)
>        2. Large pages, mlock, shared page accounting (Balbir)
>        3. Oom notification (Balbir - was anything decided on this
>                at plumber's?)
>        4. There is agreement on getting rid of the ns cgroup,
>                provided that:
>                a. user namespaces can provide container confinement
>                guarantees
>                b. a compatibility flag is created to clone parent
>                cgroup when creating a new cgroup (Paul and Daniel)
>        5. Poweroff/reboot handling in containers (Daniel)
>        6. Full user namespaces to segregate uids in different
>                containers and confine root users in containers, i.e.
>                with respect to file systems like cgroupfs.
>        7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
>        8. C/r will want inode virtualization (Daniel)
>        9. Sunrpc containerization (required to allow multiple
>                containers separate NFS client access to the same server)
>        10. Sysfs tagging, support for physical netifs to migrate
>                network namespaces, and /sys/class/net virtualization
>
> Again the point of this list isn't to ask for discussion about
> whether or how to implement each at this KS, but rather to give
> an idea of how much work is left to do.  Though let the discussion
> lead where it may of course.
>
> I don't have it here, but maybe it would also be useful to
> have a list ready of things we can do today with containerization?
> Both with upstream, and with under-development patchsets.
>
> I also hope that someone will take notes on the ksummit
> discussion to send to the containers and cgroup lists.
> I expect there will be a good LWN writeup, but a more
> containers-focused set of notes will probably be useful
> too.
>
> thanks,
> -serge
> _______________________________________________
> Containers mailing list
> Containers at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>