[Devel] [PATCH 1/8] memcg: export kmemcg cache id via cgroup fs

Mon Feb 3 03:04:26 PST 2014

On Mon, 3 Feb 2014, Vladimir Davydov wrote:

> AFAIU, cgroup identifiers dumped on oom (cgroup paths, currently) and
> memcg slab cache names serve for different purposes.

Sure, you may dump the name for a number of legitimate reasons, but the 
problem still exists that it's difficult to determine what memcg is being 
referenced without a flat hierarchy and unique memcg names for all 
children.

> The point is oom is
> a perfectly normal situation for the kernel, and info dumped to dmesg is
> for admin to find out the cause of the problem (a greedy user or
> cgroup).

Hmm, so if we hand out top-level memcgs to individual jobs or users, like 
our userspace does, and they are able to configure their child memcgs as 
they wish, and then they or the admin finds in the kernel log that a 
memory hog was killed from the memcg with the perfectly anonymous memcg 
name of "memcg", how do we determine what job or user triggered that kill?  
User id is not going to be conclusive in a production environment with 
shared user accounts.

> On the other hand, slab cache names are dumped to dmesg only on
> extraordinary situations - like bugs in slab implementation, or double
> free, or detected memory leaks - where we usually do not need the name
> of the memcg that triggered the problem, because the bug is likely to be
> in the kernel subsys using the cache.

There's certainly overlap here since slab leaks triggered by a particular 
workload, perhaps by usage of a particular syscall, can occur and cause 
oom killing but the problem remains that neither the memcg name nor the 
slab cache name may be conclusive to determine what job or user triggered 
the issue.  That's why we make strict demands that memcg names are always 
unique and encode several key values to identify the user and job and we 
don't rely on the parent.

I can also see the huge maintenance burden it would be to keep around a 
mapping of kmem ids to {user, job} pairs just in case we later identify a 
problem and in 99% of the cases would be just wasted storage.

> Plus, the names are exported to
> sysfs in case of slub, again for debugging purposes, AFAIK. So IMO the
> use cases for oom vs slab names are completely different - information
> vs debugging - and I want to export kmem.id only for the ability of
> debugging kmemcg and slab subsystems.
> 

Eeek, I'm not sure I agree.  I've often found that reproducing rare slab 
issues is very difficult without knowledge of the workload so that I can 
reproduce it.  Whereas X is a very large number of machines and we see 
this issue on 0.0001% of X machines, I would be required to enable this 
"debugging" aid unconditionally to ever be able to map the stored kmem id 
back to a user and job, that mapping would be extremely costly to 
maintain, and we've gained nothing if we had already demanded that 
userspace identify their memcg names with unique identifiers regardless of 
where they are in the hierarchy.