[Devel] [PATCH RH9 11/23] ms/memcg: enable accounting for mnt_cache entries
Vasily Averin
vvs at virtuozzo.com
Sun Sep 26 13:14:06 MSK 2021
Patch series "memcg accounting from OpenVZ", v7.
OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
merged it upstream, and a few years ago switched to cgroups v1. Now we're
rebasing again, revising our old patches and trying to push them upstream.
We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.
The patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from maintainers of affected subsystems or other people added in cc:
Compared to upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT
We have incorrect, incomplete or obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.
We are also going to add accounting for nft, but it is not ready yet.
We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.
This patch (of 10):
The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times. Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.
It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.
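To put the figures above in perspective, here is a rough back-of-the-envelope
sketch in userspace C. It is not part of the patch: the function name and the
two constants are hypothetical, the ~400-byte and PATH_MAX values are simply
taken from the text above, and the result is an upper bound that assumes every
mount carries a maximum-length mnt_devname and ignores slab alignment.

```c
#include <stdint.h>

/* Assumed figures from the description above, not measured values. */
#define MOUNT_STRUCT_BYTES 400ULL   /* approximate sizeof(struct mount) */
#define DEVNAME_MAX_BYTES  4096ULL  /* PATH_MAX, worst case for mnt_devname */

/*
 * Hypothetical helper: upper bound on the kernel memory pinned by mounts
 * when `namespaces` mount namespaces each hold `mounts_per_ns` mounts.
 */
static uint64_t mount_bytes_upper_bound(uint64_t mounts_per_ns,
					uint64_t namespaces)
{
	return mounts_per_ns * namespaces *
	       (MOUNT_STRUCT_BYTES + DEVNAME_MAX_BYTES);
}
```

For instance, mount_bytes_upper_bound(100, 1000) evaluates to 449,600,000
bytes, i.e. roughly 450 MB that would otherwise not be charged to the
container's memcg.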
Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb at google.com>
Acked-by: Christian Brauner <christian.brauner at ubuntu.com>
Cc: Tejun Heo <tj at kernel.org>
Cc: Michal Hocko <mhocko at kernel.org>
Cc: Johannes Weiner <hannes at cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev at gmail.com>
Cc: Roman Gushchin <guro at fb.com>
Cc: Yutian Yang <nglaive at gmail.com>
Cc: Alexander Viro <viro at zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan at gmail.com>
Cc: Andrei Vagin <avagin at gmail.com>
Cc: Borislav Petkov <bp at alien8.de>
Cc: Dmitry Safonov <0x7f454c46 at gmail.com>
Cc: "Eric W. Biederman" <ebiederm at xmission.com>
Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa at zytor.com>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: "J. Bruce Fields" <bfields at fieldses.org>
Cc: Jeff Layton <jlayton at kernel.org>
Cc: Jens Axboe <axboe at kernel.dk>
Cc: Jiri Slaby <jirislaby at kernel.org>
Cc: Kirill Tkhai <ktkhai at virtuozzo.com>
Cc: Oleg Nesterov <oleg at redhat.com>
Cc: Serge Hallyn <serge at hallyn.com>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Zefan Li <lizefan.x at bytedance.com>
Cc: Borislav Petkov <bp at suse.de>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
(cherry picked from commit 79f6540ba88dfb383ecf057a3425e668105ca774)
https://jira.sw.ru/browse/PSBM-133990
Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
---
fs/namespace.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 97adcb5ab5d5..e51b63ae233b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4240,7 +4241,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
--
2.25.1