[Devel] [PATCH RHEL COMMIT] ms/memcg: enable accounting for mnt_cache entries

Konstantin Khorenko khorenko at virtuozzo.com
Tue Sep 28 14:16:20 MSK 2021


The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit 17dd5c77260294d31e3c162b4a57da2f2525bc5c
Author: Vasily Averin <vvs at virtuozzo.com>
Date:   Tue Sep 28 14:16:19 2021 +0300

    ms/memcg: enable accounting for mnt_cache entries
    
    Patch series "memcg accounting from OpenVZ", v7.
    
    OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
    kernels.
    Initially we used our own accounting subsystem, then partially committed
    it to upstream, and a few years ago switched to cgroups v1.  Now we're
    rebasing again, revising our old patches and trying to push them upstream.
    
    We try to protect the host system from any misuse of kernel memory
    allocation triggered by untrusted users inside the containers.
    
    This patch set is addressed mostly to the cgroups maintainers and the
    cgroups@ mailing list, though I would be very grateful for any comments
    from maintainers of affected subsystems or other people added in Cc.
    
    Compared to the upstream, we additionally account the following kernel objects:
    - network devices and their Tx/Rx queues
    - ipv4/v6 addresses and routing-related objects
    - inet_bind_bucket cache objects
    - VLAN group arrays
    - ipv6/sit: ip_tunnel_prl
    - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
    - nsproxy and the namespace objects themselves
    - IPC objects: semaphores, message queues and shared memory segments
    - mounts
    - pollfd and select bits arrays
    - signals and posix timers
    - file locks
    - fasync_struct used by the file lease code and driver's fasync queues
    - tty objects
    - per-mm LDT
    
    Our accounting for a few other kernel objects is incorrect, incomplete
    or obsolete: sk_filter, af_packets, netlink and xt_counters for iptables.
    These require rework and will probably be dropped altogether.
    
    We are also going to add accounting for nft, but it is not ready yet.
    
    We have not tested performance on upstream. However, our performance team
    compared our current RHEL7-based production kernel with the corresponding
    original RHEL7 kernel and reports that it is at least no worse.
    
    This patch (of 10):
    
    The kernel allocates ~400 bytes of 'struct mount' for any new mount.
    Creating a new mount namespace clones most of the parent mounts, and this
    can be repeated many times.  Additionally, each mount allocates up to
    PATH_MAX=4096 bytes for mnt->mnt_devname.
    
    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.
    
    Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
    
    Reviewed-by: Shakeel Butt <shakeelb at google.com>
    Acked-by: Christian Brauner <christian.brauner at ubuntu.com>
    Cc: Tejun Heo <tj at kernel.org>
    Cc: Michal Hocko <mhocko at kernel.org>
    Cc: Johannes Weiner <hannes at cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev at gmail.com>
    Cc: Roman Gushchin <guro at fb.com>
    Cc: Yutian Yang <nglaive at gmail.com>
    Cc: Alexander Viro <viro at zeniv.linux.org.uk>
    Cc: Alexey Dobriyan <adobriyan at gmail.com>
    Cc: Andrei Vagin <avagin at gmail.com>
    Cc: Borislav Petkov <bp at alien8.de>
    Cc: Dmitry Safonov <0x7f454c46 at gmail.com>
    Cc: "Eric W. Biederman" <ebiederm at xmission.com>
    Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
    Cc: "H. Peter Anvin" <hpa at zytor.com>
    Cc: Ingo Molnar <mingo at redhat.com>
    Cc: "J. Bruce Fields" <bfields at fieldses.org>
    Cc: Jeff Layton <jlayton at kernel.org>
    Cc: Jens Axboe <axboe at kernel.dk>
    Cc: Jiri Slaby <jirislaby at kernel.org>
    Cc: Kirill Tkhai <ktkhai at virtuozzo.com>
    Cc: Oleg Nesterov <oleg at redhat.com>
    Cc: Serge Hallyn <serge at hallyn.com>
    Cc: Thomas Gleixner <tglx at linutronix.de>
    Cc: Zefan Li <lizefan.x at bytedance.com>
    Cc: Borislav Petkov <bp at suse.de>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    (cherry picked from commit 79f6540ba88dfb383ecf057a3425e668105ca774)
    https://jira.sw.ru/browse/PSBM-133990
    Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f43cad544089..53efc3285027 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -205,7 +205,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4242,7 +4243,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),

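The sizes quoted in the commit message (~400 bytes per 'struct mount', up to PATH_MAX=4096 bytes for mnt->mnt_devname) can be turned into a rough upper bound on the memory a container could pin through mounts before this change. The following is a minimal userspace sketch, not kernel code: the macro and function names are hypothetical, the 400-byte figure is the approximation from the commit message (the real sizeof(struct mount) depends on kernel version and config), and the devname term is a worst case, since the actual allocation is sized to the string.

```c
#include <stdint.h>

/* Figures taken from the commit message; both are assumptions here. */
#define MOUNT_STRUCT_BYTES 400u   /* approximate size of struct mount */
#define DEVNAME_MAX_BYTES  4096u  /* PATH_MAX, worst case for mnt_devname */

/* Worst-case bytes for a single mount: the struct plus the longest devname. */
static inline uint64_t mount_cost_max(void)
{
	return (uint64_t)MOUNT_STRUCT_BYTES + DEVNAME_MAX_BYTES;
}

/*
 * Upper bound in bytes for `namespaces` mount namespaces that each hold
 * `mounts_per_ns` mounts (cloning a namespace copies most parent mounts).
 */
static inline uint64_t mounts_total_max(uint64_t namespaces,
					uint64_t mounts_per_ns)
{
	return namespaces * mounts_per_ns * mount_cost_max();
}
```

Under these assumptions, mounts_total_max(1000, 100) gives 449600000 bytes, roughly 429 MiB, which with this patch is charged to the container's memcg.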
