[Devel] [PATCH RHEL COMMIT] memcg: add oom_guarantee

Fri Sep 24 15:04:55 MSK 2021

The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit c29e40f31f3393088fda136d22c09ca376660493
Author: Vladimir Davydov <vdavydov.dev at gmail.com>
Date:   Fri Sep 24 15:04:55 2021 +0300

    memcg: add oom_guarantee
    
    Feature: mm: OOM guarantee
    
    This patch description:
    
    OOM guarantee works exactly like low limit, but for OOM, i.e. tasks
    inside cgroups above the limit are killed first.
    
    Read/write via memory.oom_guarantee.
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    oom: rework logic behind memory.oom_guarantee
    
    Rebase to RHEL 7.2 based kernel:
    https://jira.sw.ru/browse/PSBM-42320
    ===
    From: Vladimir Davydov <vdavydov at parallels.com>
    
    Patchset description: oom enhancements - part 2
    
     - Patches 1-2 prepare memcg for upcoming changes in oom design.
     - Patch 3 reworks oom locking design so that the executioner waits for
       victim to exit. This is necessary to increase oom kill rate, which is
       essential for berserker mode.
     - Patch 4 drops unused OOM_SCAN_ABORT
     - Patch 5 introduces oom timeout.
       https://jira.sw.ru/browse/PSBM-38581
     - Patch 6 makes oom fairer when it comes to selecting a victim among
       different containers.
       https://jira.sw.ru/browse/PSBM-37915
     - Patch 7 prepares oom for introducing berserker mode
     - Patch 8 resurrects oom berserker mode, which is supposed to cope with
       actively forking processes.
       https://jira.sw.ru/browse/PSBM-17930
    
    https://jira.sw.ru/browse/PSBM-26973
    
    Changes in v3:
     - rework oom_trylock (patch 3)
     - select exiting process instead of aborting oom scan so as not to keep
       busy-waiting for an exiting process to exit (patches 3, 4)
     - cleanup oom timeout handling + fix stuck process trace dumped
       multiple times on timeout (patch 5)
     - set max_overdraft to ULONG_MAX on selected processes (patch 6)
     - rework oom berserker process selection logic (patches 7, 8)
    
    Changes in v2:
     - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4)
     - propagate victim to the context that initiated oom in oom_unlock
       (patch 6)
     - always set oom_end on releasing oom context (patch 6)
    
    Vladimir Davydov (8):
      memcg: add mem_cgroup_get/put helpers
      memcg: add lock for protecting memcg->oom_notify list
      oom: rework locking design
      oom: introduce oom timeout
      oom: drop OOM_SCAN_ABORT
      oom: rework logic behind memory.oom_guarantee
      oom: pass points and overdraft to oom_kill_process
      oom: resurrect berserker mode
    
    Reviewed-by: Kirill Tkhai <ktkhai at odin.com>
    
    ======
    This patch set adds the memory.oom_guarantee file to the memory cgroup,
    which allows protecting a memory cgroup from the OOM killer. It works as
    follows: the OOM killer first selects from processes in cgroups that are
    above their OOM guarantee, and only if there are none does it switch to
    scanning processes from all cgroups. This behavior is similar to
    UB_OOMGUARPAGES.
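
    The two-pass selection described above can be sketched in userspace C
    roughly as follows. This is only an illustration: struct task_model,
    pick_victim_pass() and pick_victim() are hypothetical names that do not
    exist in the kernel, and the real code iterates over tasks and cgroups
    rather than over a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a task; not a kernel structure. */
struct task_model {
	unsigned long cg_usage;     /* usage of the task's cgroup, in pages */
	unsigned long cg_guarantee; /* oom guarantee of that cgroup, in pages */
	long points;                /* badness score; higher is a better victim */
};

/*
 * Return the index of the highest-scoring task, or -1 if none qualifies.
 * When only_over is nonzero, tasks whose cgroup is within its guarantee
 * are skipped.
 */
static int pick_victim_pass(const struct task_model *t, size_t n, int only_over)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (only_over && t[i].cg_usage <= t[i].cg_guarantee)
			continue;
		if (best < 0 || t[i].points > t[best].points)
			best = (int)i;
	}
	return best;
}

/* First look only at cgroups above their guarantee, then at everything. */
static int pick_victim(const struct task_model *t, size_t n)
{
	int victim;

	assert(t != NULL || n == 0);
	victim = pick_victim_pass(t, n, 1);
	if (victim < 0)
		victim = pick_victim_pass(t, n, 0);
	return victim;
}
```

    In this model a task with a very high score is still spared as long as
    its cgroup stays within its guarantee and some other cgroup does not.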
    
    It also adds OOM kills counter to each memory cgroup and synchronizes
    beancounters' UB_OOMGUARPAGES resource with oom_guarantee/oom_kill_cnt
    obtained from mem_cgroup.
    
    Related to https://jira.sw.ru/browse/PSBM-20089
    
    =========================================
    This patch description:
    
    Currently, memory.oom_guarantee works as a threshold: we first select
    processes in cgroups whose usage is above oom guarantee, and only if
    there is no eligible process in such cgroups, we disregard oom guarantee
    configuration and iterate over all processes. Although simple to
    implement, such behavior is unfair: we do not differentiate between
    cgroups that are only slightly above their guarantee and those that
    exceed it significantly.
    
    This patch therefore reworks how memory.oom_guarantee affects oom
    killer behavior. First of all, it reverts the old logic, which was
    introduced by commit e94e18346f74c ("memcg: add oom_guarantee"), leaving
    hunks bringing the memory.oom_guarantee knob intact. Then it implements
    a new approach to selecting the oom victim that works as follows.
    
    Now a task is selected by oom killer iff (a) the memory cgroup which the
    process resides in has the greatest overdraft of all cgroups eligible
    for scan and (b) the process has the greatest score among all processes
    which reside in cgroups with the greatest overdraft. A cgroup's
    overdraft is defined as
    
      (U-G)/(L-G), if U>G,
      0,           otherwise
    
      G - memory.oom_guarantee
      L - memory.memsw.limit_in_bytes
      U - memory.memsw.usage_in_bytes
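
    As a worked illustration of this definition, the ratio can be evaluated
    in userspace C as below. This is a sketch under assumptions, not the
    kernel implementation: overdraft_ratio() is a hypothetical helper, it
    uses floating point for readability, and it reads the overdraft as zero
    whenever usage is within the guarantee.

```c
#include <assert.h>

/*
 * Illustrative only: the overdraft formula from the description,
 * evaluated in floating point. All arguments use the same unit.
 */
static double overdraft_ratio(unsigned long usage, unsigned long guarantee,
			      unsigned long limit)
{
	/* The ratio is only meaningful when the guarantee is below the limit. */
	assert(guarantee < limit);

	if (usage <= guarantee)
		return 0.0;
	return (double)(usage - guarantee) / (double)(limit - guarantee);
}
```

    With G = 200 and L = 1000, a usage of 600 gives 0.5 and a usage of 1000
    gives 1.0, so a cgroup far above its guarantee ranks ahead of one that
    is only slightly above it.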
    
    https://jira.sw.ru/browse/PSBM-37915
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    Conflicts:
            mm/memcontrol.c
    
    +++
    mm, memcg, oom_gurantee: change memcg oom overdraft formula
    
    Currently our oom killer kills tasks from the cgroup with the max
    overdraft. The overdraft formula looks like this: "usage/(guarantee + 1)",
    which makes all cgroups without a guarantee (the default) the first
    candidates for oom kill, since "usage_cgrp1"/"hundreds of megabytes" is
    always < usage_cgrp2/1.
    
    Change the overdraft formula to a simple
    "usage > guarantee ? usage - guarantee : 0".
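
    The difference between the two formulas can be seen in a small userspace
    sketch. overdraft_old() and overdraft_new() are hypothetical names used
    only here, and integer division in the old formula is an assumption; the
    new formula mirrors the expression quoted above.

```c
#include <assert.h>
#include <limits.h>

/* Old formula as quoted above; integer division is an assumption. */
static unsigned long overdraft_old(unsigned long usage, unsigned long guarantee)
{
	/* Guard against guarantee + 1 wrapping around to zero. */
	assert(guarantee < ULONG_MAX);
	return usage / (guarantee + 1);
}

/* New formula: the amount by which usage exceeds the guarantee, else 0. */
static unsigned long overdraft_new(unsigned long usage, unsigned long guarantee)
{
	return usage > guarantee ? usage - guarantee : 0;
}
```

    For example, take a cgroup using 50000 pages with a guarantee of 25600
    pages and another using 10 pages with no guarantee. The old formula
    scores them 1 and 10, so the tiny unguaranteed cgroup is hit first; the
    new formula scores them 24400 and 10.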
    
    Unrelated note: oom_guarantee is 0 by default and is not inherited from the
    parent cgroup. We are not going to change this since there is currently no
    need for it. It would also require userspace to make sure that
    oom_guarantee is set for all sub-cgroups of cgroups like machine.slice,
    which has an unreachably high oom_guarantee, and we don't want this high
    guarantee on sub-groups.
    
    https://pmc.acronis.com/browse/VSTOR-22575
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    Reviewed-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    https://jira.sw.ru/browse/PSBM-127846
    (cherry-picked from vz7 commit 0996bb7c7837 ("mm, memcg, oom_gurantee:
    change memcg oom overdraft formula"))
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    +++
    oom: Fix task selection in oom_evaluate_task()
    
    It was observed that, when OOM happened, OOM killer did not target the
    "fattest" tasks first. It might have killed half of a CT's processes before
    killing the tasks that actually consumed lots of memory.
    
    This happened because the result of oom_worse() was ignored in
    oom_evaluate_task(): a new task was selected even if it was not worse
    than the previously chosen one.
    
    This patch fixes it.
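
    A simplified userspace model of the fixed selection is sketched below.
    oom_worse() is copied from this patch; struct candidate and
    select_worst() are hypothetical and stand in for the task iteration in
    oom_evaluate_task(), which performs additional checks not modelled here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace copy of the comparison helper added by this patch. */
static bool oom_worse(long points, unsigned long overdraft,
		      long *chosen_points, unsigned long *max_overdraft)
{
	if (overdraft > *max_overdraft) {
		*max_overdraft = overdraft;
		*chosen_points = points;
		return true;
	}
	if (overdraft == *max_overdraft && points > *chosen_points) {
		*chosen_points = points;
		return true;
	}
	return false;
}

/* Hypothetical candidate record: a task's score and its cgroup's overdraft. */
struct candidate {
	long points;
	unsigned long overdraft;
};

/* Return the index of the worst candidate, or -1 if there is none. */
static int select_worst(const struct candidate *c, size_t n)
{
	long chosen_points = LONG_MIN;
	unsigned long max_overdraft = 0;
	int chosen = -1;

	assert(c != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Honor the result: keep the current choice unless this one is worse. */
		if (!oom_worse(c[i].points, c[i].overdraft,
			       &chosen_points, &max_overdraft))
			continue;
		chosen = (int)i;
	}
	return chosen;
}
```

    Dropping the check on the return value would make the loop pick
    whichever candidate it visited last, which matches the symptom
    described above.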
    
    https://jira.sw.ru/browse/PSBM-132385
    
    Signed-off-by: Evgenii Shatokhin <eshatokhin at virtuozzo.com>
    
    Rebased to vz9:
     - changed patch a bit since oom badness has become signed
    
    (cherry picked from vz8 commit 1912097bf33de29a5caacb206eb03792085391cd)
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
 fs/proc/base.c             |  2 +-
 include/linux/memcontrol.h | 27 ++++++++++++++++++++++++
 include/linux/oom.h        | 19 ++++++++++++++++-
 mm/memcontrol.c            | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 15 ++++++++++++--
 5 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e5b5f7709d48..ff7b2776d7ae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -550,7 +550,7 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long points = 0;
 	long badness;
 
-	badness = oom_badness(task, totalpages);
+	badness = oom_badness(task, totalpages, NULL);
 	/*
 	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
 	 * badness value into [0, 2000] range which we have been
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9636680f207..55313dba4847 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -262,6 +262,9 @@ struct mem_cgroup {
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
+	unsigned long overdraft;
+	unsigned long long oom_guarantee;
+
 	/*
 	 * Should the OOM killer kill all belonging tasks, had it kill one?
 	 */
@@ -901,6 +904,19 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
  */
 bool mem_cgroup_cleancache_disabled(struct page *page);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+
+static inline unsigned long mm_overdraft(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+	unsigned long overdraft;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+	overdraft = memcg->overdraft;
+	css_put(&memcg->css);
+
+	return overdraft;
+}
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
@@ -925,6 +941,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);
+unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
 
 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1330,6 +1347,11 @@ static inline bool mem_cgroup_cleancache_disabled(struct page *page)
 	return false;
 }
 
+static inline unsigned long mm_overdraft(struct mm_struct *mm)
+{
+	return 0;
+}
+
 static inline
 unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 		enum lru_list lru, int zone_idx)
@@ -1352,6 +1374,11 @@ mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline void
 mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..3f3f23a785fc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -36,6 +36,8 @@ struct oom_control {
 	/* Memory cgroup in which oom is invoked, or NULL for global oom */
 	struct mem_cgroup *memcg;
 
+	unsigned long max_overdraft;
+
 	/* Used to determine cpuset and node locality requirement */
 	const gfp_t gfp_mask;
 
@@ -108,8 +110,23 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
+static inline bool oom_worse(long points, unsigned long overdraft,
+		long *chosen_points, unsigned long *max_overdraft)
+{
+	if (overdraft > *max_overdraft) {
+		*max_overdraft = overdraft;
+		*chosen_points = points;
+		return true;
+	}
+	if (overdraft == *max_overdraft && points > *chosen_points) {
+		*chosen_points = points;
+		return true;
+	}
+	return false;
+}
+
 long oom_badness(struct task_struct *p,
-		unsigned long totalpages);
+		unsigned long totalpages, unsigned long *overdraft);
 
 extern bool out_of_memory(struct oom_control *oc);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa75ae23a319..3f56f33cf6df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1168,6 +1168,17 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		struct mem_cgroup *parent;
+
+		/*
+		 * Update overdraft of each cgroup under us. This
+		 * information will be used in oom_badness.
+		 */
+		iter->overdraft = mem_cgroup_overdraft(iter);
+		parent = parent_mem_cgroup(iter);
+		if (parent && iter != memcg)
+			iter->overdraft = max(iter->overdraft,
+					parent->overdraft);
 
 		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
@@ -1296,6 +1307,18 @@ bool mem_cgroup_cleancache_disabled(struct page *page)
 }
 #endif
 
+unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
+{
+	unsigned long long guarantee, usage;
+
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	guarantee = READ_ONCE(memcg->oom_guarantee);
+	usage = page_counter_read(&memcg->memsw);
+	return usage > guarantee ? (usage - guarantee) : 0;
+}
+
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
@@ -4427,6 +4450,28 @@ static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
 }
 
+static u64 mem_cgroup_oom_guarantee_read(struct cgroup_subsys_state *css,
+		struct cftype *cft)
+{
+	return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
+}
+
+static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
+					char *buffer, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(kops));
+	unsigned long nr_pages;
+	int ret;
+
+	buffer = strstrip(buffer);
+	ret = page_counter_memparse(buffer, "-1", &nr_pages);
+	if (ret)
+		return ret;
+
+	memcg->oom_guarantee = nr_pages;
+	return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
 					      struct cftype *cft)
@@ -5021,6 +5066,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.write_u64 = mem_cgroup_oom_control_write,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "oom_guarantee",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = mem_cgroup_oom_guarantee_write,
+		.read_u64 = mem_cgroup_oom_guarantee_read,
+	},
 	{
 		.name = "pressure_level",
 	},
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 900ffef32a93..0bff802b1887 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -201,11 +201,15 @@ static bool should_dump_unreclaim_slab(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages,
+			  unsigned long *overdraft)
 {
 	long points;
 	long adj;
 
+	if (overdraft)
+		*overdraft = 0;
+
 	if (oom_unkillable_task(p))
 		return LONG_MIN;
 
@@ -213,6 +217,9 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	if (!p)
 		return LONG_MIN;
 
+	if (overdraft)
+		*overdraft = mm_overdraft(p->mm);
+
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
 	 * unkillable or have been already oom reaped or the are in
@@ -311,6 +318,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
+	unsigned long overdraft;
 	long points;
 
 	if (oom_unkillable_task(task))
@@ -338,13 +346,16 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 */
 	if (oom_task_origin(task)) {
 		points = LONG_MAX;
+		oc->max_overdraft = ULONG_MAX;
 		goto select;
 	}
 
-	points = oom_badness(task, oc->totalpages);
+	points = oom_badness(task, oc->totalpages, &overdraft);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
+	if (!oom_worse(points, overdraft, &oc->chosen_points, &oc->max_overdraft))
+		goto next;
 select:
 	if (oc->chosen)
 		put_task_struct(oc->chosen);