[Devel] [PATCH RH9 06/33] memcg: add oom_guarantee

Andrey Zhadchenko andrey.zhadchenko at virtuozzo.com
Thu Sep 23 22:08:09 MSK 2021


From: Vladimir Davydov <vdavydov at parallels.com>

Feature: mm: OOM guarantee

This patch description:

OOM guarantee works exactly like the low limit, but for OOM: tasks inside
cgroups above their guarantee are killed first.

Read/write via memory.oom_guarantee.

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>

oom: rework logic behind memory.oom_guarantee

Rebase to RHEL 7.2 based kernel:
https://jira.sw.ru/browse/PSBM-42320
===
From: Vladimir Davydov <vdavydov at parallels.com>

Patchset description: oom enhancements - part 2

 - Patches 1-2 prepare memcg for upcoming changes in oom design.
 - Patch 3 reworks oom locking design so that the executioner waits for
   victim to exit. This is necessary to increase oom kill rate, which is
   essential for berserker mode.
 - Patch 4 drops unused OOM_SCAN_ABORT
 - Patch 5 introduces oom timeout.
   https://jira.sw.ru/browse/PSBM-38581
 - Patch 6 makes oom fairer when it comes to selecting a victim among
   different containers.
   https://jira.sw.ru/browse/PSBM-37915
 - Patch 7 prepares oom for introducing berserker mode
 - Patch 8 resurrects oom berserker mode, which is supposed to cope with
   actively forking processes.
   https://jira.sw.ru/browse/PSBM-17930

https://jira.sw.ru/browse/PSBM-26973

Changes in v3:
 - rework oom_trylock (patch 3)
 - select exiting process instead of aborting oom scan so as not to keep
   busy-waiting for an exiting process to exit (patches 3, 4)
 - cleanup oom timeout handling + fix stuck process trace dumped
   multiple times on timeout (patch 5)
 - set max_overdraft to ULONG_MAX on selected processes (patch 6)
 - rework oom berserker process selection logic (patches 7, 8)

Changes in v2:
 - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4)
 - propagate victim to the context that initiated oom in oom_unlock
   (patch 6)
 - always set oom_end on releasing oom context (patch 6)

Vladimir Davydov (8):
  memcg: add mem_cgroup_get/put helpers
  memcg: add lock for protecting memcg->oom_notify list
  oom: rework locking design
  oom: introduce oom timeout
  oom: drop OOM_SCAN_ABORT
  oom: rework logic behind memory.oom_guarantee
  oom: pass points and overdraft to oom_kill_process
  oom: resurrect berserker mode

Reviewed-by: Kirill Tkhai <ktkhai at odin.com>

======
This patch set adds a memory.oom_guarantee file to the memory cgroup, which
makes it possible to protect a memory cgroup from the OOM killer. It works as
follows: the OOM killer first selects victims from processes in cgroups that
are above their OOM guarantee, and only if there is no such process does it
switch to scanning processes from all cgroups. This behavior is similar to
UB_OOMGUARPAGES.

It also adds an OOM kill counter to each memory cgroup and synchronizes the
beancounters' UB_OOMGUARPAGES resource with the oom_guarantee/oom_kill_cnt
values obtained from mem_cgroup.

Related to https://jira.sw.ru/browse/PSBM-20089

=========================================
This patch description:

Currently, memory.oom_guarantee works as a threshold: we first select
processes in cgroups whose usage is above their oom guarantee, and only if
there is no eligible process in such cgroups do we disregard the oom
guarantee configuration and iterate over all processes. Although simple to
implement, such behavior is unfair: we do not differentiate between cgroups
that are only slightly above their guarantee and those that exceed it
significantly.

This patch therefore reworks the way memory.oom_guarantee affects oom
killer behavior. First of all, it reverts the old logic, which was
introduced by commit e94e18346f74c ("memcg: add oom_guarantee"), leaving
the hunks bringing in the memory.oom_guarantee knob intact. Then it
implements a new approach to selecting the oom victim, which works as
follows.

Now a task is selected by the oom killer iff (a) the memory cgroup the
process resides in has the greatest overdraft among all cgroups eligible
for scan, and (b) the process has the greatest score among all processes
residing in cgroups with the greatest overdraft. A cgroup's overdraft is
defined as

  (U-G)/(L-G), if U>G,
  0,           otherwise

  where
  G - memory.oom_guarantee
  L - memory.memsw.limit_in_bytes
  U - memory.memsw.usage_in_bytes
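For illustration, this normalized overdraft could be computed as below. This is a userspace sketch under assumptions: the permille (x1000) fixed-point scaling and the `overdraft_permille()` name are mine, not the kernel's.

```c
#include <assert.h>

/*
 * Normalized overdraft in permille: (U-G)/(L-G) scaled by 1000,
 * and 0 when usage does not exceed the guarantee.
 * G = memory.oom_guarantee, L = memsw limit, U = memsw usage (bytes).
 */
static unsigned long overdraft_permille(unsigned long long U,
					unsigned long long G,
					unsigned long long L)
{
	if (U <= G || L <= G)
		return 0;
	return (unsigned long)(((U - G) * 1000ULL) / (L - G));
}
```

A cgroup at its guarantee scores 0, one halfway between guarantee and limit scores 500, and one at its limit scores 1000, so cgroups are ranked by how far into their "unguaranteed" range they have gone.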

https://jira.sw.ru/browse/PSBM-37915

Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>

Conflicts:
	mm/memcontrol.c

+++
mm, memcg, oom_guarantee: change memcg oom overdraft formula

Currently our oom killer kills tasks from the cgroup with the maximum
overdraft. The overdraft formula looks like "usage / (guarantee + 1)",
which makes all cgroups without a guarantee (the default) the first
candidates for an oom kill, since usage_cgrp1 / (hundreds of megabytes)
is always less than usage_cgrp2 / 1.

Change the overdraft formula to a simple
"usage > guarantee ? usage - guarantee : 0".

Unrelated note: oom_guarantee is 0 by default and is not inherited from the
parent cgroup. Not going to change this, since currently there is no
necessity for it. It would also require userspace to make sure that
oom_guarantee is set for all sub-cgroups of cgroups like machine.slice,
which has an unreachably high oom_guarantee, and we don't want this high
guarantee on the sub-groups.

https://pmc.acronis.com/browse/VSTOR-22575

Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
Reviewed-by: Konstantin Khorenko <khorenko at virtuozzo.com>

https://jira.sw.ru/browse/PSBM-127846
(cherry-picked from vz7 commit 0996bb7c7837 ("mm, memcg, oom_gurantee:
change memcg oom overdraft formula"))

Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>

+++
oom: Fix task selection in oom_evaluate_task()

It was observed that, when an OOM happened, the OOM killer did not target the
"fattest" tasks first. It might have killed half of a CT's processes before
killing the tasks that actually consumed lots of memory.

This happened because the result of oom_worse() was ignored in
oom_evaluate_task(): a new task was selected even if it was not worse than
the previously chosen one.

This patch fixes that.
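The fix amounts to actually honoring oom_worse()'s return value. The comparator below mirrors the oom_worse() helper from the diff as a standalone userspace function (with `int` in place of the kernel's `bool`):

```c
#include <assert.h>

/*
 * A task is a "worse" (i.e. better) victim if its cgroup's overdraft is
 * higher than the current maximum, or equal with a higher badness score.
 * Updates the running maximum/chosen score when it selects the task.
 */
static int oom_worse(long points, unsigned long overdraft,
		     long *chosen_points, unsigned long *max_overdraft)
{
	if (overdraft > *max_overdraft) {
		*max_overdraft = overdraft;
		*chosen_points = points;
		return 1;
	}
	if (overdraft == *max_overdraft && points > *chosen_points) {
		*chosen_points = points;
		return 1;
	}
	return 0;
}
```

The bug was that oom_evaluate_task() fell through to selecting the task even when this comparison returned false; the fix makes a false return jump to the next task instead.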

https://jira.sw.ru/browse/PSBM-132385

Signed-off-by: Evgenii Shatokhin <eshatokhin at virtuozzo.com>

Rebased to vz9:
 - changed patch a bit since oom badness has become signed

(cherry picked from vz8 commit 1912097bf33de29a5caacb206eb03792085391cd)
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
 fs/proc/base.c             |  2 +-
 include/linux/memcontrol.h | 27 ++++++++++++++++++++++++
 include/linux/oom.h        | 19 ++++++++++++++++-
 mm/memcontrol.c            | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 15 ++++++++++++--
 5 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e5b5f77..ff7b277 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -550,7 +550,7 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long points = 0;
 	long badness;
 
-	badness = oom_badness(task, totalpages);
+	badness = oom_badness(task, totalpages, NULL);
 	/*
 	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
 	 * badness value into [0, 2000] range which we have been
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a963668..55313db 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -262,6 +262,9 @@ struct mem_cgroup {
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
+	unsigned long overdraft;
+	unsigned long long oom_guarantee;
+
 	/*
 	 * Should the OOM killer kill all belonging tasks, had it kill one?
 	 */
@@ -901,6 +904,19 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
  */
 bool mem_cgroup_cleancache_disabled(struct page *page);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+
+static inline unsigned long mm_overdraft(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+	unsigned long overdraft;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+	overdraft = memcg->overdraft;
+	css_put(&memcg->css);
+
+	return overdraft;
+}
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
@@ -925,6 +941,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
 void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);
+unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg);
 
 static inline void mem_cgroup_enter_user_fault(void)
 {
@@ -1330,6 +1347,11 @@ static inline bool mem_cgroup_cleancache_disabled(struct page *page)
 	return false;
 }
 
+static inline unsigned long mm_overdraft(struct mm_struct *mm)
+{
+	return 0;
+}
+
 static inline
 unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 		enum lru_list lru, int zone_idx)
@@ -1352,6 +1374,11 @@ static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
 {
 }
 
+static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline void
 mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a14..3f3f23a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -36,6 +36,8 @@ struct oom_control {
 	/* Memory cgroup in which oom is invoked, or NULL for global oom */
 	struct mem_cgroup *memcg;
 
+	unsigned long max_overdraft;
+
 	/* Used to determine cpuset and node locality requirement */
 	const gfp_t gfp_mask;
 
@@ -108,8 +110,23 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
+static inline bool oom_worse(long points, unsigned long overdraft,
+		long *chosen_points, unsigned long *max_overdraft)
+{
+	if (overdraft > *max_overdraft) {
+		*max_overdraft = overdraft;
+		*chosen_points = points;
+		return true;
+	}
+	if (overdraft == *max_overdraft && points > *chosen_points) {
+		*chosen_points = points;
+		return true;
+	}
+	return false;
+}
+
 long oom_badness(struct task_struct *p,
-		unsigned long totalpages);
+		unsigned long totalpages, unsigned long *overdraft);
 
 extern bool out_of_memory(struct oom_control *oc);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa75ae2..3f56f33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1168,6 +1168,17 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		struct mem_cgroup *parent;
+
+		/*
+		 * Update overdraft of each cgroup under us. This
+		 * information will be used in oom_badness.
+		 */
+		iter->overdraft = mem_cgroup_overdraft(iter);
+		parent = parent_mem_cgroup(iter);
+		if (parent && iter != memcg)
+			iter->overdraft = max(iter->overdraft,
+					parent->overdraft);
 
 		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
@@ -1296,6 +1307,18 @@ bool mem_cgroup_cleancache_disabled(struct page *page)
 }
 #endif
 
+unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
+{
+	unsigned long long guarantee, usage;
+
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	guarantee = READ_ONCE(memcg->oom_guarantee);
+	usage = page_counter_read(&memcg->memsw);
+	return usage > guarantee ? (usage - guarantee) : 0;
+}
+
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
@@ -4427,6 +4450,28 @@ static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
 }
 
+static u64 mem_cgroup_oom_guarantee_read(struct cgroup_subsys_state *css,
+		struct cftype *cft)
+{
+	return mem_cgroup_from_css(css)->oom_guarantee << PAGE_SHIFT;
+}
+
+static ssize_t mem_cgroup_oom_guarantee_write(struct kernfs_open_file *kops,
+					char *buffer, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(kops));
+	unsigned long nr_pages;
+	int ret;
+
+	buffer = strstrip(buffer);
+	ret = page_counter_memparse(buffer, "-1", &nr_pages);
+	if (ret)
+		return ret;
+
+	memcg->oom_guarantee = nr_pages;
+	return nbytes;
+}
+
 #ifdef CONFIG_CLEANCACHE
 static u64 mem_cgroup_disable_cleancache_read(struct cgroup_subsys_state *css,
 					      struct cftype *cft)
@@ -5022,6 +5067,12 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
 	{
+		.name = "oom_guarantee",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = mem_cgroup_oom_guarantee_write,
+		.read_u64 = mem_cgroup_oom_guarantee_read,
+	},
+	{
 		.name = "pressure_level",
 	},
 #ifdef CONFIG_NUMA
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 900ffef..0bff802 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -201,11 +201,15 @@ static bool should_dump_unreclaim_slab(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages,
+			  unsigned long *overdraft)
 {
 	long points;
 	long adj;
 
+	if (overdraft)
+		*overdraft = 0;
+
 	if (oom_unkillable_task(p))
 		return LONG_MIN;
 
@@ -213,6 +217,9 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	if (!p)
 		return LONG_MIN;
 
+	if (overdraft)
+		*overdraft = mm_overdraft(p->mm);
+
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
 	 * unkillable or have been already oom reaped or the are in
@@ -311,6 +318,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
+	unsigned long overdraft;
 	long points;
 
 	if (oom_unkillable_task(task))
@@ -338,13 +346,16 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 */
 	if (oom_task_origin(task)) {
 		points = LONG_MAX;
+		oc->max_overdraft = ULONG_MAX;
 		goto select;
 	}
 
-	points = oom_badness(task, oc->totalpages);
+	points = oom_badness(task, oc->totalpages, &overdraft);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
+	if (!oom_worse(points, overdraft, &oc->chosen_points, &oc->max_overdraft))
+		goto next;
 select:
 	if (oc->chosen)
 		put_task_struct(oc->chosen);
-- 
1.8.3.1
