[Devel] [PATCH RHEL7 COMMIT] ms/oom: add helpers for setting and clearing TIF_MEMDIE

Konstantin Khorenko khorenko at virtuozzo.com
Thu Oct 15 06:47:34 PDT 2015


The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.8.6
------>
commit 4860757ccf723defc3ba770ca3ad3f8c67c4ae20
Author: Vladimir Davydov <vdavydov at parallels.com>
Date:   Thu Oct 15 17:47:34 2015 +0400

    ms/oom: add helpers for setting and clearing TIF_MEMDIE
    
    Patchset description: oom enhancements - part 1
    
    Pull mainstream patches that clean up TIF_MEMDIE handling. They will
    come in handy for the upcoming oom rework.
    
    https://jira.sw.ru/browse/PSBM-26973
    
    David Rientjes (1):
      mm, oom: remove unnecessary exit_state check
    
    Johannes Weiner (1):
      mm: oom_kill: clean up victim marking and exiting interfaces
    
    Michal Hocko (3):
      oom: make sure that TIF_MEMDIE is set under task_lock
      oom: add helpers for setting and clearing TIF_MEMDIE
      oom: thaw the OOM victim if it is frozen
    
    Tetsuo Handa (1):
      oom: don't count on mm-less current process
    
    ===============================================
    This patch description:
    
    From: Michal Hocko <mhocko at suse.cz>
    
    This patchset addresses a race which was described in the changelog for
    5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
    
    : PM freezer relies on having all tasks frozen by the time devices are
    : getting frozen so that no task will touch them while they are getting
    : frozen.  But OOM killer is allowed to kill an already frozen task in order
    : to handle OOM situation.  In order to protect from late wake ups OOM
    : killer is disabled after all tasks are frozen.  This, however, still keeps
    : a window open when a killed task didn't manage to die by the time
    : freeze_processes finishes.
    
    The original patch hasn't closed the race window completely because that
    would require a more complex solution, as can be seen from this patchset.
    
    The primary motivation was to close the race condition between OOM killer
    and PM freezer _completely_.  As Tejun pointed out, even though the race
    condition is unlikely, that only makes it harder to debug weird bugs deep
    in the PM freezer, where the debugging options are reduced considerably.
    I can only speculate what might happen when a task is still runnable
    unexpectedly.
    
    On the plus side, and as a side effect, the oom enable/disable has better
    (full barrier) semantics without polluting hot paths.
    
    I have tested the series in KVM with 100M RAM:
    - many small tasks (20M anon mmap) which are triggering OOM continually
    - s2ram which resumes automatically is triggered in a loop
    	echo processors > /sys/power/pm_test
    	while true
    	do
    		echo mem > /sys/power/state
    		sleep 1s
    	done
    - simple module which allocates and frees 20M in 8K chunks. If it sees
      freezing(current) then it tries another round of allocation before calling
      try_to_freeze
    - debugging messages of PM stages and OOM killer enable/disable/fail added
      and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
      it wakes up waiters.
    - rebased on top of the current mmotm which means some necessary updates
      in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
      I think this should be OK because __thaw_task shouldn't interfere with any
      locking down wake_up_process. Oleg?
    
    As expected there are no OOM killed tasks after oom is disabled and
    allocations requested by the kernel thread are failing after all the tasks
    are frozen and OOM disabled.  I wasn't able to catch a race where
    oom_killer_disable would really have to wait but I kinda expected that,
    since the race is really unlikely.
    
    [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
    [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
    [  243.636072] (elapsed 2.837 seconds) done.
    [  243.641985] Trying to disable OOM killer
    [  243.643032] Waiting for concurent OOM victims
    [  243.644342] OOM killer disabled
    [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
    [  243.652983] Suspending console(s) (use no_console_suspend to debug)
    [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
    [...]
    [  243.992600] PM: suspend of devices complete after 336.667 msecs
    [  243.993264] PM: late suspend of devices complete after 0.660 msecs
    [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
    [  243.994717] ACPI: Preparing to enter system sleep state S3
    [  243.994795] PM: Saving platform NVS memory
    [  243.994796] Disabling non-boot CPUs ...
    
    The first 2 patches are simple cleanups for OOM.  They should go in
    regardless of the rest IMO.
    
    Patches 3 and 4 are trivial printk -> pr_info conversion and they should
    go in ditto.
    
    The main patch is the last one and I would appreciate acks from Tejun and
    Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
    task_lock where a look from Oleg would be appreciated) but I am not so sure I
    haven't screwed anything in the freezer code.  I have found several
    surprises there.
    
    This patch (of 5):
    
    This patch is just preparatory and doesn't introduce any functional
    change.
    
    Note:
    I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
    wait for the oom victim and to prevent new kills. This is
    just a side effect of the flag. The primary meaning is to give the oom
    victim access to the memory reserves and that shouldn't be necessary
    here.
    
    Signed-off-by: Michal Hocko <mhocko at suse.cz>
    Cc: Tejun Heo <tj at kernel.org>
    Cc: David Rientjes <rientjes at google.com>
    Cc: Johannes Weiner <hannes at cmpxchg.org>
    Cc: Oleg Nesterov <oleg at redhat.com>
    Cc: Cong Wang <xiyou.wangcong at gmail.com>
    Cc: "Rafael J. Wysocki" <rjw at rjwysocki.net>
    Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
    (cherry picked from commit 49550b605587924b3336386caae53200c68969d3)
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Reviewed-by: Kirill Tkhai <ktkhai at odin.com>
    
    Conflicts:
    	drivers/staging/android/lowmemorykiller.c
    	include/linux/oom.h
    	mm/memcontrol.c
    	mm/oom_kill.c
    
    vdavydov:
    	Call unmark_oom_victim only if TIF_MEMDIE is set (see
    	kernel/exit.c:exit_mm). This was originally done by mainstream
    	commit c32b3cbe0d067 ("oom, PM: make OOM detection in the
    	freezer path raceless"), but I don't want to pull it although I
    	really need this check.
---
 drivers/staging/android/lowmemorykiller.c |  7 ++++++-
 include/linux/oom.h                       |  4 ++++
 kernel/exit.c                             |  3 ++-
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 23 ++++++++++++++++++++---
 5 files changed, 33 insertions(+), 6 deletions(-)
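
The vdavydov note above says exit_mm() must call unmark_oom_victim() only
when TIF_MEMDIE is set. In this patch the helpers only set and clear the
flag, so the guard changes nothing yet. The following is a minimal
user-space sketch, not kernel code, of why the guard starts to matter once
the helpers carry extra state. The struct task type, the TIF_MEMDIE_BIT
constant, the current_task pointer, the oom_victims counter and the
exit_mm_tail() function are all assumptions made up for illustration.

```c
/* User-space model only; none of these definitions are kernel code. */
#define TIF_MEMDIE_BIT (1u << 0)   /* stand-in for the real thread flag */

struct task {
	unsigned int flags;
};

static struct task *current_task;  /* stand-in for the kernel's "current" */
static int oom_victims;            /* hypothetical accounting added later */

/* The single place where a task becomes an OOM victim. */
static void mark_tsk_oom_victim(struct task *tsk)
{
	if (tsk->flags & TIF_MEMDIE_BIT)
		return;                 /* already marked, count it only once */
	tsk->flags |= TIF_MEMDIE_BIT;
	oom_victims++;
}

/* The single place where the current task stops being an OOM victim. */
static void unmark_oom_victim(void)
{
	current_task->flags &= ~TIF_MEMDIE_BIT;
	oom_victims--;
}

/* Models the tail of exit_mm(): unmark only if the task was marked. */
static void exit_mm_tail(struct task *tsk)
{
	current_task = tsk;
	if (current_task->flags & TIF_MEMDIE_BIT)
		unmark_oom_victim();
}
```

In this model, a task that exits without ever being marked leaves
oom_victims untouched. An unconditional unmark_oom_victim() in
exit_mm_tail() would decrement the counter for every exiting task and
drive it negative.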

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 6f094b3..4dd6a34 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -159,8 +159,13 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     selected->pid, selected->comm,
 			     selected_oom_score_adj, selected_tasksize);
 		lowmem_deathpending_timeout = jiffies + HZ;
+		/*
+		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
+		 * infrastructure. There is no real reason why the selected
+		 * task should have access to the memory reserves.
+		 */
+		mark_tsk_oom_victim(selected);
 		send_sig(SIGKILL, selected, 0);
-		set_tsk_thread_flag(selected, TIF_MEMDIE);
 		rem += selected_tasksize;
 	}
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index a751de7..3c37f1e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -52,6 +52,10 @@ static inline bool oom_task_origin(const struct task_struct *p)
 /* linux/mm/oom_group.c */
 extern int get_task_oom_score_adj(struct task_struct *t);
 
+extern void mark_tsk_oom_victim(struct task_struct *tsk);
+
+extern void unmark_oom_victim(void);
+
 extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 			  const nodemask_t *nodemask, unsigned long totalpages);
 extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
diff --git a/kernel/exit.c b/kernel/exit.c
index 90feb5f..1b13207 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -520,7 +520,8 @@ static void exit_mm(struct task_struct * tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	clear_thread_flag(TIF_MEMDIE);
+	if (test_thread_flag(TIF_MEMDIE))
+		unmark_oom_victim();
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00cc66d..cf9ca7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1964,7 +1964,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5ac5d96..224dd8d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -410,6 +410,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
+/**
+ * mark_tsk_oom_victim - marks the given task as OOM victim.
+ * @tsk: task to mark
+ */
+void mark_tsk_oom_victim(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+}
+
+/**
+ * unmark_oom_victim - unmarks the current task as OOM victim.
+ */
+void unmark_oom_victim(void)
+{
+	clear_thread_flag(TIF_MEMDIE);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -434,7 +451,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	task_lock(p);
 	if (p->mm && p->flags & PF_EXITING) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		mark_tsk_oom_victim(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -489,7 +506,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	mark_tsk_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -652,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (current->mm &&
 	    (fatal_signal_pending(current) || current->flags & PF_EXITING)) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 


