[Devel] [PATCH 10/17] oom: boost dying tasks on global oom

Vladimir Davydov vdavydov at parallels.com
Thu Sep 3 05:22:28 PDT 2015


On Thu, Sep 03, 2015 at 02:06:08PM +0300, Kirill Tkhai wrote:
> 
> 
> On 03.09.2015 13:13, Vladimir Davydov wrote:
> > On Thu, Sep 03, 2015 at 01:09:36PM +0300, Kirill Tkhai wrote:
> >>
> >>
> >> On 14.08.2015 20:03, Vladimir Davydov wrote:
> >>> If an oom victim process has a low prio (nice or via cpu cgroup), it may
> >>> take a very long time to exit, which is bad, because the system cannot
> >>> make progress until it dies. To avoid that, this patch makes the oom
> >>> killer set the victim task's prio to the highest possible.
> >>>
> >>> It might be worth submitting this patch upstream. I will probably try.
> >>>
> >>> Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
> >>> ---
> >>>  mm/oom_kill.c | 17 +++++++++++++++--
> >>>  1 file changed, 15 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >>> index 0e6f7535a565..ca765a82fa1a 100644
> >>> --- a/mm/oom_kill.c
> >>> +++ b/mm/oom_kill.c
> >>> @@ -294,6 +294,15 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> >>>  	return OOM_SCAN_OK;
> >>>  }
> >>>  
> >>> +static void boost_dying_task(struct task_struct *p)
> >>> +{
> >>> +	/*
> >>> +	 * Set the dying task scheduling priority to the highest possible so
> >>> +	 * that it will die quickly irrespective of its scheduling policy.
> >>> +	 */
> >>> +	sched_boost_task(p, 0);
> >>> +}
> >>> +
> >>>  /*
> >>>   * Simple selection loop. We chose the process with the highest
> >>>   * number of 'points'.
> >>> @@ -321,6 +330,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >>>  		case OOM_SCAN_CONTINUE:
> >>>  			continue;
> >>>  		case OOM_SCAN_ABORT:
> >>> +			boost_dying_task(p);
> >>
> >> This is a potential livelock: you are holding at least the
> >> try_set_zonelist_oom() bits locked, and a concurrent thread may use
> >> GFP_NOFAIL in __alloc_pages_slowpath(). In that case it will loop forever.
> > 
> > It won't. There are schedule_timeout calls all over the place. Besides,
> > if try_set_zonelist_oom fails, the caller will call schedule_timeout.
> 
> Really? What if a victim has signal_pending() flag?

schedule_timeout_uninterruptible ignores that. Anyway, if a dying task,
i.e. the one that was boosted, attempts an oom kill, it means we're
screwed anyway, and letting it into out_of_memory won't help: it would
just pick itself and then do nothing. We can only hope that on timeout
we will select another task that doesn't need to allocate memory or
acquire any locks held by allocating processes in order to exit.

> 
> Even if it's not, you can't rely on schedule_timeout(). There is no
> guarantee the lock holder will be chosen for execution at all.
>  

That's right. There's always a chance we will deadlock. This patch set
attempts to minimize that chance in a way similar to the one we have in
PCS6.

I think we need an autotest for stress-testing the OOM killer...

Thanks.
