[CRIU] [PATCH 1/2] restore: Fix deadlock when helper's child dies

Thu Jul 6 21:53:28 MSK 2017

On 07/06/2017 07:25 PM, Kirill Tkhai wrote:
> On Thu, Jul 06, 2017 at 18:58, Dmitry Safonov wrote:
>> On 07/06/2017 05:53 PM, Kirill Tkhai wrote:
>>> On Wed, Jul 05, 2017 at 23:21, Dmitry Safonov wrote:
>>>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>>>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>>>> This way we avoided default criu sighandler as it doesn't expect that
>>>> childs may die.
>>>>
>>>> This is very racy as we wait on futex for another stage to be started,
>>>> but the next stage may start only when all the tasks complete previous
>>>> stage. If some children of helper dies, the helper may already have
>>>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>>>> stage never comes and no one reads a pending SIGCHLD for helper.
>>>>
>>>> A customer met this situation on the node, where the following
>>>> (non-related) problem has occured:
>>>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>>>> Then child criu of the helper has exited with error-code and the
>>>> lockup has happened.
>>>>
>>>> While we could fix it by aborting futex in the end of
>>>> restore_task_with_children() for each (non-root also) tasks,
>>>> that would be not completely correct:
>>>> 1. All futex-waiting tasks will wake up after that and they
>>>>      may not expect that some tasks are on the previous stage,
>>>>      so they will spam into logs with unrelated errors and may
>>>>      also die painfully.
>>>> 2. Child may die and miss aborting of the futex due to:
>>>>      o segfault
>>>>      o OOM killer
>>>>      o User-sended SIGKILL
>>>>      o Other error-path we forgot to cover with abort futex
>>>>
>>>> To fix this deadlock in TASK_HELPER, introduce a new handler for
>>>> SIGCHLD which will check if the child was expected to die.
>>>> It should be as simple and plain as possible and only abort
>>>> futex there - we need reentrancy to avoid missing new signals.
>>>> No wait for helpers, nothing - only futex aborting.
>>>>
>>>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>>>> ---
>>>>    criu/cr-restore.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>    1 file changed, 50 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>>>> index f9d9b06db731..0da54ce75d41 100644
>>>> --- a/criu/cr-restore.c
>>>> +++ b/criu/cr-restore.c
>>>> @@ -1178,6 +1178,49 @@ out:
>>>>    	return ret;
>>>>    }
>>>> +static bool is_child_death_expected(pid_t dead_pid)
>>>> +{
>>>> +	struct pstree_item *pi;
>>>> +
>>>> +	list_for_each_entry(pi, &current->children, sibling) {
>>>> +		if (dead_pid != vpid(pi))
>>>> +			continue;
>>>
>>> See [PATCH] restore: Fix waited task pid in criu mailing list
>>
>> It's sent ~30 minutes ago, but I'll have it in mind.
>>
>>>
>>>> +
>>>> +		switch (pi->pid->state) {
>>>> +		case TASK_DEAD:
>>>> +		case TASK_HELPER:
>>>> +			return true;
>>>> +		default:
>>>> +			return false;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * SIGCHLD isn't blocked in this handler.
>>>> + * Otherwise we may miss some pending SIGCHLD.
>>>> + * Rather make the handler re-entrant.
>>>> + */
>>>> +static void helper_sigchld_handler(int signal, siginfo_t *siginfo, void *data)
>>>> +{
>>>> +	pid_t pid = siginfo->si_pid;
>>>> +
>>>> +	if (is_child_death_expected(pid))
>>>
>>> It's wrong. Imagine, you just have installed a handler
>>> and a child exites right after that; but child has not
>>> called restore_finish_stage() yet. You'll have the same
>>> deadlock here.
>>
>> Nope. It's perfectly valid.
>>
>> Read the code carefully, please:
>> futex_abort_and_wake() sets nr_in_progress to FUTEX_ABORT_RAW == -1
>> The root task does restore_switch_stage() which sleeps with
>> __restore_wait_inprogress_tasks():
>>
>>> 	futex_t *np = &task_entries->nr_in_progress;
>>>
>>> 	futex_wait_while_gt(np, participants);
>>> 	ret = (int)futex_get(np);
>>> 	if (ret < 0) {
>>> 		set_cr_errno(get_task_cr_err());
>>> 		return ret;
>>> 	}
>>
>> It will switch the stage and abort restoring with an error.
>> While helper will go to the next stage and exit.
> 
> How can it be a perfectly valid, while it perfectly sucks? %)
> See below:
> 
> [parent]                                          [child]
> act.sa_sigaction = helper_sigchld_handler
> sigaction(SIGCHLD, &act, NULL)
>                                                    restore_one_zombie()
> 						      [SIGSEGV happenes]
> 						
> helper_sigchld_handler()
>     yes, it's expected child!
>   
> 
> restore_finish_stage(CR_STATE_RESTORE);
> [Deadlock]
> 
> 
> 
> 						  [Unreachable]:
> 						  restore_finish_stage(CR_STATE_RESTORE)
> 
> So you hang on restore_finish_stage() like you used to hang.

Well, looks like mine does suck more than yours.
I could insert (stage == CR_STATE_RESTORE) thing into handler,
but as we bond to stage anyway.. Maybe it makes sence to make
it synchronic, I'll think about it.

-- 
              Dmitry