[CRIU] [PATCH 1/2] restore: Fix deadlock when helper's child dies

Dmitry Safonov dsafonov at virtuozzo.com
Thu Jul 6 18:58:59 MSK 2017


On 07/06/2017 05:53 PM, Kirill Tkhai wrote:
> On Wed, Jul 05, 2017 at 23:21, Dmitry Safonov wrote:
>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>> This way we avoid the default criu sighandler, as it doesn't expect
>> that children may die.
>>
>> This is very racy, as we wait on a futex for the next stage to be
>> started, but the next stage may start only when all tasks complete the
>> previous stage. If some child of the helper dies, the helper may
>> already have blocked SIGCHLD and started sleeping on the futex. Then
>> the next stage never comes and nobody reads the pending SIGCHLD for
>> the helper.
>>
>> A customer hit this situation on a node where the following
>> (unrelated) problem had occurred:
>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>> Then the helper's criu child exited with an error code and the lockup
>> happened.
>>
>> While we could fix it by aborting the futex at the end of
>> restore_task_with_children() for every task (non-root too),
>> that would not be completely correct:
>> 1. All futex-waiting tasks would wake up after that, and they
>>     may not expect that some tasks are still on the previous stage,
>>     so they would spam the logs with unrelated errors and may
>>     also die painfully.
>> 2. A child may die and miss aborting the futex due to:
>>     o segfault
>>     o OOM killer
>>     o user-sent SIGKILL
>>     o some other error path we forgot to cover with a futex abort
>>
>> To fix this deadlock in TASK_HELPER, introduce a new handler for
>> SIGCHLD which checks whether the child was expected to die.
>> It should be as simple and plain as possible and only abort the
>> futex there - we need reentrancy to avoid missing new signals.
>> No waiting for helpers, nothing - only futex aborting.
>>
>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>> ---
>>   criu/cr-restore.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 50 insertions(+), 5 deletions(-)
>>
>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>> index f9d9b06db731..0da54ce75d41 100644
>> --- a/criu/cr-restore.c
>> +++ b/criu/cr-restore.c
>> @@ -1178,6 +1178,49 @@ out:
>>   	return ret;
>>   }
>>   
>> +static bool is_child_death_expected(pid_t dead_pid)
>> +{
>> +	struct pstree_item *pi;
>> +
>> +	list_for_each_entry(pi, &current->children, sibling) {
>> +		if (dead_pid != vpid(pi))
>> +			continue;
> 
> See [PATCH] restore: Fix waited task pid in criu mailing list

It was sent ~30 minutes ago, but I'll keep it in mind.

> 
>> +
>> +		switch (pi->pid->state) {
>> +		case TASK_DEAD:
>> +		case TASK_HELPER:
>> +			return true;
>> +		default:
>> +			return false;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +/*
>> + * SIGCHLD isn't blocked in this handler.
>> + * Otherwise we may miss some pending SIGCHLD.
>> + * Rather make the handler re-entrant.
>> + */
>> +static void helper_sigchld_handler(int signal, siginfo_t *siginfo, void *data)
>> +{
>> +	pid_t pid = siginfo->si_pid;
>> +
>> +	if (is_child_death_expected(pid))
> 
> It's wrong. Imagine you have just installed the handler
> and a child exits right after that, but the child has not
> called restore_finish_stage() yet. You'll have the same
> deadlock here.

Nope. It's perfectly valid.

Read the code carefully, please:
futex_abort_and_wake() sets nr_in_progress to FUTEX_ABORT_RAW == -1.
The root task does restore_switch_stage(), which sleeps in
__restore_wait_inprogress_tasks():

 >	futex_t *np = &task_entries->nr_in_progress;
 >
 >	futex_wait_while_gt(np, participants);
 >	ret = (int)futex_get(np);
 >	if (ret < 0) {
 >		set_cr_errno(get_task_cr_err());
 >		return ret;
 >	}

The root task will then switch the stage and abort the restore with an
error, while the helper will go on to the next stage and exit.
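
To spell that handshake out, here is a stand-alone toy model of it (not
CRIU code: helper_child_died() and stage_waiter() are names I made up
here, and the real futex_t sleeps on a kernel futex instead of spinning):

/*
 * Toy model of the abort handshake. helper_child_died() stands for the
 * SIGCHLD handler side, stage_waiter() for __restore_wait_inprogress_tasks().
 */
#include <stdatomic.h>
#include <stdio.h>

#define FUTEX_ABORT_RAW	(-1)	/* the value futex_abort_and_wake() stores */

static atomic_int nr_in_progress;

/* handler side: only store -1 (CRIU additionally wakes the futex waiters) */
static void helper_child_died(void)
{
	atomic_store(&nr_in_progress, FUTEX_ABORT_RAW);
}

/*
 * Root task side: wait while the counter is above `participants`,
 * and treat any negative value as an abort of the whole stage.
 */
static int stage_waiter(int participants)
{
	int val;

	while ((val = atomic_load(&nr_in_progress)) > participants)
		;	/* the real code sleeps in futex_wait_while_gt() */

	return val < 0 ? val : 0;
}

int main(void)
{
	atomic_store(&nr_in_progress, 2);	/* two tasks still in the stage */
	helper_child_died();			/* unexpected child death */
	printf("stage result: %d\n", stage_waiter(0));	/* prints -1 */
	return 0;
}

The point is that the stored -1 is seen by the root task no matter
whether it was already sleeping on the futex or reaches the wait later.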

> 
>> +		return;
>> +
>> +	/*
>> +	 * XXX: Not safe to print from sighandler.
>> +	 * Need something like pr_atomic().
>> +	 * Good for a while as in the worst case we'll have
>> +	 * corrupted messages.
>> +	 */
>> +	pr_info("Error: Unexpected death of %d\n", pid);
>> +
>> +	futex_abort_and_wake(&task_entries->nr_in_progress);
>> +}
>> +
>>   static int restore_one_task(int pid, CoreEntry *core)
>>   {
>>   	int i, ret;
>> @@ -1189,16 +1232,18 @@ static int restore_one_task(int pid, CoreEntry *core)
>>   	else if (current->pid->state == TASK_DEAD)
>>   		ret = restore_one_zombie(core);
>>   	else if (current->pid->state == TASK_HELPER) {
>> -		sigset_t blockmask, oldmask;
>> +		struct sigaction act;
>>   
>>   		if (prepare_fds(current))
>>   			return -1;
>>   
>> -		sigemptyset(&blockmask);
>> -		sigaddset(&blockmask, SIGCHLD);
>> +		sigemptyset(&act.sa_mask); /* Don't block SIGCHLD here! */
>> +		act.sa_flags = SA_NOCLDSTOP | SA_SIGINFO | SA_RESTART;
>> +		act.sa_sigaction = helper_sigchld_handler;
> 
> I'd suggest not introducing a new handler and doing the wait
> synchronously with blocked signals, like we already have. At
> this place you already know whether you have to wait for
> a child, so simply analysing the waitpid(-1) return value
> will show you whether the child died or not. Something like this:
> 
> if (need to wait) {
> 	block signals
> 	futex dec current stage
> 	pid = waitpid(-1)
> 	if (stage == CR_STATE_RESTORE)	/* task exited before criu switched the stage */
> 		error();
> 	check_pid_is_expected(pid)
> } else {
> 	restore_finish_stage(task_entries, CR_STATE_RESTORE)
> 	/* generic sighandler will catch unexpected child */
> }
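
Spelled out, I read that as roughly the following (need_to_wait_child()
and check_pid_is_expected() are placeholder names, error handling is
omitted, and I take "futex dec current stage" to be a plain dec of
nr_in_progress):

	if (need_to_wait_child()) {			/* placeholder */
		sigset_t blockmask, oldmask;
		int status;
		pid_t pid;

		sigemptyset(&blockmask);
		sigaddset(&blockmask, SIGCHLD);
		sigprocmask(SIG_BLOCK, &blockmask, &oldmask);

		/* "futex dec current stage" */
		futex_dec_and_wake(&task_entries->nr_in_progress);

		pid = waitpid(-1, &status, 0);
		/*
		 * "if (stage == CR_STATE_RESTORE)": the child exited
		 * before criu switched the stage, so error out here.
		 */
		if (!check_pid_is_expected(pid))	/* placeholder */
			return -1;
	} else {
		restore_finish_stage(task_entries, CR_STATE_RESTORE);
		/* generic sighandler will catch an unexpected child */
	}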

Sucks.
You complicate this part, which may become a can full of bugs later.

> 
> Asynchronous events are needed for something unexpected,
> and they make the code more complicated. Also, if someone
> decides to rework the generic sighandler and to use it
> after these lines, they will have to restore it later.
> In my opinion, it's better to have generic handlers
> like we have where it's possible, while this place
> can easily be written without changing them.

Your proposal assumes that helpers may die *only* on the
following stage, and that is much more likely to change later.
And it's a tight coupling.

> 
>>   
>> -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>> -			pr_perror("Can not set mask of blocked signals");
>> +		ret = sigaction(SIGCHLD, &act, NULL);
>> +		if (ret < 0) {
>> +			pr_perror("sigaction() failed");
>>   			return -1;
>>   		}
>>   
>> -- 
>> 2.13.1


-- 
              Dmitry

