[CRIU] [PATCHv3 2/3] restore: Fix deadlock when helper's child dies

Thu Jul 20 14:21:26 MSK 2017

On 07/20/2017 10:39 AM, Andrei Vagin wrote:
> On Wed, Jul 19, 2017 at 05:02:53PM +0300, Dmitry Safonov wrote:
>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>> This way we avoided default criu sighandler as it doesn't expect that
>> childs may die.
>>
>> This is very racy as we wait on futex for another stage to be started,
>> but the next stage may start only when all the tasks complete previous
>> stage. If some children of helper dies, the helper may already have
>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>> stage never comes and no one reads a pending SIGCHLD for helper.
>>
>> A customer met this situation on the node, where the following
>> (non-related) problem has occured:
>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>> Then child criu of the helper has exited with error-code and the
>> lockup has happened.
>>
>> While we could fix it by aborting futex in the end of
>> restore_task_with_children() for each (non-root also) tasks,
>> that would be not completely correct:
>> 1. All futex-waiting tasks will wake up after that and they
>>     may not expect that some tasks are on the previous stage,
>>     so they will spam into logs with unrelated errors and may
>>     also die painfully.
>> 2. Child may die and miss aborting of the futex due to:
>>     o segfault
>>     o OOM killer
>>     o User-sended SIGKILL
>>     o Other error-path we forgot to cover with abort futex
>>
>> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
>> let's check if there are children deaths expected - if there
>> isn't any, don't block SIGCHLD, otherwise wait() and check if
>> death was on expected stage of restore (not CR_STATE_RESTORE).
>>
>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>> ---
>>   criu/cr-restore.c | 75 +++++++++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 53 insertions(+), 22 deletions(-)
>>
>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>> index 9143cd310fed..21c5fbf3eacc 100644
>> --- a/criu/cr-restore.c
>> +++ b/criu/cr-restore.c
>> @@ -1207,6 +1207,58 @@ out:
>>   	return ret;
>>   }
>>   
>> +static bool child_death_expected(void)
>> +{
>> +	struct pstree_item *pi;
>> +
>> +	list_for_each_entry(pi, &current->children, sibling) {
>> +		switch (pi->pid->state) {
>> +		case TASK_DEAD:
>> +		case TASK_HELPER:
>> +			return true;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
>> +
> 
> Could you write a comment for this function?

Ok

> 
>> +static int restore_one_helper(void)
>> +{
>> +	sigset_t oldmask;
>> +	siginfo_t info;
>> +
>> +	if (prepare_fds(current))
>> +		return -1;
>> +
>> +	if (!child_death_expected()) {
>> +		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> +		return 0;
>> +	}
>> +
>> +	if (block_sigmask(&oldmask, SIGCHLD))
> 
> block_sigmask(NULL, SIGCHLD) has to work too, oldmask isn't used

Yep, it was over-caution, as I was not sure if on all archs the
parameter of sigprocmask can be NULL.
But man says it can be, so will do.

>> +		return -1;
>> +
>> +	/* Finish CR_STATE_RESTORE, but do not wait for the next stage. */
>> +	futex_dec_and_wake(&task_entries->nr_in_progress);
>> +
>> +	if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
>> +		pr_perror("Failed to wait\n");
>> +		return -1;
>> +	}
>> +
>> +	if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>> +		pr_err("Child %d died too early\n", info.si_pid);
>> +		return -1;
>> +	}
>> +
>> +	if (wait_on_helpers_zombies()) {
>> +		pr_err("Failed to wait on helpers and zombies\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static int restore_one_task(int pid, CoreEntry *core)
>>   {
>>   	int i, ret;
>> @@ -1218,32 +1270,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>>   	else if (current->pid->state == TASK_DEAD)
>>   		ret = restore_one_zombie(core);
>>   	else if (current->pid->state == TASK_HELPER) {
>> -		sigset_t blockmask, oldmask;
>> -
>> -		if (prepare_fds(current))
>> -			return -1;
>> -
>> -		sigemptyset(&blockmask);
>> -		sigaddset(&blockmask, SIGCHLD);
>> -
>> -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>> -			pr_perror("Can not set mask of blocked signals");
>> -			return -1;
>> -		}
>> -
>> -		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> -
>> +		ret = restore_one_helper();
>>   		close_image_dir();
>>   		close_proc();
>>   		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>   			close_service_fd(i);
>> -
>> -		if (wait_on_helpers_zombies()) {
>> -			pr_err("failed to wait on helpers and zombies\n");
>> -			ret = -1;
>> -		} else {
>> -			ret = 0;
>> -		}
>>   	} else {
>>   		pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>>   		ret = -1;
>> -- 
>> 2.13.1
>>
>> _______________________________________________
>> CRIU mailing list
>> CRIU at openvz.org
>> https://lists.openvz.org/mailman/listinfo/criu

-- 
              Dmitry