[CRIU] [PATCHv2 2/3] restore: Fix deadlock when helper's child dies

Wed Jul 19 14:31:53 MSK 2017

On 07/19/2017 01:10 PM, Kirill Tkhai wrote:
> On 18.07.2017 23:05, Dmitry Safonov wrote:
>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>> This way we avoided default criu sighandler as it doesn't expect that
>> childs may die.
>>
>> This is very racy as we wait on futex for another stage to be started,
>> but the next stage may start only when all the tasks complete previous
>> stage. If some children of helper dies, the helper may already have
>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>> stage never comes and no one reads a pending SIGCHLD for helper.
>>
>> A customer met this situation on the node, where the following
>> (non-related) problem has occured:
>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>> Then child criu of the helper has exited with error-code and the
>> lockup has happened.
>>
>> While we could fix it by aborting futex in the end of
>> restore_task_with_children() for each (non-root also) tasks,
>> that would be not completely correct:
>> 1. All futex-waiting tasks will wake up after that and they
>>     may not expect that some tasks are on the previous stage,
>>     so they will spam into logs with unrelated errors and may
>>     also die painfully.
>> 2. Child may die and miss aborting of the futex due to:
>>     o segfault
>>     o OOM killer
>>     o User-sended SIGKILL
>>     o Other error-path we forgot to cover with abort futex
>>
>> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
>> let's check if there are children deaths expected - if there
>> isn't any, don't block SIGCHLD, otherwise wait() and check if
>> death was on expected stage of restore (not CR_STATE_RESTORE).
>>
>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>> ---
>>   criu/cr-restore.c | 83 ++++++++++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 61 insertions(+), 22 deletions(-)
>>
>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>> index 9143cd310fed..f40af1ba08c8 100644
>> --- a/criu/cr-restore.c
>> +++ b/criu/cr-restore.c
>> @@ -1207,6 +1207,66 @@ out:
>>   	return ret;
>>   }
>>   
>> +static bool child_death_expected(void)
>> +{
>> +	struct pstree_item *pi;
>> +
>> +	list_for_each_entry(pi, &current->children, sibling) {
>> +		switch (pi->pid->state) {
>> +		case TASK_DEAD:
>> +		case TASK_HELPER:
>> +			return true;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static int restore_one_helper(void)
>> +{
>> +	sigset_t blockmask, oldmask;
>> +	siginfo_t info;
>> +
>> +	if (prepare_fds(current))
>> +		goto abort_futex;
>> +
>> +	if (!child_death_expected()) {
>> +		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> +		return 0;
>> +	}
>> +
>> +	sigemptyset(&blockmask);
>> +	sigaddset(&blockmask, SIGCHLD);
>> +
>> +	if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>> +		pr_perror("Can not set mask of blocked signals");
>> +		goto abort_futex;
>> +	}
> 
> Please, note we already have block_sigmask() primitive for such signal actions.

Ok.

> 
>> +
>> +	futex_dec_and_wake(&task_entries->nr_in_progress);
> 
> I'd add a comment like "Finish CR_STATE_RESTORE, but do not wait for next stage" here.

Ok.

> 
>> +
>> +	if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
>> +		pr_perror("Failed to wait\n");
>> +		goto abort_futex;
>> +	}> +
>> +	if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>> +		pr_err("Child %d died too early\n", info.si_pid);
>> +		goto abort_futex;
>> +	}
> 
> This means nothing, because here you race with your children exiting
> and with root_item changing the state. Imagine:
> 
> Task                     Unexpected Child              Root Item
> 
> Dec nr_in_progress
> 
>                                                         Sees nr_in_progress == 0
>                                                         Changes state to CR_STATE_RESTORE_SIGCHLD
> waitid()
> futex_get() -> OK
> 
>                         Child suddenly finishes
> 
> In this case Task does not see Child exit.

If an unexpected child has died on the next stage - that's ok.
We can't handle it's death anyway - it races with our death.
The best we can do - is to set signal handler, but I don't think
it worth it.

> 
> 
> Instead of that, this check should be in wait_on_helper_zombies() and executed after
> the first iteration of cycle (after the first wait). This will require change of
> wait_on_helper_zombies() to you any children, not only specific.

You do suggest to add more and more complexity to this code.
I don't think it worth it as a child may unexpectedly die after helper
has waited for all children and exited.

> 
> static int wait_on_helpers_zombies(void)
> {
> 	int nr, once = 1;
> 
> 	nr = nr_children_to_wait(current);
> 	while (nr--) {
> 		pid = waitid(P_ALL, -1);
> 		if (pid < 0)
> 			return -1;
> 		if (!child_exit_expected(pid))
> 			return -1;
> 		if (once) {
> 			once = 0;
> 			if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
> 				pr_err("Very bad\n");
> 				return -1;
> 			}
> 		}
> 	}
> 	return 0;
> }
> 
> Also, this improves performance, as children may exit in any order, while
> now we're waiting for specific child. If we wait any child, the system
> may work more in parallel.
> It will be just a positive side-effect of the fix

Not correct. It will be slower.
The task will be woken up every time somebody exited.
At this moment it waits for some child and is not being waked when other
childs exit. And wait() returns immediately for already dead childs -
which should be cheaper.

That means the performance of what we have in the *worst case* (when all
childs die in the order of list) is the same that you suggest.

> 
>> +
>> +	if (wait_on_helpers_zombies()) {
>> +		pr_err("Failed to wait on helpers and zombies\n");
>> +		goto abort_futex;
>> +	}
>> +
>> +	return 0;
>> +
>> +abort_futex:
>> +	futex_abort_and_wake(&task_entries->nr_in_progress);
> 
> Not need to abort futex, as caller already does this.
> 
>> +	return -1;
>> +}
>> +
>>   static int restore_one_task(int pid, CoreEntry *core)
>>   {
>>   	int i, ret;
>> @@ -1218,32 +1278,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>>   	else if (current->pid->state == TASK_DEAD)
>>   		ret = restore_one_zombie(core);
>>   	else if (current->pid->state == TASK_HELPER) {
>> -		sigset_t blockmask, oldmask;
>> -
>> -		if (prepare_fds(current))
>> -			return -1;
>> -
>> -		sigemptyset(&blockmask);
>> -		sigaddset(&blockmask, SIGCHLD);
>> -
>> -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>> -			pr_perror("Can not set mask of blocked signals");
>> -			return -1;
>> -		}
>> -
>> -		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> -
>> +		ret = restore_one_helper();
>>   		close_image_dir();
>>   		close_proc();
>>   		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>   			close_service_fd(i);
>> -
>> -		if (wait_on_helpers_zombies()) {
>> -			pr_err("failed to wait on helpers and zombies\n");
>> -			ret = -1;
>> -		} else {
>> -			ret = 0;
>> -		}
>>   	} else {
>>   		pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>>   		ret = -1;
>>
> 
> Thanks
> 

-- 
              Dmitry