[CRIU] [PATCHv2 2/3] restore: Fix deadlock when helper's child dies

Wed Jul 19 14:55:07 MSK 2017

On 19.07.2017 14:31, Dmitry Safonov wrote:
> On 07/19/2017 01:10 PM, Kirill Tkhai wrote:
>> On 18.07.2017 23:05, Dmitry Safonov wrote:
>>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>>> This way we avoided default criu sighandler as it doesn't expect that
>>> childs may die.
>>>
>>> This is very racy as we wait on futex for another stage to be started,
>>> but the next stage may start only when all the tasks complete previous
>>> stage. If some children of helper dies, the helper may already have
>>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>>> stage never comes and no one reads a pending SIGCHLD for helper.
>>>
>>> A customer met this situation on the node, where the following
>>> (non-related) problem has occured:
>>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>>> Then child criu of the helper has exited with error-code and the
>>> lockup has happened.
>>>
>>> While we could fix it by aborting futex in the end of
>>> restore_task_with_children() for each (non-root also) tasks,
>>> that would be not completely correct:
>>> 1. All futex-waiting tasks will wake up after that and they
>>>     may not expect that some tasks are on the previous stage,
>>>     so they will spam into logs with unrelated errors and may
>>>     also die painfully.
>>> 2. Child may die and miss aborting of the futex due to:
>>>     o segfault
>>>     o OOM killer
>>>     o User-sended SIGKILL
>>>     o Other error-path we forgot to cover with abort futex
>>>
>>> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
>>> let's check if there are children deaths expected - if there
>>> isn't any, don't block SIGCHLD, otherwise wait() and check if
>>> death was on expected stage of restore (not CR_STATE_RESTORE).
>>>
>>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>>> ---
>>>   criu/cr-restore.c | 83 ++++++++++++++++++++++++++++++++++++++++---------------
>>>   1 file changed, 61 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>>> index 9143cd310fed..f40af1ba08c8 100644
>>> --- a/criu/cr-restore.c
>>> +++ b/criu/cr-restore.c
>>> @@ -1207,6 +1207,66 @@ out:
>>>       return ret;
>>>   }
>>>   +static bool child_death_expected(void)
>>> +{
>>> +    struct pstree_item *pi;
>>> +
>>> +    list_for_each_entry(pi, &current->children, sibling) {
>>> +        switch (pi->pid->state) {
>>> +        case TASK_DEAD:
>>> +        case TASK_HELPER:
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>> +
>>> +static int restore_one_helper(void)
>>> +{
>>> +    sigset_t blockmask, oldmask;
>>> +    siginfo_t info;
>>> +
>>> +    if (prepare_fds(current))
>>> +        goto abort_futex;
>>> +
>>> +    if (!child_death_expected()) {
>>> +        restore_finish_stage(task_entries, CR_STATE_RESTORE);
>>> +        return 0;
>>> +    }
>>> +
>>> +    sigemptyset(&blockmask);
>>> +    sigaddset(&blockmask, SIGCHLD);
>>> +
>>> +    if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>>> +        pr_perror("Can not set mask of blocked signals");
>>> +        goto abort_futex;
>>> +    }
>>
>> Please, note we already have block_sigmask() primitive for such signal actions.
> 
> Ok.
> 
>>
>>> +
>>> +    futex_dec_and_wake(&task_entries->nr_in_progress);
>>
>> I'd add a comment like "Finish CR_STATE_RESTORE, but do not wait for next stage" here.
> 
> Ok.
> 
>>
>>> +
>>> +    if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
>>> +        pr_perror("Failed to wait\n");
>>> +        goto abort_futex;
>>> +    }> +
>>> +    if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>>> +        pr_err("Child %d died too early\n", info.si_pid);
>>> +        goto abort_futex;
>>> +    }
>>
>> This means nothing, because here you race with your children exiting
>> and with root_item changing the state. Imagine:
>>
>> Task                     Unexpected Child              Root Item
>>
>> Dec nr_in_progress
>>
>>                                                         Sees nr_in_progress == 0
>>                                                         Changes state to CR_STATE_RESTORE_SIGCHLD
>> waitid()
>> futex_get() -> OK
>>
>>                         Child suddenly finishes
>>
>> In this case Task does not see Child exit.
> 
> If an unexpected child has died on the next stage - that's ok.
> We can't handle it's death anyway - it races with our death.
> The best we can do - is to set signal handler, but I don't think
> it worth it.

Ok, sure.

>>
>>
>> Instead of that, this check should be in wait_on_helper_zombies() and executed after
>> the first iteration of cycle (after the first wait). This will require change of
>> wait_on_helper_zombies() to you any children, not only specific.
> 
> You do suggest to add more and more complexity to this code.
> I don't think it worth it as a child may unexpectedly die after helper
> has waited for all children and exited.
> 
>>
>> static int wait_on_helpers_zombies(void)
>> {
>>     int nr, once = 1;
>>
>>     nr = nr_children_to_wait(current);
>>     while (nr--) {
>>         pid = waitid(P_ALL, -1);
>>         if (pid < 0)
>>             return -1;
>>         if (!child_exit_expected(pid))
>>             return -1;
>>         if (once) {
>>             once = 0;
>>             if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>>                 pr_err("Very bad\n");
>>                 return -1;
>>             }
>>         }
>>     }
>>     return 0;
>> }
>>
>> Also, this improves performance, as children may exit in any order, while
>> now we're waiting for specific child. If we wait any child, the system
>> may work more in parallel.
>> It will be just a positive side-effect of the fix
> 
> Not correct. It will be slower.
> The task will be woken up every time somebody exited.
> At this moment it waits for some child and is not being waked when other
> childs exit. And wait() returns immediately for already dead childs -
> which should be cheaper.
> 
> That means the performance of what we have in the *worst case* (when all
> childs die in the order of list) is the same that you suggest.
> 
>>
>>> +
>>> +    if (wait_on_helpers_zombies()) {
>>> +        pr_err("Failed to wait on helpers and zombies\n");
>>> +        goto abort_futex;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +abort_futex:
>>> +    futex_abort_and_wake(&task_entries->nr_in_progress);
>>
>> Not need to abort futex, as caller already does this.
>>
>>> +    return -1;
>>> +}
>>> +
>>>   static int restore_one_task(int pid, CoreEntry *core)
>>>   {
>>>       int i, ret;
>>> @@ -1218,32 +1278,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>>>       else if (current->pid->state == TASK_DEAD)
>>>           ret = restore_one_zombie(core);
>>>       else if (current->pid->state == TASK_HELPER) {
>>> -        sigset_t blockmask, oldmask;
>>> -
>>> -        if (prepare_fds(current))
>>> -            return -1;
>>> -
>>> -        sigemptyset(&blockmask);
>>> -        sigaddset(&blockmask, SIGCHLD);
>>> -
>>> -        if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>>> -            pr_perror("Can not set mask of blocked signals");
>>> -            return -1;
>>> -        }
>>> -
>>> -        restore_finish_stage(task_entries, CR_STATE_RESTORE);
>>> -
>>> +        ret = restore_one_helper();
>>>           close_image_dir();
>>>           close_proc();
>>>           for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>>               close_service_fd(i);
>>> -
>>> -        if (wait_on_helpers_zombies()) {
>>> -            pr_err("failed to wait on helpers and zombies\n");
>>> -            ret = -1;
>>> -        } else {
>>> -            ret = 0;
>>> -        }
>>>       } else {
>>>           pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>>>           ret = -1;
>>>
>>
>> Thanks
>>
> 
>