[CRIU] [PATCHv2 2/3] restore: Fix deadlock when helper's child dies
Kirill Tkhai
ktkhai at virtuozzo.com
Wed Jul 19 14:55:07 MSK 2017
On 19.07.2017 14:31, Dmitry Safonov wrote:
> On 07/19/2017 01:10 PM, Kirill Tkhai wrote:
>> On 18.07.2017 23:05, Dmitry Safonov wrote:
>>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>>> This way we avoided default criu sighandler as it doesn't expect that
>>> childs may die.
>>>
>>> This is very racy as we wait on futex for another stage to be started,
>>> but the next stage may start only when all the tasks complete previous
>>> stage. If some children of helper dies, the helper may already have
>>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>>> stage never comes and no one reads a pending SIGCHLD for helper.
>>>
>>> A customer met this situation on the node, where the following
>>> (non-related) problem has occured:
>>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>>> Then child criu of the helper has exited with error-code and the
>>> lockup has happened.
>>>
>>> While we could fix it by aborting futex in the end of
>>> restore_task_with_children() for each (non-root also) tasks,
>>> that would be not completely correct:
>>> 1. All futex-waiting tasks will wake up after that and they
>>> may not expect that some tasks are on the previous stage,
>>> so they will spam into logs with unrelated errors and may
>>> also die painfully.
>>> 2. Child may die and miss aborting of the futex due to:
>>> o segfault
>>> o OOM killer
>>> o User-sended SIGKILL
>>> o Other error-path we forgot to cover with abort futex
>>>
>>> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
>>> let's check if there are children deaths expected - if there
>>> isn't any, don't block SIGCHLD, otherwise wait() and check if
>>> death was on expected stage of restore (not CR_STATE_RESTORE).
>>>
>>> Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
>>> ---
>>> criu/cr-restore.c | 83 ++++++++++++++++++++++++++++++++++++++++---------------
>>> 1 file changed, 61 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>>> index 9143cd310fed..f40af1ba08c8 100644
>>> --- a/criu/cr-restore.c
>>> +++ b/criu/cr-restore.c
>>> @@ -1207,6 +1207,66 @@ out:
>>> return ret;
>>> }
>>> +static bool child_death_expected(void)
>>> +{
>>> + struct pstree_item *pi;
>>> +
>>> + list_for_each_entry(pi, ¤t->children, sibling) {
>>> + switch (pi->pid->state) {
>>> + case TASK_DEAD:
>>> + case TASK_HELPER:
>>> + return true;
>>> + }
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +static int restore_one_helper(void)
>>> +{
>>> + sigset_t blockmask, oldmask;
>>> + siginfo_t info;
>>> +
>>> + if (prepare_fds(current))
>>> + goto abort_futex;
>>> +
>>> + if (!child_death_expected()) {
>>> + restore_finish_stage(task_entries, CR_STATE_RESTORE);
>>> + return 0;
>>> + }
>>> +
>>> + sigemptyset(&blockmask);
>>> + sigaddset(&blockmask, SIGCHLD);
>>> +
>>> + if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>>> + pr_perror("Can not set mask of blocked signals");
>>> + goto abort_futex;
>>> + }
>>
>> Please, note we already have block_sigmask() primitive for such signal actions.
>
> Ok.
>
>>
>>> +
>>> + futex_dec_and_wake(&task_entries->nr_in_progress);
>>
>> I'd add a comment like "Finish CR_STATE_RESTORE, but do not wait for next stage" here.
>
> Ok.
>
>>
>>> +
>>> + if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
>>> + pr_perror("Failed to wait\n");
>>> + goto abort_futex;
>>> + }> +
>>> + if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>>> + pr_err("Child %d died too early\n", info.si_pid);
>>> + goto abort_futex;
>>> + }
>>
>> This means nothing, because here you race with your children exiting
>> and with root_item changing the state. Imagine:
>>
>> Task Unexpected Child Root Item
>>
>> Dec nr_in_progress
>>
>> Sees nr_in_progress == 0
>> Changes state to CR_STATE_RESTORE_SIGCHLD
>> waitid()
>> futex_get() -> OK
>>
>> Child suddenly finishes
>>
>> In this case Task does not see Child exit.
>
> If an unexpected child has died on the next stage - that's ok.
> We can't handle it's death anyway - it races with our death.
> The best we can do - is to set signal handler, but I don't think
> it worth it.
Ok, sure.
>>
>>
>> Instead of that, this check should be in wait_on_helper_zombies() and executed after
>> the first iteration of cycle (after the first wait). This will require change of
>> wait_on_helper_zombies() to you any children, not only specific.
>
> You do suggest to add more and more complexity to this code.
> I don't think it worth it as a child may unexpectedly die after helper
> has waited for all children and exited.
>
>>
>> static int wait_on_helpers_zombies(void)
>> {
>> int nr, once = 1;
>>
>> nr = nr_children_to_wait(current);
>> while (nr--) {
>> pid = waitid(P_ALL, -1);
>> if (pid < 0)
>> return -1;
>> if (!child_exit_expected(pid))
>> return -1;
>> if (once) {
>> once = 0;
>> if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>> pr_err("Very bad\n");
>> return -1;
>> }
>> }
>> }
>> return 0;
>> }
>>
>> Also, this improves performance, as children may exit in any order, while
>> now we're waiting for specific child. If we wait any child, the system
>> may work more in parallel.
>> It will be just a positive side-effect of the fix
>
> Not correct. It will be slower.
> The task will be woken up every time somebody exited.
> At this moment it waits for some child and is not being waked when other
> childs exit. And wait() returns immediately for already dead childs -
> which should be cheaper.
>
> That means the performance of what we have in the *worst case* (when all
> childs die in the order of list) is the same that you suggest.
>
>>
>>> +
>>> + if (wait_on_helpers_zombies()) {
>>> + pr_err("Failed to wait on helpers and zombies\n");
>>> + goto abort_futex;
>>> + }
>>> +
>>> + return 0;
>>> +
>>> +abort_futex:
>>> + futex_abort_and_wake(&task_entries->nr_in_progress);
>>
>> Not need to abort futex, as caller already does this.
>>
>>> + return -1;
>>> +}
>>> +
>>> static int restore_one_task(int pid, CoreEntry *core)
>>> {
>>> int i, ret;
>>> @@ -1218,32 +1278,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>>> else if (current->pid->state == TASK_DEAD)
>>> ret = restore_one_zombie(core);
>>> else if (current->pid->state == TASK_HELPER) {
>>> - sigset_t blockmask, oldmask;
>>> -
>>> - if (prepare_fds(current))
>>> - return -1;
>>> -
>>> - sigemptyset(&blockmask);
>>> - sigaddset(&blockmask, SIGCHLD);
>>> -
>>> - if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>>> - pr_perror("Can not set mask of blocked signals");
>>> - return -1;
>>> - }
>>> -
>>> - restore_finish_stage(task_entries, CR_STATE_RESTORE);
>>> -
>>> + ret = restore_one_helper();
>>> close_image_dir();
>>> close_proc();
>>> for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>> close_service_fd(i);
>>> -
>>> - if (wait_on_helpers_zombies()) {
>>> - pr_err("failed to wait on helpers and zombies\n");
>>> - ret = -1;
>>> - } else {
>>> - ret = 0;
>>> - }
>>> } else {
>>> pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>>> ret = -1;
>>>
>>
>> Thanks
>>
>
>
More information about the CRIU
mailing list