[CRIU] [PATCH 2/2] rst: don't hang when SIGCHLD is coalesced

Andrey Wagin avagin at gmail.com
Tue Jul 21 10:59:23 PDT 2015


2015-07-21 18:15 GMT+03:00 Tycho Andersen <tycho.andersen at canonical.com>:
> On Tue, Jul 21, 2015 at 12:42:26PM +0300, Andrew Vagin wrote:
>> On Mon, Jul 20, 2015 at 11:56:43PM -0600, Tycho Andersen wrote:
>> > +static int wait_zombies(struct task_restore_args *task_args)
>> > +{
>> > +   int i;
>> > +
>> > +   for (i = 0; i < task_args->zombies_n; i++) {
>> > +           if (sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL) < 0) {
>> > +                   pr_err("Wait on %d zombie failed\n", task_args->zombies[i]);
>> > +                   return -1;
>> > +           }
>> > +           pr_debug("%ld: Collect a zombie with pid %d\n",
>> > +                   sys_getpid(), task_args->zombies[i]);
>> > +           task_entries->nr_threads--;
>> > +           task_entries->nr_tasks--;
>> > +           futex_dec_and_wake(&task_entries->nr_in_progress);
>> > +   }
>>
>> I think we don't need to decriment counters in the loop:
>>       task_entries->nr_threads -= task_args->zombies_n;
>>       task_entries->nr_tasks   -= task_args->zombies_n;
>
> Yep, sounds good.
>
>> > +
>> > +   return 0;
>> > +}
>> > +
>> >  /*
>> >   * The main routine to restore task via sigreturn.
>> >   * This one is very special, we never return there
>> > @@ -836,6 +849,8 @@ long __export_restore_task(struct task_restore_args *args)
>> >     task_entries = args->task_entries;
>> >     helpers = args->helpers;
>> >     n_helpers = args->helpers_n;
>> > +   zombies = args->zombies;
>> > +   n_zombies = args->zombies_n;
>> >     *args->breakpoint = rst_sigreturn;
>> >
>> >     ksigfillset(&act.rt_sa_mask);
>> > @@ -1190,11 +1205,10 @@ long __export_restore_task(struct task_restore_args *args)
>> >
>> >     pr_info("%ld: Restored\n", sys_getpid());
>> >
>> > -   futex_set(&zombies_inprogress, args->nr_zombies);
>> > -
>>
>> We can block SIGCHLD here
>>
>> >     restore_finish_stage(CR_STATE_RESTORE);
>> >
>> > -   futex_wait_while_gt(&zombies_inprogress, 0);
>> > +   if (wait_zombies(args) < 0)
>> > +           goto core_restore_end;
>> >
>> >     if (wait_helpers(args) < 0)
>> >             goto core_restore_end;
>>
>> and unblock SIGCHLD back here
>>
>> In thise case, sigchld_handler will be executed only once for all
>> helpers and zombies. What do you think about this?
>
> That sounds ok, except that if one non-zombie dies, it might be
> coalesced with the zombie, so we would miss that it died and not do
> the futex_abort_and_wake() to handle the non-zombie case.

Yes, You are right.

Actually it's another problem which was here before your changes. We
are going to fix it separate from this one.

Acked-by: Andrew Vagin <avagin at openvz.org>

Thanks,
Andrew

>
> In fact, this can happen with the patch as it is now, but if we made
> the change you suggest it makes it more likely. I'm not sure what the
> best way to handle this case is (other than "don't have any bugs" :D).
>
> Tycho


More information about the CRIU mailing list