[CRIU] [PATCH 2/2] rst: don't hang when SIGCHLD is coalesced

Tycho Andersen tycho.andersen at canonical.com
Tue Jul 21 08:15:38 PDT 2015


On Tue, Jul 21, 2015 at 12:42:26PM +0300, Andrew Vagin wrote:
> On Mon, Jul 20, 2015 at 11:56:43PM -0600, Tycho Andersen wrote:
> > +static int wait_zombies(struct task_restore_args *task_args)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < task_args->zombies_n; i++) {
> > +		if (sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL) < 0) {
> > +			pr_err("Wait on %d zombie failed\n", task_args->zombies[i]);
> > +			return -1;
> > +		}
> > +		pr_debug("%ld: Collect a zombie with pid %d\n",
> > +			sys_getpid(), task_args->zombies[i]);
> > +		task_entries->nr_threads--;
> > +		task_entries->nr_tasks--;
> > +		futex_dec_and_wake(&task_entries->nr_in_progress);
> > +	}
> 
> I think we don't need to decriment counters in the loop:
> 	task_entries->nr_threads -= task_args->zombies_n;
> 	task_entries->nr_tasks   -= task_args->zombies_n;

Yep, sounds good.

> > +
> > +	return 0;
> > +}
> > +
> >  /*
> >   * The main routine to restore task via sigreturn.
> >   * This one is very special, we never return there
> > @@ -836,6 +849,8 @@ long __export_restore_task(struct task_restore_args *args)
> >  	task_entries = args->task_entries;
> >  	helpers = args->helpers;
> >  	n_helpers = args->helpers_n;
> > +	zombies = args->zombies;
> > +	n_zombies = args->zombies_n;
> >  	*args->breakpoint = rst_sigreturn;
> >  
> >  	ksigfillset(&act.rt_sa_mask);
> > @@ -1190,11 +1205,10 @@ long __export_restore_task(struct task_restore_args *args)
> >  
> >  	pr_info("%ld: Restored\n", sys_getpid());
> >  
> > -	futex_set(&zombies_inprogress, args->nr_zombies);
> > -
> 
> We can block SIGCHLD here
> 
> >  	restore_finish_stage(CR_STATE_RESTORE);
> >  
> > -	futex_wait_while_gt(&zombies_inprogress, 0);
> > +	if (wait_zombies(args) < 0)
> > +		goto core_restore_end;
> >  
> >  	if (wait_helpers(args) < 0)
> >  		goto core_restore_end;
> 
> and unblock SIGCHLD back here
> 
> In thise case, sigchld_handler will be executed only once for all
> helpers and zombies. What do you think about this?

That sounds ok, except that if one non-zombie dies, it might be
coalesced with the zombie, so we would miss that it died and not do
the futex_abort_and_wake() to handle the non-zombie case.

In fact, this can happen with the patch as it is now, but if we made
the change you suggest it makes it more likely. I'm not sure what the
best way to handle this case is (other than "don't have any bugs" :D).

Tycho


More information about the CRIU mailing list