[CRIU] [PATCH 1/2] restore: Fix deadlock when helper's child dies

Thu Jul 6 19:25:01 MSK 2017

On Thu, Jul 06, 2017 at 18:58, Dmitry Safonov wrote:
> On 07/06/2017 05:53 PM, Kirill Tkhai wrote:
> > On Wed, Jul 05, 2017 at 23:21, Dmitry Safonov wrote:
> > > Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
> > > too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
> > > This way we avoided default criu sighandler as it doesn't expect that
> > > childs may die.
> > > 
> > > This is very racy as we wait on futex for another stage to be started,
> > > but the next stage may start only when all the tasks complete previous
> > > stage. If some children of helper dies, the helper may already have
> > > blocked SIGCHLD and have started sleeping on the futex. Then the next
> > > stage never comes and no one reads a pending SIGCHLD for helper.
> > > 
> > > A customer met this situation on the node, where the following
> > > (non-related) problem has occured:
> > > Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
> > > Then child criu of the helper has exited with error-code and the
> > > lockup has happened.
> > > 
> > > While we could fix it by aborting futex in the end of
> > > restore_task_with_children() for each (non-root also) tasks,
> > > that would be not completely correct:
> > > 1. All futex-waiting tasks will wake up after that and they
> > >     may not expect that some tasks are on the previous stage,
> > >     so they will spam into logs with unrelated errors and may
> > >     also die painfully.
> > > 2. Child may die and miss aborting of the futex due to:
> > >     o segfault
> > >     o OOM killer
> > >     o User-sended SIGKILL
> > >     o Other error-path we forgot to cover with abort futex
> > > 
> > > To fix this deadlock in TASK_HELPER, introduce a new handler for
> > > SIGCHLD which will check if the child was expected to die.
> > > It should be as simple and plain as possible and only abort
> > > futex there - we need reentrancy to avoid missing new signals.
> > > No wait for helpers, nothing - only futex aborting.
> > > 
> > > Signed-off-by: Dmitry Safonov <dsafonov at virtuozzo.com>
> > > ---
> > >   criu/cr-restore.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 50 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> > > index f9d9b06db731..0da54ce75d41 100644
> > > --- a/criu/cr-restore.c
> > > +++ b/criu/cr-restore.c
> > > @@ -1178,6 +1178,49 @@ out:
> > >   	return ret;
> > >   }
> > > +static bool is_child_death_expected(pid_t dead_pid)
> > > +{
> > > +	struct pstree_item *pi;
> > > +
> > > +	list_for_each_entry(pi, &current->children, sibling) {
> > > +		if (dead_pid != vpid(pi))
> > > +			continue;
> > 
> > See [PATCH] restore: Fix waited task pid in criu mailing list
> 
> It's sent ~30 minutes ago, but I'll have it in mind.
> 
> > 
> > > +
> > > +		switch (pi->pid->state) {
> > > +		case TASK_DEAD:
> > > +		case TASK_HELPER:
> > > +			return true;
> > > +		default:
> > > +			return false;
> > > +		}
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > +
> > > +/*
> > > + * SIGCHLD isn't blocked in this handler.
> > > + * Otherwise we may miss some pending SIGCHLD.
> > > + * Rather make the handler re-entrant.
> > > + */
> > > +static void helper_sigchld_handler(int signal, siginfo_t *siginfo, void *data)
> > > +{
> > > +	pid_t pid = siginfo->si_pid;
> > > +
> > > +	if (is_child_death_expected(pid))
> > 
> > It's wrong. Imagine, you just have installed a handler
> > and a child exites right after that; but child has not
> > called restore_finish_stage() yet. You'll have the same
> > deadlock here.
> 
> Nope. It's perfectly valid.
> 
> Read the code carefully, please:
> futex_abort_and_wake() sets nr_in_progress to FUTEX_ABORT_RAW == -1
> The root task does restore_switch_stage() which sleeps with
> __restore_wait_inprogress_tasks():
> 
> >	futex_t *np = &task_entries->nr_in_progress;
> >
> >	futex_wait_while_gt(np, participants);
> >	ret = (int)futex_get(np);
> >	if (ret < 0) {
> >		set_cr_errno(get_task_cr_err());
> >		return ret;
> >	}
> 
> It will switch the stage and abort restoring with an error.
> While helper will go to the next stage and exit.

How can it be a perfectly valid, while it perfectly sucks? %)
See below:

[parent]                                          [child]
act.sa_sigaction = helper_sigchld_handler
sigaction(SIGCHLD, &act, NULL)
                                                  restore_one_zombie()
						      [SIGSEGV happenes]

helper_sigchld_handler()
   yes, it's expected child!

restore_finish_stage(CR_STATE_RESTORE);
[Deadlock]

						  [Unreachable]:
						  restore_finish_stage(CR_STATE_RESTORE)

So you hang on restore_finish_stage() like you used to hang.

> > 
> > > +		return;
> > > +
> > > +	/*
> > > +	 * XXX: Not safe to print from sighandler.
> > > +	 * Need something like pr_atomic().
> > > +	 * Good for a while as in the worst case we'll have
> > > +	 * corrupted messages.
> > > +	 */
> > > +	pr_info("Error: Unexpected death of %d\n", pid);
> > > +
> > > +	futex_abort_and_wake(&task_entries->nr_in_progress);
> > > +}
> > > +
> > >   static int restore_one_task(int pid, CoreEntry *core)
> > >   {
> > >   	int i, ret;
> > > @@ -1189,16 +1232,18 @@ static int restore_one_task(int pid, CoreEntry *core)
> > >   	else if (current->pid->state == TASK_DEAD)
> > >   		ret = restore_one_zombie(core);
> > >   	else if (current->pid->state == TASK_HELPER) {
> > > -		sigset_t blockmask, oldmask;
> > > +		struct sigaction act;
> > >   		if (prepare_fds(current))
> > >   			return -1;
> > > -		sigemptyset(&blockmask);
> > > -		sigaddset(&blockmask, SIGCHLD);
> > > +		sigemptyset(&act.sa_mask); /* Don't block SIGCHLD here! */
> > > +		act.sa_flags = SA_NOCLDSTOP | SA_SIGINFO | SA_RESTART;
> > > +		act.sa_sigaction = helper_sigchld_handler;
> > 
> > I'd suggested not to introduce new helper and to do the wait
> > synchronous with blocked signals, like we already have. At
> > this place you are already know, whether you have to wait
> > a child, so simple analysing of waitpid(-1) return value
> > will show you if the child dead or not. Something like this:
> > 
> > if (need to wait) {
> > 	block signals
> > 	futex dec current stage
> > 	pid = waitpid(-1)
> > 	if (stage == CR_STATE_RESTORE)	/* task exited before criu switched the stage */
> > 		error();
> > 	check_pid_is_expected(pid)
> > } else {
> > 	restore_finish_stage(task_entries, CR_STATE_RESTORE)
> > 	/* generic sighandler will catch unexpected child */
> > }
> 
> Sucks.
> You complicate this part which may be a can full of bugs later.
> 
> > 
> > Asynchronous events are need for something unexpected,
> > and they make the code more complicated. Also, if someone
> > decides to rework generic sig handler and to use it
> > after this lines, he will have to restore it later.
> > In my opinion, it's better to have generic handlers
> > like we have where it's possible, while this place
> > is easily may be written without changing them.
> 
> Your proposal expects that helpers may die *only* on the
> following stage, which may be changed much more likely.
> And it's tough bond.

No, my proposal is to handle synchronous events synchronous.
It's easy. And this avoids us from signal handling.

Strange, you did synchronization error in asyncrhronuos code,
and still insist it's easier. It's wrong.

> > 
> > > -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
> > > -			pr_perror("Can not set mask of blocked signals");
> > > +		ret = sigaction(SIGCHLD, &act, NULL);
> > > +		if (ret < 0) {
> > > +			pr_perror("sigaction() failed");
> > >   			return -1;
> > >   		}
> > > -- 
> > > 2.13.1
> > > 
> > > _______________________________________________
> > > CRIU mailing list
> > > CRIU at openvz.org
> > > https://lists.openvz.org/mailman/listinfo/criu
> 
> 
> -- 
>              Dmitry