[CRIU] [PATCH 4/4] restore: Fix hang if root task is waiting on zombie
Cyrill Gorcunov
gorcunov at gmail.com
Tue Dec 11 12:36:29 MSK 2018
On Mon, Dec 10, 2018 at 11:07:09PM -0800, Andrey Vagin wrote:
> > A zombie is not reparented to us yet.
>
> This means that we are waiting for a process (helper) to die and for its
> children (zombies) to be reparented to the init process.
>
> If nr_in_progress is 1, this means that there are no processes which are
> going to die, doesn't it? If the answer is yes, what are we waiting for
> here?
If nr_in_progress is 1, it means the zombie processes have already decremented
nr_in_progress but may not yet be reparented to us, because the zombie restore
code decrements nr_in_progress first and only then calls kill(self).
	for (i = 0; i < task_args->zombies_n; i++) {
		int ret, nr_in_progress;

		nr_in_progress = futex_get(&task_entries_local->nr_in_progress);

		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
		if (ret == -ECHILD) {
			/* A process isn't reparented to this task yet.
			 * Let's wait when someone complete this stage
			 * and try again.
			 */
-->			futex_wait_while_eq(&task_entries_local->nr_in_progress,
					    nr_in_progress);
			i--;
			continue;
		}
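To see the -ECHILD branch in isolation, here is a minimal userspace sketch. It
uses the libc waitid() wrapper (which returns -1 and sets errno) rather than
the raw sys_waitid() from the excerpt, and it passes a real siginfo_t instead
of NULL. The helper names peek_exited_child() and spawn_short_lived_child()
are made up for illustration and do not exist in CRIU.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a CRIU function): check whether `pid` is an
 * exited child of the caller without reaping it. Returns 0 on success or
 * a negative errno, mimicking the raw-syscall convention of the excerpt.
 */
static int peek_exited_child(pid_t pid)
{
	siginfo_t info;

	assert(pid > 0);

	/* WNOWAIT leaves the zombie in place, so it can be waited for again. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: fork a child that exits at once; returns its pid or -1. */
static pid_t spawn_short_lived_child(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);
	return pid;
}
```

For a pid that is not (or is no longer) a child of the caller,
peek_exited_child() returns -ECHILD, which is the condition the loop treats
as "not reparented yet".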
nr_in_progress is fetched first and equals 1. Then we call waitid and get
-ECHILD, so we end up sitting in the futex wait forever. This is what was
reported to me while I was investigating a bug. Unfortunately I lost access to
the failing container and was unable to build a local test case.
I suspect there is a race window between signal delivery and the
nr_in_progress update.
Still, since I have no local reproduction, maybe it is worth simply leaving
this patch floating around, and if the hang triggers someday we will get back
to the question.