[CRIU] [PATCH] restore: helpers and zombies should collect their children
Tycho Andersen
tycho.andersen at canonical.com
Wed Mar 16 13:11:02 PDT 2016
On Wed, Mar 16, 2016 at 10:37:47AM -0700, Andrew Vagin wrote:
> On Tue, Mar 15, 2016 at 04:48:37PM -0600, Tycho Andersen wrote:
> > Consider when there is a double fork of helpers or zombies, e.g. when a
> > zombie has a session id which doesn't match its pid. If the child dies and
> > exits before the grandchild, the grandchild reparents to init, and when the
> > task dies init doesn't have it in the helper list, so init dies as well,
> > viz. the log below.
> >
> > (00.118789) Add a helper 293 for restoring SID 293
> > (00.118792) Attach 294 to the temporary task 293
> > ...
> > (01.394403) 294: Restoring zombie with 0 code
> > ...
> > pie: Task 294 exited, status= 0
> > (01.434279) Error (cr-restore.c:1308): 12097 killed by signal 19
> > (01.434420) Error (cr-restore.c:1308): 12097 killed by signal 19
> > (01.450258) Switching to new ns to clean ghosts
> > (01.450324) Error (cr-restore.c:2138): Restoring FAILED.
> >
> > Let's have the zombies and helpers reap their children before they exit to
> > avoid this.
> >
> > Signed-off-by: Tycho Andersen <tycho.andersen at canonical.com>
> > ---
> > Full log is available at: http://paste.ubuntu.com/15396732/
> > ---
> > criu/cr-restore.c | 35 ++++++++++++++++++++++++++++++++++-
> > 1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> > index 30ddff9..f95adce 100644
> > --- a/criu/cr-restore.c
> > +++ b/criu/cr-restore.c
> > @@ -972,6 +972,32 @@ static inline int sig_fatal(int sig)
> > struct task_entries *task_entries;
> > static unsigned long task_entries_pos;
> >
> > +static int wait_on_helpers_zombies(void)
> > +{
> > + struct pstree_item *pi;
> > +
> > + list_for_each_entry(pi, ¤t->children, sibling) {
> > + pid_t pid = pi->pid.virt;
> > + int status;
> > +
>
> Should we block SIGCHLD before calling this function?
Yes, I think we should. I will send an updated patch.
Tycho
> If the answer is no, why will it not race with sigchld_handler().
>
> > + switch (pi->state) {
> > + case TASK_DEAD:
> > + if (waitid(P_PID, pid, NULL, WNOWAIT | WEXITED) < 0) {
> > + pr_perror("Wait on %d zombie failed\n", pid);
> > + return -1;
> > + }
> > + futex_dec_and_wake(&task_entries->nr_in_progress);
> > + case TASK_HELPER:
> > + if (waitpid(pid, &status, 0) != pid) {
> > + pr_perror("waitpid for helper %d failed", pid);
> > + return -1;
> > + }
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static int restore_one_zombie(CoreEntry *core)
> > {
> > int exit_code = core->tc->exit_code;
> > @@ -985,6 +1011,8 @@ static int restore_one_zombie(CoreEntry *core)
> >
> > if (task_entries != NULL) {
> > restore_finish_stage(CR_STATE_RESTORE);
> > + if (wait_on_helpers_zombies())
> > + pr_err("failed to wait on helpers and zombies\n");
> > zombie_prepare_signals();
> > }
> >
> > @@ -1057,7 +1085,12 @@ static int restore_one_task(int pid, CoreEntry *core)
> > ret = restore_one_zombie(core);
> > else if (current->state == TASK_HELPER) {
> > restore_finish_stage(CR_STATE_RESTORE);
> > - ret = 0;
> > + if (wait_on_helpers_zombies()) {
> > + pr_err("failed to wait on helpers and zombies\n");
> > + ret = -1;
> > + } else {
> > + ret = 0;
> > + }
> > } else {
> > pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
> > ret = -1;
> > --
> > 2.7.0
> >
> > _______________________________________________
> > CRIU mailing list
> > CRIU at openvz.org
> > https://lists.openvz.org/mailman/listinfo/criu
More information about the CRIU
mailing list