[CRIU] hang when restoring container with zombies

Andrew Vagin avagin at gmail.com
Mon Jul 20 05:36:19 PDT 2015


On Mon, Jul 20, 2015 at 01:01:51PM +0300, Pavel Emelyanov wrote:
> On 07/17/2015 11:41 PM, Tycho Andersen wrote:
> > On Fri, Jul 17, 2015 at 09:20:28PM +0300, Pavel Emelyanov wrote:
> >> On 07/17/2015 07:41 PM, Tycho Andersen wrote:
> >>> On Fri, Jul 17, 2015 at 07:15:55PM +0300, Pavel Emelyanov wrote:
> >>>> On 07/17/2015 06:36 PM, Tycho Andersen wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I'm experiencing a hang when restoring a process with zombies; the
> >>>>> zombies exit, but the parent process (in this case the container's
> >>>>> init) isn't getting the SIGCHLD, so it just gets stuck waiting for
> >>>>> zombies_inprogress. The parent process' /proc/pid/status is below, and
> >>>>> it doesn't seem to be blocking SIGCHLD and there are no pending
> >>>>> signals. I stuck a printf in the sigchld_handler in the restorer blob,
> >>>>> and it does get called for sid helpers, but not for the zombie
> >>>>> processes.
> >>>>>
> >>>>> Does anyone have any ideas about what's going wrong? I have no idea
> >>>>> why the signal would be blocked.
> >>>>
> >>>> Maybe it's not blocked but merged with another (previous) SIGCHLD?
> >>>
> >>> Yep, I just traced it and that's exactly what's happening. So I think
> >>> the right thing to do here is to waitpid() in a loop in the restorer
> >>> blob's sigchld_handler so that we make sure to collect all the
> >>> processes that have died?
> >>
> >> No, we can't call waitpid() in pie/restore.c's sigchld_handler()
> >> since we must _leave_ the zombie in zombie state :)
> > 
> > Oh, good point, duh :). Perhaps we should just pass a list of the
> > zombie pids to the restorer blob, and then the handler can look when
> > it gets a SIGCHLD to see how many are left and wake up the main
> > process when they're all in Z state?
> 
> This is exactly what's happening right now -- the restorer blob waits
> for the zombies to exit by making them do it one by one (using the
> zombies_in_progress lock) and catching SIGCHLD. Looks like a bug/race
> in there, but looking at the code I don't see where :\

Actually we use task_entries->zombie_lock; zombies_inprogress is a
counter. I can't find a bug in the code either. Tycho, could you run

    strace -o strace.log -s 256 -f criu restore -v4 -o restore.log ...

and send us strace.log and restore.log?
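
To make the "merged SIGCHLD" effect discussed above concrete, here is a
small standalone sketch. It is not CRIU code, and all function names in it
(on_sigchld, is_zombie, count_zombies, spawn_and_coalesce, reap_all) are
made up for illustration. It forks a few children that exit while SIGCHLD
is blocked, so the kernel keeps only one pending SIGCHLD for all of them,
and it counts the zombies with waitid() plus WNOWAIT, which reports an
exited child without reaping it. Whether this is a suitable approach for
the restorer blob is an open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of times the SIGCHLD handler actually ran. */
static volatile sig_atomic_t sigchld_runs;

static void on_sigchld(int sig)
{
	(void)sig;
	sigchld_runs++;
}

/*
 * Returns 1 if pid is an exited but unreaped child (a zombie), 0 if it is
 * still running, -1 on error. WNOWAIT leaves the child in zombie state.
 */
static int is_zombie(pid_t pid)
{
	siginfo_t info;

	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG | WNOWAIT) < 0)
		return -1;
	return info.si_pid == pid;
}

/* Counts how many of the given children are currently zombies. */
static int count_zombies(const pid_t *pids, int n)
{
	int i, zombies = 0;

	for (i = 0; i < n; i++)
		if (is_zombie(pids[i]) == 1)
			zombies++;
	return zombies;
}

/*
 * Forks n children that exit at once while SIGCHLD is blocked, waits until
 * all of them are zombies, then unblocks SIGCHLD. Returns how many times
 * the handler ran, or -1 on error.
 */
static int spawn_and_coalesce(pid_t *pids, int n)
{
	struct sigaction sa;
	sigset_t block;
	struct timespec tick = { 0, 1000000 }; /* 1 ms */
	int i;

	assert(pids != NULL && n > 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigchld;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, NULL) < 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &block, NULL) < 0)
		return -1;

	sigchld_runs = 0;
	for (i = 0; i < n; i++) {
		pids[i] = fork();
		if (pids[i] < 0)
			return -1;
		if (pids[i] == 0)
			_exit(0);
	}

	/* Poll without reaping until every child has exited. */
	while (count_zombies(pids, n) < n)
		nanosleep(&tick, NULL);

	/* The single pending SIGCHLD is delivered here. */
	if (sigprocmask(SIG_UNBLOCK, &block, NULL) < 0)
		return -1;
	return sigchld_runs;
}

/* Reaps the children once the demonstration is over. */
static void reap_all(const pid_t *pids, int n)
{
	int i;

	for (i = 0; i < n; i++)
		waitpid(pids[i], NULL, 0);
}
```

Calling spawn_and_coalesce() with three children should report a single
handler run while count_zombies() still sees all three, which is why
counting SIGCHLD deliveries alone cannot tell how many children have died.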

Thanks,
Andrew

