[CRIU] hang when restoring container with zombies

Mon Jul 20 06:22:50 PDT 2015

On Mon, Jul 20, 2015 at 06:51:03AM -0600, Tycho Andersen wrote:
> On Mon, Jul 20, 2015 at 06:49:12AM -0600, Tycho Andersen wrote:
> > On Mon, Jul 20, 2015 at 03:40:36PM +0300, Pavel Emelyanov wrote:
> > > On 07/20/2015 03:16 PM, Andrew Vagin wrote:
> > > > On Fri, Jul 17, 2015 at 09:20:28PM +0300, Pavel Emelyanov wrote:
> > > >> On 07/17/2015 07:41 PM, Tycho Andersen wrote:
> > > >>> On Fri, Jul 17, 2015 at 07:15:55PM +0300, Pavel Emelyanov wrote:
> > > >>>> On 07/17/2015 06:36 PM, Tycho Andersen wrote:
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> I'm experiencing a hang when restoring a process with zombies; the
> > > >>>>> zombies exit, but the parent process (in this case the container's
> > > >>>>> init) isn't getting the SIGCHLD, so it just gets stuck waiting for
> > > >>>>> zombies_inprogress. The parent process' /proc/pid/status is below, and
> > > >>>>> it doesn't seem to be blocking SIGCHLD and there are no pending
> > > >>>>> signals. I stuck a printf in the sigchld_handler in the restorer blob,
> > > >>>>> and it does get called for sid helpers, but not for the zombie
> > > >>>>> processes.
> > > >>>>>
> > > >>>>> Does anyone have any ideas about what's going wrong? I have no idea
> > > >>>>> why the signal would be blocked.
> > > >>>>
> > > >>>> Maybe it's not blocked but merged with another (previous) SIGCHLD?
> > > >>>
> > > >>> Yep, I just traced it and that's exactly what's happening. So I think
> > > >>> the right thing to do here is to waitpid() in a loop in the restorer
> > > >>> blob's sigchld_handler so that we make sure to collect all the
> > > >>> processes that have died?
> > > >>
> > > >> No, we can't call waitpid() in pie/restore.c's sigchld_handler()
> > > >> since we must _leave_ the zombie in zombie state :)
> > > > 
> > > > We can try to use waitid with WNOWAIT
> > > > """
> > > > WNOWAIT     Leave the child in a waitable state; a later wait call  can
> > > > 	   be used to again retrieve the child status information.
> > > > """
> > > 
> > > This would mean that we rework the existing zombie wait logic. I'm OK with
> > > it, but would appreciate if we find out what's wrong with the existing one :)
> > 
> > The problem is that it's getting coalesced with a helper task's
> > SIGCHLD, so if the helper exits first and then the zombie, we miss the
> > zombie's exit.

You are right. I thought the helpers had already exited during one of the
previous stages.
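
For reference, the coalescing itself is easy to reproduce outside of CRIU.
The sketch below is not restorer code; count_sigchld() and
count_coalesced_sigchld() are names made up for this example. It blocks
SIGCHLD, lets two children exit, and then counts how many times the handler
runs once the signal is unblocked. Since SIGCHLD is a standard
(non-realtime) signal, only one instance can be pending at a time.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t sigchld_runs;

/* Hypothetical handler: only counts how often it is invoked. */
static void count_sigchld(int sig)
{
	(void)sig;
	sigchld_runs++;
}

/*
 * Let two children exit while SIGCHLD is blocked, then unblock it.
 * Returns the number of handler invocations, or -1 on error.
 */
static int count_coalesced_sigchld(void)
{
	struct sigaction sa, old_sa;
	sigset_t chld;
	siginfo_t info;
	pid_t pids[2];
	int i, runs;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = count_sigchld;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
		return -1;

	sigemptyset(&chld);
	sigaddset(&chld, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &chld, NULL) < 0)
		return -1;

	sigchld_runs = 0;
	for (i = 0; i < 2; i++) {
		pids[i] = fork();
		if (pids[i] < 0)
			return -1;
		if (pids[i] == 0)
			_exit(0);
	}

	/* Make sure both children are dead, without reaping them. */
	for (i = 0; i < 2; i++)
		if (waitid(P_PID, (id_t)pids[i], &info, WEXITED | WNOWAIT) < 0)
			return -1;

	/* Two exits happened, but only one SIGCHLD is pending. */
	sigprocmask(SIG_UNBLOCK, &chld, NULL);
	runs = sigchld_runs;

	/* Clean up: reap the zombies and restore the old handler. */
	for (i = 0; i < 2; i++)
		waitpid(pids[i], NULL, 0);
	sigaction(SIGCHLD, &old_sa, NULL);
	return runs;
}
```

On Linux the function should return 1 rather than 2, which is the same
effect as a helper's SIGCHLD swallowing the zombie's.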

> > 
> > I'll try to send a patch with WNOWAIT.
> 
> I guess another option would be to rename zombie_lock to something
> else, and just have the helpers block on that lock too. I can do
> either, let me know which you prefer.

I vote for waitid. In that case we will be able to remove zombie_lock.
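
A rough sketch of what I mean, under the assumption that we know the pid of
each zombie we expect. This is not the actual patch, and
wait_exit_keep_zombie() and zombie_survives_wnowait() are hypothetical names
used only for illustration. waitid() with WNOWAIT reports the exit but
leaves the task in the zombie state, so a later wait call can still collect
it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Block until child `pid` has exited, but leave it as a zombie.
 * Returns the child's exit code, or -1 on error or abnormal exit.
 */
static int wait_exit_keep_zombie(pid_t pid)
{
	siginfo_t info;

	memset(&info, 0, sizeof(info));
	while (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return info.si_code == CLD_EXITED ? info.si_status : -1;
}

/*
 * Demo: fork a child that exits with code 7, observe the exit with
 * WNOWAIT, then check that the zombie can still be reaped normally.
 * Returns 1 if the zombie survived the first wait, 0 if not, -1 on error.
 */
static int zombie_survives_wnowait(void)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(7);

	if (wait_exit_keep_zombie(pid) != 7)
		return -1;

	/* The child was not reaped above, so this must still succeed. */
	if (waitpid(pid, &status, WNOHANG) != pid)
		return 0;
	return WIFEXITED(status) && WEXITSTATUS(status) == 7;
}
```

Because the wait is done per pid, it does not matter whether the zombie's
SIGCHLD was merged with a helper's.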

Thanks,
Andrew

> 
> Tycho