[CRIU] hang when restoring container with zombies

Pavel Emelyanov xemul at parallels.com
Mon Jul 20 05:40:36 PDT 2015


On 07/20/2015 03:16 PM, Andrew Vagin wrote:
> On Fri, Jul 17, 2015 at 09:20:28PM +0300, Pavel Emelyanov wrote:
>> On 07/17/2015 07:41 PM, Tycho Andersen wrote:
>>> On Fri, Jul 17, 2015 at 07:15:55PM +0300, Pavel Emelyanov wrote:
>>>> On 07/17/2015 06:36 PM, Tycho Andersen wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm experiencing a hang when restoring a process with zombies; the
>>>>> zombies exit, but the parent process (in this case the container's
>>>>> init) isn't getting the SIGCHLD, so it just gets stuck waiting for
>>>>> zombies_inprogress. The parent process' /proc/pid/status is below, and
>>>>> it doesn't seem to be blocking SIGCHLD and there are no pending
>>>>> signals. I stuck a printf in the sigchld_handler in the restorer blob,
>>>>> and it does get called for sid helpers, but not for the zombie
>>>>> processes.
>>>>>
>>>>> Does anyone have any ideas about what's going wrong? I have no idea
>>>>> why the signal would be blocked.
>>>>
>>>> Maybe it's not blocked but merged with another (previous) SIGCHLD?
>>>
>>> Yep, I just traced it and that's exactly what's happening. So I think
>>> the right thing to do here is to waitpid() in a loop in the restorer
>>> blob's sigchld_handler so that we make sure to collect all the
>>> processes that have died?
>>
>> No, we can't call waitpid() in pie/restore.c's sigchld_handler()
>> since we must _leave_ the zombie in zombie state :)
> 
> We can try to use waitid with WNOWAIT
> """
> WNOWAIT     Leave the child in a waitable state; a later wait call can
>             be used to again retrieve the child status information.
> """

This would mean reworking the existing zombie wait logic. I'm OK with
that, but would appreciate it if we first found out what's wrong with the
existing one :)

-- Pavel


