[CRIU] hang when restoring container with zombies

Tycho Andersen tycho.andersen at canonical.com
Mon Jul 20 08:48:43 PDT 2015


On Mon, Jul 20, 2015 at 06:39:29PM +0300, Andrey Wagin wrote:
> 2015-07-20 18:31 GMT+03:00 Tycho Andersen <tycho.andersen at canonical.com>:
> > On Mon, Jul 20, 2015 at 05:53:59PM +0300, Pavel Emelyanov wrote:
> >> On 07/20/2015 04:22 PM, Andrew Vagin wrote:
> >>
> >> >>>
> >> >>> I'll try to send a patch with WNOWAIT.
> >> >>
> >> >> I guess another option would be to rename zombie_lock to something
> >> >> else, and just have the helpers block on that lock too. I can do
> >> >> either, let me know which you prefer.
> >> >
> >> > I vote for waitid. In this case we will be able to remove zombie_lock.
> >>
> >> I agree. But keep in mind, that the sigchild handler is also responsible
> >> for picking up tasks that died due to some errors and propagating this
> >> error back to criu main process.
> >
> > Yes, I think we can be more explicit about this by passing a list of
> > the zombies to the restorer blob (vs. just assuming anything during
> > CR_STATE_RESTORE_SIGCHILD that dies is a zombie) if you want, sort of
> > how we do it with the helpers. I think we can just make one list of
> > expected_deaths and use that for zombies and helpers and waitid() on
> > it with WNOHANG | WNOWAIT.
> 
> We need to use WNOWAIT only for zombies. And I don't understand why
> we need to use WNOHANG if we want to wait for events.

I guess it depends on how we implement it. I was thinking of leaving
the zombie code in the signal handler and waking the futex from
there, but we could ignore zombies in the signal handler and do a
blocking wait instead of blocking on the futex. That sounds simpler,
and gets rid of a futex, so I'm all for it.
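
Roughly what I have in mind, as an untested sketch only. The names
struct expected_death, spawn_exiting_child() and wait_expected_deaths()
are made up for illustration and are not existing CRIU code; the real
restorer blob would presumably issue the raw syscall rather than go
through libc. It does a blocking waitid() per expected child (no
WNOHANG), adding WNOWAIT only for zombies so they stay zombies:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical record for one child we expect to die during restore. */
struct expected_death {
	pid_t pid;
	bool is_zombie;	/* true: must remain a zombie; false: helper, reap it */
};

/* Demo helper (hypothetical): fork a child that exits immediately. */
static pid_t spawn_exiting_child(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);
	return pid;	/* -1 on fork failure */
}

/*
 * Block until every listed child has exited.
 * Zombies are only peeked at (WNOWAIT), so they stay in the process
 * table; helpers are reaped and must have exited with status 0.
 * Returns 0 on success, -1 on waitid() failure or a bad helper exit.
 */
static int wait_expected_deaths(const struct expected_death *list, int n)
{
	for (int i = 0; i < n; i++) {
		siginfo_t info = {0};
		int flags = WEXITED;	/* blocking: no WNOHANG */

		if (list[i].is_zombie)
			flags |= WNOWAIT;

		if (waitid(P_PID, list[i].pid, &info, flags) < 0)
			return -1;
		assert(info.si_pid == list[i].pid);

		if (!list[i].is_zombie &&
		    (info.si_code != CLD_EXITED || info.si_status != 0))
			return -1;
	}
	return 0;
}
```

With this, the signal handler would only need to care about children
that are not on the list, i.e. tasks that died because of an error.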

Tycho

