[CRIU] crash in pb_read_one?

Tue Sep 16 12:32:58 PDT 2014

Hi Pavel,

On Tue, Sep 16, 2014 at 10:07:52PM +0400, Pavel Emelyanov wrote:
> On 09/16/2014 09:44 PM, Tycho Andersen wrote:
> > Hi Pavel,
> > 
> > On Tue, Sep 16, 2014 at 12:02:19PM -0500, Tycho Andersen wrote:
> >>>
> >>> Hm... This should happen strictly after all files from this
> >>> helper have been opened. This can be pretty well determined by the
> >>> remap->users count. Next, when creating such helpers we can feed
> >>> 0 into clone flag's exit_signal field, thus causing this particular
> >>> child to auto-reap, so once the remap->users count hits zero we
> >>> can just shoot it with SIGKILL.
> >>
> >> Ah, that sounds like a better approach. Actually I don't think we need
> >> to shoot it, we can just synchronize it to the end of the RESTORE
> >> stage and it should Just Work. I will give that a try, seems much
> >> cleaner than messing around with rst memory.
> 
> Hm... Then we don't need the users counter as well. Just auto-reap.
> 
> > Actually it looks like the clone flags for the helpers are 0, but they
> > still aren't auto-reaped when they exit (i.e. they are zombies, which
> > need a wait() call). What am I missing?
> 
> ret = clone(restore_task_with_children, ca.stack_ptr,
>                         ca.clone_flags | SIGCHLD, &ca);
> 
> This "| SIGCHLD" defeats auto-reap.

When I do this, I get something like the following in the log:

pie: 5: Collect a zombie with (pid 17, 17)

I think this means it is working, but we still need to pass down the
helper PIDs so that we can ignore them when they are reaped by the
restorer blob's handler. Also, isn't there a race: if the restore
finishes entirely before the helper actually exits, won't the restored
process get a SIGCHLD? I think I am seeing something like this in the
session00 test.

Tycho