[CRIU] crash in pb_read_one?

Tycho Andersen tycho.andersen at canonical.com
Tue Sep 16 12:50:09 PDT 2014


On Tue, Sep 16, 2014 at 11:42:17PM +0400, Pavel Emelyanov wrote:
> On 09/16/2014 11:32 PM, Tycho Andersen wrote:
> > Hi Pavel,
> > 
> > On Tue, Sep 16, 2014 at 10:07:52PM +0400, Pavel Emelyanov wrote:
> >> On 09/16/2014 09:44 PM, Tycho Andersen wrote:
> >>> Hi Pavel,
> >>>
> >>> On Tue, Sep 16, 2014 at 12:02:19PM -0500, Tycho Andersen wrote:
> >>>>>
> >>>>> Hm... This somewhere should be strictly after all files from this
> >>>>> helper have been opened. This can be pretty well determined by the
> >>>>> remap->users count. Next, when creating such helpers we can feed
> >>>>> 0 into the clone flags' exit_signal field, thus causing this particular
> >>>>> child to auto-reap, so once the remap->users count hits zero we
> >>>>> can just shoot it with SIGKILL.
> >>>>
> >>>> Ah, that sounds like a better approach. Actually I don't think we need
> >>>> to shoot it; we can just synchronize it to the end of the RESTORE
> >>>> stage and it should Just Work. I will give that a try; it seems much
> >>>> cleaner than messing around with rst memory.
> >>
> >> Hm... Then we don't need the users counter either. Just auto-reap.
> >>
> >>> Actually it looks like the clone flags for the helpers are 0, but they
> >>> still aren't auto-reaped when they exit (i.e. they are zombies, which
> >>> need a wait() call). What am I missing?
> >>
> >> ret = clone(restore_task_with_children, ca.stack_ptr,
> >>                         ca.clone_flags | SIGCHLD, &ca);
> >>
> >> This "| SIGCHLD" reaps auto-reap.
> > 
> > When I do this I get something like,
> > 
> > pie: 5: Collect a zombie with (pid 17, 17)
> > 
> > in the log. I think this means it is working, but that we still need
> > to pass down the helper PIDs so that we can ignore them when they are
> 
> :( Can we make all these helpers be root's children, so that only the
> root task needs this list?

I don't know. Some helpers are used to restore gids and sids, and so
they (might?) need to be in the right place in the tree.
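
For what it's worth, a toy fork()/setsid() sketch of the sid part (not
the restorer's code; the names and structure are just for illustration):
a session id can only be inherited from an ancestor that called
setsid(), so the helper has to sit above the restored task in the tree.

    /* Toy illustration: the helper creates a session so that the task
     * forked underneath it inherits the sid. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t helper = fork();
            if (helper == 0) {
                    setsid();               /* helper becomes session leader */
                    pid_t task = fork();
                    if (task == 0) {
                            /* "restored" task: inherits the helper's new sid */
                            printf("task %d has sid %d\n", getpid(), getsid(0));
                            _exit(0);
                    }
                    waitpid(task, NULL, 0);
                    _exit(0);               /* helper is only needed for the setsid() */
            }
            waitpid(helper, NULL, 0);
            return 0;
    }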

> > reaped by the restorer blob's handler. Also, isn't there a race where,
> > if the restore finishes entirely before the helper actually dies,
> > the restored process gets a SIGCHLD?
> 
> We've solved this with stages. I can't tell you the full story, it
> was quite a while ago :) but the final staging we have right now
> does prevent restored tasks from seeing "wrong" handlers or
> alien signals.

Yes, this is very sticky. I've been messing with it on and off for a
week and still haven't gotten it right :). I think the original
solution at the start of this thread might be the best one; I just
don't know why the crash was happening there.
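
In case it helps frame that, here is a toy version of the "synchronize
the helper to the end of restore" idea from earlier in the thread. A
pipe stands in for whatever stage synchronization the restorer would
actually use; the point is just that the helper cannot exit (and so
cannot deliver a stray SIGCHLD) while the restore is still in flight.

    /* Toy version: the helper parks on a pipe until the parent says the
     * restore is done, then exits and is reaped before control is handed
     * over to the restored tasks. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            int sync_fd[2];
            char c;

            if (pipe(sync_fd))
                    return 1;

            pid_t helper = fork();
            if (helper == 0) {
                    close(sync_fd[1]);
                    /* ... whatever the helper exists to do happens here ... */
                    read(sync_fd[0], &c, 1);    /* park until restore finishes */
                    _exit(0);
            }
            close(sync_fd[0]);

            /* ... the rest of the restore runs here ... */

            write(sync_fd[1], "x", 1);          /* release the helper */
            waitpid(helper, NULL, 0);           /* reap it before handing over */
            printf("helper %d collected\n", helper);
            return 0;
    }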

Tycho

> > I think I am seeing something like this in the session00 test.
> > 
> > Tycho
> > .
> > 
> 
