[CRIU] crash in pb_read_one?

Pavel Emelyanov xemul at parallels.com
Tue Sep 16 09:52:35 PDT 2014


On 09/16/2014 08:48 PM, Tycho Andersen wrote:
> On Tue, Sep 16, 2014 at 08:43:21PM +0400, Pavel Emelyanov wrote:
>> On 09/16/2014 07:16 PM, Tycho Andersen wrote:
>>> Hi all,
>>>
>>> While working on fixing the missing pid patchset, I'm experiencing a
>>> crash in pb_read_one (a segfault). The patch that is causing it is:
>>>
>>> https://github.com/tych0/criu/commit/a3f901f116d951ce369cc6355aafd4cecd398c79
>>> (the branch is missing-pid)
>>
>> In your patch I've notices, that you call rst_mem_remap_ptr() on
>> the area allocated by plain xmalloc(). This won't work. Remapping
>> works only on rst_mem_alloc()-ed regions.
> 
> Ah, that is good to know :)
> 
>>> I get the following output for one of the tests on that branch:
>>>
>>> criu:/tmp/criu/test missing-pid 1 sudo ./zdtm.sh ns/static/session00
>>> ================================= CRIU CHECK =================================
>>> Error (timerfd.c:56): timerfd: No timerfd support for c/r: Inappropriate ioctl for device
>>> Error (cr-check.c:269): fdinfo doesn't contain the mnt_id field
>>> ============================= WARNING =============================
>>> Not all features needed for CRIU are merged to upstream kernel yet,
>>> so for now we maintain our own branch which can be cloned from:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/gorcunov/linux-cr.git
>>> ===================================================================
>>> Execute zdtm/live/static/session00
>>> ./session00 --pidfile=session00.pid --outfile=session00.out
>>> /tmp/criu/test
>>> Dump 25642
>>> Restore
>>> cat: /proc/25686/maps: No such file or directory
>>> Check results 25686
>>> ./zdtm.sh: line 462: kill: (25686) - No such process
>>> Unable to stop session00 (25686)
>>> Test: zdtm/live/static/session00, Result: FAIL
>>> ==================================== ERROR ====================================
>>> Test: zdtm/live/static/session00, Namespace: 1
>>> Dump log   : /tmp/criu/test/dump/static/session00/25642/1/dump.log
>>> --------------------------------- grep Error ---------------------------------
>>> ------------------------------------- END -------------------------------------
>>> Restore log: /tmp/criu/test/dump/static/session00/25642/1/restore.log
>>> --------------------------------- grep Error ---------------------------------
>>> (00.044859)      1: Error (cr-restore.c:2155): before pb_read_one in 25686
>>> (00.045951)     14: Error (cr-restore.c:2155): before pb_read_one in 25695
>>> (00.045960)     14: Error (cr-restore.c:2157): after pb_read_one in 25695
>>> (00.045979)      5: Error (cr-restore.c:2155): before pb_read_one in 25692
>>> (00.045986)      5: Error (cr-restore.c:2157): after pb_read_one in 25692
>>> (00.046197)     11: Error (cr-restore.c:2155): before pb_read_one in 25693
>>> (00.046204)     11: Error (cr-restore.c:2157): after pb_read_one in 25693
>>> (00.046591)      8: Error (cr-restore.c:2155): before pb_read_one in 25699
>>> (00.046604)      8: Error (cr-restore.c:2157): after pb_read_one in 25699
>>> (00.046843)     16: Error (cr-restore.c:2155): before pb_read_one in 25702
>>> (00.046854)     16: Error (cr-restore.c:2157): after pb_read_one in 25702
>>> (00.046875)     15: Error (cr-restore.c:2155): before pb_read_one in 25701
>>> (00.046883)     15: Error (cr-restore.c:2157): after pb_read_one in 25701
>>> (00.047400)     12: Error (cr-restore.c:2155): before pb_read_one in 25696
>>> (00.047407)     12: Error (cr-restore.c:2157): after pb_read_one in 25696
>>> (00.047722)      9: Error (cr-restore.c:2155): before pb_read_one in 25698
>>> (00.047731)      9: Error (cr-restore.c:2157): after pb_read_one in 25698
>>> (00.048236)     13: Error (cr-restore.c:2155): before pb_read_one in 25700
>>> (00.048245)     13: Error (cr-restore.c:2157): after pb_read_one in 25700
>>> (00.048584)      7: Error (cr-restore.c:2155): before pb_read_one in 25697
>>> (00.048591)      7: Error (cr-restore.c:2157): after pb_read_one in 25697
>>> (00.187746) Error (cr-restore.c:1162): 25686 killed by signal 11
>>> (00.187815) Error (cr-restore.c:1787): Restoring FAILED.
>>> ------------------------------------- END -------------------------------------
>>> ================================= ERROR OVER =================================
>>>
>>> It looks like the init task is crashing when reading its creds. Does anyone
>>> have any idea why this might be?
>>
>> But according to before/after pb_read_one logs crash happens after it.
> 
> It is task 1 that is crashing, which never prints after pb_read_one.
> 
>>> Also, the reason I need to pass the child pids into the restorer blob
>>> is because of shmem; we can't block after restore_fs(), because shmem
>>> is restored inside the restore blob, and some calls in restore_fs
>>> block until shmem is restored.
>>
>> I'm not sure I understand this. These are helpers, why do we care about
>> shmem/fs when talking about them?
> 
> We don't. The problem is that we need to wait on the helpers
> somewhere. 

Hm... This somewhere should be strictly after all files from this
helper has been opened. This can be pretty well determined by the
remap->users count. Next, when creating such helpers we can feed
0 into clone flag's exit_signal field, thus causing this particular
child to auto-reap, so once the remap->users count hits zero we
can just shoot it with SIGKILL.

> In my original patch we had a stage right after
> restore_fs() that blocked everything until after all restore_fs calls
> were complete. The problem is that in shmem_wait_and_open(), we block
> until shmem is set up, but that is only set up after restore_fs(), in
> the actual restorer blob. So if we synchronize right after
> restore_fs(), we deadlock.
> 
> Tycho
> 
>>> We can't wait on the pstree helpers
>>> until after restore_fs is done, so I think the best solution is to
>>> just wait on them after the restore stage is over.
>>>
>>> Any thoughts are much appreciated, I've been messing around with this
>>> for a while and still haven't managed to get it right.
>>>
>>> Tycho
>>> _______________________________________________
>>> CRIU mailing list
>>> CRIU at openvz.org
>>> https://lists.openvz.org/mailman/listinfo/criu
>>> .
>>>
>>
> .
> 



More information about the CRIU mailing list