[CRIU] problem in post_open_unix_sk?

Tycho Andersen tycho.andersen at canonical.com
Tue Jul 14 08:14:48 PDT 2015


On Tue, Jul 14, 2015 at 03:04:33PM +0300, Pavel Emelyanov wrote:
> On 07/01/2015 02:38 AM, Tycho Andersen wrote:
> > Hi Pavel,
> > 
> > On Wed, Jun 24, 2015 at 12:17:10AM +0300, Pavel Emelyanov wrote:
> >> On 06/23/2015 09:25 PM, Tycho Andersen wrote:
> >>> Hi all,
> >>>
> >>> When trying to c/r wily (i.e. systemd-based) containers, I'm getting
> >>> crashes like the following sometimes:
> >>>
> >>> http://paste.ubuntu.com/11763264/
> >>> (00.135271)    396: Error (sk-unix.c:701): Can't connect 0x6d126 socket: Connection refused
> >>>
> >>> Any ideas as to what this might be?
> >>
> >> I've found several reasons for ECONNREFUSED
> >>
> >> 1. Target path/name doesn't exist
> >> 2. Target path is not a socket
> >> 3. Target stream socket is not listen()-ing
> >> 4. Target dgram socket is connect()-ed to someone else
> >>
> >> Can you check the images for what kind of sockets CRIU is trying to
> >> interconnect?
> > 
> > http://paste.ubuntu.com/11801928/ is the contents of unixsk.img (the
> > log is http://paste.ubuntu.com/11801992/). In particular, it looks
> > like this is the culprit:
> > 
> [snip]
> > 
> > I'm not sure how the path is interpreted though; it's not in the root,
> > in any case, and the base64 decoding is just an integer. Its peer's
> > name is the base64 encoding of /run/systemd/private, so I guess this
> > is a client of that?
> 
> Ah, I see :) This is a race. Look at the logs:
> 
> (00.135950)      1: Opening standalone socket (id 0xb ino 0x9422f peer 0)
> (00.135974)    353: Error (sk-unix.c:701): Can't connect 0x947c4 socket: Connection refused
> (00.136390)      1: Error (cr-restore.c:1228): 353 exited, status=1
> (00.136407)      1: 	Putting 0x9422f into listen state
> 
> The socket 0x9422f is in listen state (according to the image) and
> 0x947c4 is connected to it. On restore, process 1 opens the first socket,
> then 353 tries to connect the second to it and fails, since the target
> socket is NOT (yet) listening. Putting it into listen state happens
> later -- the last line in the logs says that.
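
The failure mode itself is easy to see outside of CRIU. Below is a minimal
sketch, not CRIU code: it binds a stream socket, tries to connect to it
before and after listen(), and records the errno each time. The function
names and the abstract socket name are made up for illustration, and it
assumes Linux (abstract-namespace addresses).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Build a Linux abstract-namespace address, so nothing is left on disk. */
static socklen_t make_abstract_addr(struct sockaddr_un *addr, const char *name)
{
	size_t len = strlen(name);

	assert(len < sizeof(addr->sun_path) - 1);
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0', which marks the name as abstract. */
	memcpy(addr->sun_path + 1, name, len);
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
}

/* Connect a fresh stream socket; return 0 or the errno that was seen. */
static int try_connect(const struct sockaddr_un *addr, socklen_t len)
{
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	int err = 0;

	if (fd < 0)
		return errno;
	if (connect(fd, (const struct sockaddr *)addr, len) < 0)
		err = errno;
	close(fd);
	return err;
}

/*
 * Stores the connect() result seen before and after listen().
 * Returns 0 if the setup worked, -1 otherwise.
 */
int demo_connect_before_listen(int *before, int *after)
{
	struct sockaddr_un addr;
	char name[64];
	socklen_t len;
	int srv;

	/* Hypothetical name, made unique per process. */
	snprintf(name, sizeof(name), "econnrefused-demo-%d", (int)getpid());
	len = make_abstract_addr(&addr, name);

	srv = socket(AF_UNIX, SOCK_STREAM, 0);
	if (srv < 0)
		return -1;
	if (bind(srv, (struct sockaddr *)&addr, len) < 0) {
		close(srv);
		return -1;
	}

	*before = try_connect(&addr, len);	/* bound, but not listening */

	if (listen(srv, 1) < 0) {
		close(srv);
		return -1;
	}
	*after = try_connect(&addr, len);	/* queued in the backlog */

	close(srv);
	return 0;
}
```

Calling demo_connect_before_listen(&before, &after) should leave
ECONNREFUSED in "before" and 0 in "after", which matches the ordering of
the last four log lines quoted above.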

Ah, yep. Looks like the right solution would be to add a

futex_t listen;

to unix_sk_info, wait on it when we are connect()ing, and signal it
once we listen()?
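
Roughly, as a standalone sketch rather than an actual patch: the struct
and helper names below (listen_flag, flag_wait, flag_set_and_wake) are
hypothetical, and raw futex syscalls on a shared mapping stand in for
whatever futex_t helpers CRIU already has. The child plays the role of
the connect()ing task, the parent the one that listen()s late.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the proposed "futex_t listen" field. */
struct listen_flag {
	_Atomic uint32_t listening;
};

static void flag_wait(struct listen_flag *f)
{
	assert(sizeof(f->listening) == sizeof(uint32_t));
	/* FUTEX_WAIT returns at once if the word is no longer 0. */
	while (atomic_load(&f->listening) == 0)
		syscall(SYS_futex, &f->listening, FUTEX_WAIT, 0, NULL, NULL, 0);
}

static void flag_set_and_wake(struct listen_flag *f)
{
	atomic_store(&f->listening, 1);
	syscall(SYS_futex, &f->listening, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

/* Returns the errno seen by the connecting child (0 on success), -1 on setup failure. */
int demo_wait_for_listen(void)
{
	struct listen_flag *flag;
	struct sockaddr_un addr;
	socklen_t len;
	int srv, status = 0, ret = -1;
	pid_t pid;

	/* Shared between parent and child, zero-filled by the kernel. */
	flag = mmap(NULL, sizeof(*flag), PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (flag == MAP_FAILED)
		return -1;

	/* Abstract (Linux-only) address with a made-up, per-process name. */
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
		 "listen-flag-demo-%d", (int)getpid());
	len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(addr.sun_path + 1);

	srv = socket(AF_UNIX, SOCK_STREAM, 0);
	if (srv < 0 || bind(srv, (struct sockaddr *)&addr, len) < 0)
		goto out;

	pid = fork();
	if (pid < 0)
		goto out;
	if (pid == 0) {
		int fd, err = 0;

		flag_wait(flag);	/* block until the peer listen()s */
		fd = socket(AF_UNIX, SOCK_STREAM, 0);
		if (fd < 0 || connect(fd, (struct sockaddr *)&addr, len) < 0)
			err = errno;
		_exit(err);
	}

	usleep(50000);		/* deliberately widen the race window */
	if (listen(srv, 1) == 0)
		flag_set_and_wake(flag);
	else
		flag_set_and_wake(flag);	/* never leave the child stuck */

	if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = WEXITSTATUS(status);
out:
	if (srv >= 0)
		close(srv);
	munmap(flag, sizeof(*flag));
	return ret;
}
```

With the wait in place, demo_wait_for_listen() should return 0 even though
the listen() is delayed; without the flag_wait() call the child would be
expected to hit ECONNREFUSED, as in the log.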

Tycho

