[CRIU] problem restoring unix queues?

Mon Jul 20 10:45:19 PDT 2015

On Mon, Jul 20, 2015 at 08:08:35PM +0300, Pavel Emelyanov wrote:
> On 07/20/2015 06:25 PM, Tycho Andersen wrote:
> > On Mon, Jul 20, 2015 at 05:43:38PM +0300, Pavel Emelyanov wrote:
> >> On 07/20/2015 05:28 PM, Tycho Andersen wrote:
> >>> On Mon, Jul 20, 2015 at 02:13:52PM +0300, Pavel Emelyanov wrote:
> >>>> On 07/18/2015 02:17 AM, Tycho Andersen wrote:
> >>>>> On Fri, Jul 17, 2015 at 07:13:55PM +0300, Pavel Emelyanov wrote:
> >>>>>> On 07/17/2015 05:33 PM, Tycho Andersen wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Sometimes I see something like:
> >>>>>>>
> >>>>>>> (00.095976)     77: Error (sk-queue.c:238): Failed to send packet: Resource temporarily unavailable
> >>>>>>>
> >>>>>>> when restoring (full log here: http://paste.ubuntu.com/11893011/).
> >>>>>>>
> >>>>>>> The sk-queues.img is: http://paste.ubuntu.com/11893015/ so I don't
> >>>>>>> /think/ it should be filling the buffer, so I'm not sure why we'd get
> >>>>>>> EAGAIN.
> >>>>>>>
> >>>>>>> Thoughts?
> >>>>>>
> >>>>>> Is it stream or datagram socket? AFAIK the EAGAIN is only reported when
> >>>>>> you hit socket buffer limit, but we try to raise one :\
> >>>>>
> >>>>> Assuming the "id" is the same as "id_for" above, it looks like it is
> >>>>> SOCK_DGRAM (output from unixsk.img):
> >>>>>
> >>>>>         {
> >>>>>             "id": 14, 
> >>>>>             "ino": 16798503, 
> >>>>>             "type": 2, # SOCK_DGRAM == 2
> >>>>>             "state": 7, 
> >>>>>             "flags": "0x80802", 
> >>>>>             "uflags": "0x0", 
> >>>>>             "backlog": 0, 
> >>>>>             "peer": 0, 
> >>>>>             "fown": {
> >>>>>                 "uid": 0, 
> >>>>>                 "euid": 0, 
> >>>>>                 "signum": 0, 
> >>>>>                 "pid_type": 0, 
> >>>>>                 "pid": 0
> >>>>>             }, 
> >>>>>             "opts": {
> >>>>>                 "so_sndbuf": 212992, 
> >>>>>                 "so_rcvbuf": 16777216, 
> >>>>>                 "so_snd_tmo_sec": 0, 
> >>>>>                 "so_snd_tmo_usec": 0, 
> >>>>>                 "so_rcv_tmo_sec": 0, 
> >>>>>                 "so_rcv_tmo_usec": 0, 
> >>>>>                 "reuseaddr": true, 
> >>>>>                 "so_priority": 0, 
> >>>>>                 "so_rcvlowat": 1, 
> >>>>>                 "so_mark": 0, 
> >>>>>                 "so_passcred": true, 
> >>>>>                 "so_passsec": true, 
> >>>>>                 "so_dontroute": false, 
> >>>>>                 "so_no_check": false
> >>>>>             }, 
> >>>>>             "name": "L3J1bi9zeXN0ZW1kL2pvdXJuYWwvc29ja2V0AA==\n", 
> >>>>>             "file_perms": {
> >>>>>                 "mode": 49590, 
> >>>>>                 "uid": 0, 
> >>>>>                 "gid": 0
> >>>>>             }
> >>>>>         }, 
> >>>>>
> >>>>> Based on my read of the criu/kernel sources, we're setting
> >>>>> sk->sk_sndbuf, while the kernel source in unix_dgram_sndmsg is checking
> >>>>> sk->sk_receive_queue for sending EAGAIN; since we pass the backlog of
> >>>>> 0 to listen() (viz. above), I guess this is what's causing the
> >>>>> problem? Seems like a backlog of 0 must be wrong, since the queue is
> >>>>> of non-zero length.
> >>>>
> >>>> Wait a second, if you send packets to socket it cannot be in listen
> >>>> state. Moreover, according to your image the state of the socket is
> >>>> 7 which is CLOSED :)
> >>>
> >>> Hrm. So perhaps we should not be sending stuff to this socket at all?
> >>> Should we add a check for TCP_CLOSED to guard the call to
> >>> dump_sk_queue?
> >>
> >> We dump only read queue that can be alive for closed sockets. E.g. the
> >> test zdtm/live/static/socket02.c does exactly that.
> > 
> > I see. So isn't it possible then that the socket client end was
> > created from a server that did listen(..., 0), 
> 
> But sockets we're having here are SOCK_DGRAM, aren't they?
> 
> > and that's what is
> > causing the problem? The backlog is the only thing that I can see that
> > can cause EAGAIN from unix_dgram_sendmsg().
> > 
> > For the example above, it looks like the socket that did a listen().
> > The peer that is failing to send is:
> > 
> > # 0x1005327 == 16798503, the peer above
> > # 0x1005ffe == 16801790, the peer below
> > 
> > (00.095929)     77:   Connect 0x1005ffe to 0x1005327
> > (00.095937)     77: Trying to restore recv queue for 14
> > (00.095949)     77:   Restoring 357-bytes skb for 14
> > (00.095976)     77: Error (sk-queue.c:238): Failed to send packet: Resource temporarily unavailable
> 
> Ah it looks like I finally got what you mean :)
> 
> There are two places in unix_dgram_sendmsg that result in EAGAIN.
> First is hitting send socket sndbuf, we (seem to) address that
> by raising it with sockopt. But there's another one -- check for
> unix_recvq_full(other). It checks for the number of packets in
> queue doesn't exceed the max_ack_backlog :(
> 
> Are you talking about it?

Yes :)

> If yes, how has it happened that datagram socket has more data
> packets (as we see on dump) than this limit?

I'm not sure. I wonder if something is getting screwed up because both
ends of the socket are closed (it looks like parse_rtattr zeros the
buffer it writes to, so maybe that's where the zero is coming from?).
It should be easy enough to test, I'll try to play with it this
afternoon.

Tycho

> >         {
> >             "id": 44, 
> >             "ino": 16801790, 
> >             "type": 2, 
> >             "state": 7, 
> >             "flags": "0x80002", 
> >             "uflags": "0x0", 
> >             "backlog": 0, 
> >             "peer": 16798503, 
> >             "fown": {
> >                 "uid": 0, 
> >                 "euid": 0, 
> >                 "signum": 0, 
> >                 "pid_type": 0, 
> >                 "pid": 0
> >             }, 
> >             "opts": {
> >                 "so_sndbuf": 16777216, 
> >                 "so_rcvbuf": 212992, 
> >                 "so_snd_tmo_sec": 10, 
> >                 "so_snd_tmo_usec": 0, 
> >                 "so_rcv_tmo_sec": 0, 
> >                 "so_rcv_tmo_usec": 0, 
> >                 "reuseaddr": false, 
> >                 "so_priority": 0, 
> >                 "so_rcvlowat": 1, 
> >                 "so_mark": 0, 
> >                 "so_passcred": false, 
> >                 "so_passsec": false, 
> >                 "so_dontroute": false, 
> >                 "so_no_check": false
> >             }, 
> >             "name": ""
> >         }
> > 
> > Both ends of this have a backlog of 0 (although only the "server" side
> > matters). So the socket id 14 does a listen(..., 0), socket id 44 does
> > a connect() to it, and then tries to send stuff to the queue and fails
> > with EAGAIN? (Looks like both ends of this are closed which is also
> > strange...)
> > 
> > Tycho
> > .
> > 
>