[CRIU] problem restoring unix queues?
Tycho Andersen
tycho.andersen at canonical.com
Wed Jul 22 08:41:29 PDT 2015
On Mon, Jul 20, 2015 at 11:45:19AM -0600, Tycho Andersen wrote:
> > But sockets we're having here are SOCK_DGRAM, aren't they?
> >
> > > and that's what is
> > > causing the problem? The backlog is the only thing that I can see that
> > > can cause EAGAIN from unix_dgram_sendmsg().
> > >
> > > For the example above, it looks like the socket that did a listen().
> > > The peer that is failing to send is:
> > >
> > > # 0x1005327 == 16798503, the peer above
> > > # 0x1005ffe == 16801790, the peer below
> > >
> > > (00.095929) 77: Connect 0x1005ffe to 0x1005327
> > > (00.095937) 77: Trying to restore recv queue for 14
> > > (00.095949) 77: Restoring 357-bytes skb for 14
> > > (00.095976) 77: Error (sk-queue.c:238): Failed to send packet: Resource temporarily unavailable
> >
> > Ah it looks like I finally got what you mean :)
> >
> > There are two places in unix_dgram_sendmsg that result in EAGAIN.
> > First is hitting send socket sndbuf, we (seem to) address that
> > by raising it with sockopt. But there's another one -- check for
> > unix_recvq_full(other). It checks for the number of packets in
> > queue doesn't exceed the max_ack_backlog :(
> >
> > Are you talking about it?
>
> Yes :)
>
> > If yes, how has it happened that datagram socket has more data
> > packets (as we see on dump) than this limit?
>
> I'm not sure. I wonder if something is getting screwed up because both
> ends of the socket are closed (it looks like parse_rtattr zeros the
> buffer it writes to, so maybe that's where the zero is coming from?).
> It should be easy enough to test, I'll try to play with it this
> afternoon.
So I played around with this yesterday and found some interesting
things, although I still have no idea what the cause is:
1. "state" always reads 7, even when the sockets have not been closed
(you can see this by running sockets_dgram and inspecting the image)
2. the backlog recorded in the images seems to have no bearing on the
value of sk_max_ack_backlog for the socket (perhaps it's something
else? I haven't tracked it down)
3. further, this backlog (as far as I can tell) can only be inherited
from the sysctl, and the default is 10 on my (most?) systems, so it
_shouldn't_ fail on the first write, since the queue still has room.
So perhaps it's something else? From reading the code, though, I'm not
sure where else the EAGAIN would come from...
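
For anyone who wants to poke at the receive-queue limit outside of
CRIU, here is a minimal sketch (Python, Linux only, not CRIU code). It
assumes the limit in question is the one unix_recvq_full() enforces and
that the per-socket default comes from the net.unix.max_dgram_qlen
sysctl; fill_dgram_queue() and max_dgram_qlen() are hypothetical helper
names. The sender does a one-way connect() to a bound receiver that
never reads, then sends 357-byte datagrams (the size from the log
above) without blocking until the kernel refuses one. It deliberately
avoids socketpair(): as far as I can tell from the kernel source, the
queue-length check is skipped when the receiver is itself connected
back to the sender.

```python
import errno
import os
import socket
import tempfile


def max_dgram_qlen(path="/proc/sys/net/unix/max_dgram_qlen"):
    """Return the sysctl value, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return None


def fill_dgram_queue(payload=b"x" * 357, limit=100000):
    """Send datagrams to a receiver that never reads.

    Returns (datagrams accepted, errno of the first failure or None
    if `limit` sends all succeeded).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rx.sock")
        rx = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        tx = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            rx.bind(path)
            # One-way connect: tx's peer is rx, but rx has no peer.
            tx.connect(path)
            # Non-blocking, so a full queue shows up as an error
            # instead of putting the sender to sleep.
            tx.setblocking(False)
            sent = 0
            while sent < limit:
                try:
                    tx.send(payload)
                except OSError as e:
                    return sent, e.errno
                sent += 1
            return sent, None
        finally:
            tx.close()
            rx.close()


if __name__ == "__main__":
    sent, err = fill_dgram_queue()
    print("max_dgram_qlen sysctl:", max_dgram_qlen())
    print("datagrams accepted before failure:", sent)
    print("errno:", errno.errorcode.get(err, err))
```

With the default of 10 I would expect the count to land just above the
sysctl value, since the check appears to be "more than the backlog".
If the sysctl has been raised, the sender's send buffer limit may be
hit first, and that also returns EAGAIN, so treat the count as
indicative only.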
Tycho