[CRIU] problem restoring unix queues?

Mon Jul 20 10:08:35 PDT 2015

On 07/20/2015 06:25 PM, Tycho Andersen wrote:
> On Mon, Jul 20, 2015 at 05:43:38PM +0300, Pavel Emelyanov wrote:
>> On 07/20/2015 05:28 PM, Tycho Andersen wrote:
>>> On Mon, Jul 20, 2015 at 02:13:52PM +0300, Pavel Emelyanov wrote:
>>>> On 07/18/2015 02:17 AM, Tycho Andersen wrote:
>>>>> On Fri, Jul 17, 2015 at 07:13:55PM +0300, Pavel Emelyanov wrote:
>>>>>> On 07/17/2015 05:33 PM, Tycho Andersen wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Sometimes I see something like:
>>>>>>>
>>>>>>> (00.095976)     77: Error (sk-queue.c:238): Failed to send packet: Resource temporarily unavailable
>>>>>>>
>>>>>>> when restoring (full log here: http://paste.ubuntu.com/11893011/).
>>>>>>>
>>>>>>> The sk-queues.img is: http://paste.ubuntu.com/11893015/ so I don't
>>>>>>> /think/ it should be filling the buffer, so I'm not sure why we'd get
>>>>>>> EAGAIN.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Is it stream or datagram socket? AFAIK the EAGAIN is only reported when
>>>>>> you hit socket buffer limit, but we try to raise one :\
>>>>>
>>>>> Assuming the "id" is the same as "id_for" above, it looks like it is
>>>>> SOCK_DGRAM (output from unixsk.img):
>>>>>
>>>>>         {
>>>>>             "id": 14, 
>>>>>             "ino": 16798503, 
>>>>>             "type": 2, # SOCK_DGRAM == 2
>>>>>             "state": 7, 
>>>>>             "flags": "0x80802", 
>>>>>             "uflags": "0x0", 
>>>>>             "backlog": 0, 
>>>>>             "peer": 0, 
>>>>>             "fown": {
>>>>>                 "uid": 0, 
>>>>>                 "euid": 0, 
>>>>>                 "signum": 0, 
>>>>>                 "pid_type": 0, 
>>>>>                 "pid": 0
>>>>>             }, 
>>>>>             "opts": {
>>>>>                 "so_sndbuf": 212992, 
>>>>>                 "so_rcvbuf": 16777216, 
>>>>>                 "so_snd_tmo_sec": 0, 
>>>>>                 "so_snd_tmo_usec": 0, 
>>>>>                 "so_rcv_tmo_sec": 0, 
>>>>>                 "so_rcv_tmo_usec": 0, 
>>>>>                 "reuseaddr": true, 
>>>>>                 "so_priority": 0, 
>>>>>                 "so_rcvlowat": 1, 
>>>>>                 "so_mark": 0, 
>>>>>                 "so_passcred": true, 
>>>>>                 "so_passsec": true, 
>>>>>                 "so_dontroute": false, 
>>>>>                 "so_no_check": false
>>>>>             }, 
>>>>>             "name": "L3J1bi9zeXN0ZW1kL2pvdXJuYWwvc29ja2V0AA==\n", 
>>>>>             "file_perms": {
>>>>>                 "mode": 49590, 
>>>>>                 "uid": 0, 
>>>>>                 "gid": 0
>>>>>             }
>>>>>         }, 
>>>>>
>>>>> Based on my read of the criu/kernel sources, we're setting
>>>>> sk->sk_sndbuf, while the kernel source in unix_dgram_sndmsg is checking
>>>>> sk->sk_receive_queue for sending EAGAIN; since we pass the backlog of
>>>>> 0 to listen() (viz. above), I guess this is what's causing the
>>>>> problem? Seems like a backlog of 0 must be wrong, since the queue is
>>>>> of non-zero length.
>>>>
>>>> Wait a second, if you send packets to socket it cannot be in listen
>>>> state. Moreover, according to your image the state of the socket is
>>>> 7 which is CLOSED :)
>>>
>>> Hrm. So perhaps we should not be sending stuff to this socket at all?
>>> Should we add a check for TCP_CLOSED to guard the call to
>>> dump_sk_queue?
>>
>> We dump only read queue that can be alive for closed sockets. E.g. the
>> test zdtm/live/static/socket02.c does exactly that.
> 
> I see. So isn't it possible then that the socket client end was
> created from a server that did listen(..., 0), 

But sockets we're having here are SOCK_DGRAM, aren't they?

> and that's what is
> causing the problem? The backlog is the only thing that I can see that
> can cause EAGAIN from unix_dgram_sendmsg().
> 
> For the example above, it looks like the socket that did a listen().
> The peer that is failing to send is:
> 
> # 0x1005327 == 16798503, the peer above
> # 0x1005ffe == 16801790, the peer below
> 
> (00.095929)     77:   Connect 0x1005ffe to 0x1005327
> (00.095937)     77: Trying to restore recv queue for 14
> (00.095949)     77:   Restoring 357-bytes skb for 14
> (00.095976)     77: Error (sk-queue.c:238): Failed to send packet: Resource temporarily unavailable

Ah it looks like I finally got what you mean :)

There are two places in unix_dgram_sendmsg that result in EAGAIN.
First is hitting send socket sndbuf, we (seem to) address that
by raising it with sockopt. But there's another one -- check for
unix_recvq_full(other). It checks for the number of packets in
queue doesn't exceed the max_ack_backlog :(

Are you talking about it?

If yes, how has it happened that datagram socket has more data
packets (as we see on dump) than this limit?

>         {
>             "id": 44, 
>             "ino": 16801790, 
>             "type": 2, 
>             "state": 7, 
>             "flags": "0x80002", 
>             "uflags": "0x0", 
>             "backlog": 0, 
>             "peer": 16798503, 
>             "fown": {
>                 "uid": 0, 
>                 "euid": 0, 
>                 "signum": 0, 
>                 "pid_type": 0, 
>                 "pid": 0
>             }, 
>             "opts": {
>                 "so_sndbuf": 16777216, 
>                 "so_rcvbuf": 212992, 
>                 "so_snd_tmo_sec": 10, 
>                 "so_snd_tmo_usec": 0, 
>                 "so_rcv_tmo_sec": 0, 
>                 "so_rcv_tmo_usec": 0, 
>                 "reuseaddr": false, 
>                 "so_priority": 0, 
>                 "so_rcvlowat": 1, 
>                 "so_mark": 0, 
>                 "so_passcred": false, 
>                 "so_passsec": false, 
>                 "so_dontroute": false, 
>                 "so_no_check": false
>             }, 
>             "name": ""
>         }
> 
> Both ends of this have a backlog of 0 (although only the "server" side
> matters). So the socket id 14 does a listen(..., 0), socket id 44 does
> a connect() to it, and then tries to send stuff to the queue and fails
> with EAGAIN? (Looks like both ends of this are closed which is also
> strange...)
> 
> Tycho
> .
>