[Devel] [PATCH VZ9] fs/fuse kio: fix hang rpc over rdma io
Alexey Kuznetsov
kuznet at virtuozzo.com
Thu May 8 14:11:39 MSK 2025
Ack
On Thu, May 8, 2025 at 12:26 PM Liu Kui <kui.liu at virtuozzo.com> wrote:
>
> When a large msg is sent over rdma and is waiting for the read ack from
> the peer, it is moved from rio->write_queue to rio->active_txs. However,
> pcs_rdma_next_timeout() does not check the msgs in rio->active_txs, so it
> does not return a correct timeout to rpc and the rpc timer is not started.
> If the peer becomes unresponsive, the msg in rio->active_txs can wait for
> the read ack forever: the calendar timer cannot kill it, because the msg
> is now under network I/O. As a result, the rpc can hang forever without
> detecting the msg stuck in the underlying rdma io.
>
> pcs_rdma_next_timeout() should therefore return the next timeout based on
> the first msg in rio->active_txs.
>
> Fixes: #VSTOR-105982
> https://virtuozzo.atlassian.net/browse/VSTOR-105982
>
> Signed-off-by: Liu Kui <kui.liu at virtuozzo.com>
> ---
> fs/fuse/kio/pcs/pcs_rdma_io.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fuse/kio/pcs/pcs_rdma_io.c b/fs/fuse/kio/pcs/pcs_rdma_io.c
> index 2755b13fb8a5..6fa38338ad0c 100644
> --- a/fs/fuse/kio/pcs/pcs_rdma_io.c
> +++ b/fs/fuse/kio/pcs/pcs_rdma_io.c
> @@ -1668,14 +1668,21 @@ static unsigned long pcs_rdma_next_timeout(struct pcs_netio *netio)
>  	struct pcs_rdmaio *rio = rio_from_netio(netio);
>  	struct pcs_rpc *ep = netio->parent;
>  	struct pcs_msg *msg;
> +	struct rio_tx *tx;
>
>  	BUG_ON(!mutex_is_locked(&ep->mutex));
>
> -	if (list_empty(&rio->write_queue))
> -		return 0;
> +	if (!list_empty(&rio->active_txs)) {
> +		tx = list_first_entry(&rio->active_txs, struct rio_tx, list);
> +		return tx->msg->start_time + rio->send_timeout;
> +	}
>
> -	msg = list_first_entry(&rio->write_queue, struct pcs_msg, list);
> -	return msg->start_time + rio->send_timeout;
> +	if (!list_empty(&rio->write_queue)) {
> +		msg = list_first_entry(&rio->write_queue, struct pcs_msg, list);
> +		return msg->start_time + rio->send_timeout;
> +	}
> +
> +	return 0;
>  }
>
> static int pcs_rdma_sync_send(struct pcs_netio *netio, struct pcs_msg *msg)
> --
> 2.39.5 (Apple Git-154)
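
For readers who want the selection order in isolation, here is a minimal
userspace sketch of the logic the patch describes. It is not the kernel
code: the model_* names and the simplified singly linked queues are
assumptions made for illustration (the real code uses list_head and a
struct rio_tx that points to its msg). Only the ordering is taken from
the patch: check active_txs first, then write_queue, else return 0.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queued message. */
struct model_msg {
	unsigned long start_time;	/* time the msg was started */
	struct model_msg *next;		/* next msg in the same queue */
};

/* Hypothetical stand-in for the rdma io state named in the patch. */
struct model_rio {
	struct model_msg *active_txs;	/* sent, waiting for read ack */
	struct model_msg *write_queue;	/* queued, not yet sent */
	unsigned long send_timeout;
};

/*
 * Return the deadline of the oldest pending msg, or 0 when nothing is
 * pending. Msgs in active_txs take priority, as in the patch.
 */
static unsigned long model_next_timeout(const struct model_rio *rio)
{
	if (rio->active_txs)
		return rio->active_txs->start_time + rio->send_timeout;

	if (rio->write_queue)
		return rio->write_queue->start_time + rio->send_timeout;

	return 0;
}
```

With an empty write_queue and one msg in active_txs, this model returns a
non-zero deadline, which is the case the pre-patch code reported as 0.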