
Conversation

@chenBright
Contributor

What problem does this PR solve?

Issue Number: resolve #3132

@chenBright
Contributor Author

We encountered the same problem as #3132. This PR's solution is similar to the one in #3132 (comment) and solves the problem. PTAL @sunce4t @yanglimingcn

@chenBright requested a review from Copilot November 9, 2025 08:39

Copilot AI left a comment


Pull Request Overview

This PR refactors the RDMA window management by splitting the single _window_size variable into two separate variables: _remote_rq_window_size (tracking remote receive queue capacity) and _sq_window_size (tracking local send queue capacity). This provides more granular control over flow control in RDMA operations.

  • Split window size tracking into separate remote RQ and local SQ windows
  • Added SendType enum to distinguish between different types of send operations
  • Enhanced logging to differentiate between client and server handshake completion

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Reviewed files:

  • src/brpc/rdma/rdma_endpoint.h: declared two new atomic variables, _remote_rq_window_size and _sq_window_size, replacing the single _window_size variable
  • src/brpc/rdma/rdma_endpoint.cpp: refactored window management logic to independently track local SQ and remote RQ capacity; added the SendType enum; improved handshake logging; optimized the zero-copy receive path
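
A minimal sketch of the window split described above (illustrative only: the two variable names follow the PR description, while CanSend and the surrounding logic are assumptions, not the PR's code):

    #include <atomic>

    // Credits for the peer's receive queue: how many messages the remote
    // side can still absorb.
    std::atomic<int> _remote_rq_window_size;
    // Free slots in the local send queue: how many WRs we can still post.
    std::atomic<int> _sq_window_size;

    // A send may proceed only when both windows have room; tracking them
    // separately lets each be replenished by its own signal (remote acks
    // vs. polled send completions).
    bool CanSend() {
        return _remote_rq_window_size.load(std::memory_order_relaxed) > 0 &&
               _sq_window_size.load(std::memory_order_relaxed) > 0;
    }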


@yanglimingcn
Contributor

Why is CQ divided into send_cq and recv_cq?

@chenBright
Contributor Author

We encountered an error: requests were not being sent, causing a large number of client timeouts.

 [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

From the RDMA connection information, we found that because ibv_req_notify_cq was called with solicited_only=1, send WCs did not trigger completion events. Without new recv CQEs to wake up the poller, the send WCs could not be polled, so _sq_window_size remained at 0. This is likely the reason why both the client and server were unable to send messages.

Using ibv_req_notify_cq with solicited_only=0 could solve this problem, but it would generate too many events. Therefore, we split the CQ into send_cq(solicited_only=0) and recv_cq(solicited_only=1).
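
A minimal sketch of the resulting setup (illustrative: ctx, channel, and the CQ depth are placeholders, not the PR's actual values; error handling is abbreviated):

    #include <infiniband/verbs.h>

    // Create separate CQs for the SQ and the RQ and arm them with
    // different solicitation policies.
    int setup_cqs(ibv_context* ctx, ibv_comp_channel* channel,
                  ibv_cq** send_cq, ibv_cq** recv_cq) {
        *send_cq = ibv_create_cq(ctx, 128, NULL, channel, 0);
        *recv_cq = ibv_create_cq(ctx, 128, NULL, channel, 0);
        if (*send_cq == NULL || *recv_cq == NULL) {
            return -1;
        }
        // send_cq: solicited_only=0, so every send completion raises an
        // event and send WCs can be polled even with no incoming traffic.
        if (ibv_req_notify_cq(*send_cq, 0) != 0) {
            return -1;
        }
        // recv_cq: solicited_only=1, so only solicited completions raise
        // events, keeping the event rate on the receive path low.
        if (ibv_req_notify_cq(*recv_cq, 1) != 0) {
            return -1;
        }
        return 0;
    }

The two CQs would then be passed as the send_cq and recv_cq fields of ibv_qp_init_attr when creating the QP.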


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.



@yanglimingcn
Contributor

We encountered an error: requests were not being sent, causing a large number of client timeouts.

 [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

From the RDMA connection information, we found that because ibv_req_notify_cq was called with solicited_only=1, send WCs did not trigger completion events. Without new recv CQEs to wake up the poller, the send WCs could not be polled, so _sq_window_size remained at 0. This is likely the reason why both the client and server were unable to send messages.

Using ibv_req_notify_cq with solicited_only=0 could solve this problem, but it would generate too many events. Therefore, we split the CQ into send_cq(solicited_only=0) and recv_cq(solicited_only=1).

With this modification, send_cq will generate one CQE for every 1/4 of the window?

@chenBright
Contributor Author

With this modification, send_cq will generate one CQE for every 1/4 of the window?

Yes.

@legionxiong
Contributor

legionxiong commented Nov 18, 2025

We encountered an error: requests were not being sent, causing a large number of client timeouts.

 [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

From the RDMA connection information, we found that because ibv_req_notify_cq was called with solicited_only=1, send WCs did not trigger completion events. Without new recv CQEs to wake up the poller, the send WCs could not be polled, so _sq_window_size remained at 0. This is likely the reason why both the client and server were unable to send messages.

Using ibv_req_notify_cq with solicited_only=0 could solve this problem, but it would generate too many events. Therefore, we split the CQ into send_cq(solicited_only=0) and recv_cq(solicited_only=1).

It is unnecessary to split the CQ into send_cq and recv_cq. The sliding window goes wrong because the precondition it relies on is not guaranteed in a RoCE environment. The sliding-window mechanism presumes that when the local app receives an ack from the remote app, the underlying SQ entry has already been released; in a lossless IB environment that may be true, but in a RoCE environment it is not guaranteed. The local device releases an SQ entry only when it receives the device-level ack for a message from the remote device, yet that ack might be lost even though the remote side has already processed the message, meaning the remote app has received it and sent an application-layer ack back. The local app then receives that application-layer ack and increases the window, while the device layer is still waiting for the device-level ack to release the SQ entry. So the window gets increased wrongly once or twice at a time, until it is completely wrong.
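
An illustrative sequence of this failure mode (a hypothetical timeline, not a captured trace):

1. The local device sends message M and waits for the device-level ACK.
2. The remote device delivers M to the remote app, but its device-level ACK is lost on the network.
3. The remote app processes M and sends an application-layer ack, which arrives normally.
4. The local app sees the app-layer ack and increases the window, while the local device still holds the SQ entry for M until the device-level ACK is retransmitted.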

@chenBright
Contributor Author

The local app then receives that application-layer ack and increases the window, while the device layer is still waiting for the device-level ack to release the SQ entry.

@legionxiong If the device-level ACK is lost, is it impossible to release the SQ?

It is unnecessary to split the CQ into send_cq and recv_cq. The sliding window goes wrong because the precondition it relies on is not guaranteed in a RoCE environment.

Is there a better solution?

@legionxiong
Contributor

legionxiong commented Nov 18, 2025

the device-level ACK is lost

The local side will resend the message whose ACK was not received within rdma_ack_timeout, and the remote side will do nothing but resend the ACK for a message it has already received. Maybe we need another variable that tracks IBV_WC_SEND events: it is guaranteed that SQ entries are released once we have polled an IBV_WC_SEND work completion. The number of unsignaled SQ WRs could be carried in wr_id.
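
A sketch of the wr_id idea (illustrative only; kSignalInterval and the helper are hypothetical, not code from this PR): signal every Nth send WR and store in its wr_id how many SQ slots that completion releases, i.e. the signaled WR plus the unsignaled ones before it.

    #include <infiniband/verbs.h>

    static const int kSignalInterval = 4;  // hypothetical batching factor
    static int g_unsignaled = 0;  // WRs posted since the last signaled one

    // Assumes *wr is zero-initialized before this call.
    void fill_send_wr(ibv_send_wr* wr) {
        wr->opcode = IBV_WR_SEND;
        if (++g_unsignaled >= kSignalInterval) {
            wr->send_flags |= IBV_SEND_SIGNALED;
            // Slots released when this WR's IBV_WC_SEND is polled.
            wr->wr_id = g_unsignaled;
            g_unsignaled = 0;
        } else {
            wr->wr_id = 0;  // unsignaled, produces no completion
        }
    }

On the polling side, an IBV_WC_SEND completion would then release wc.wr_id SQ slots, e.g. sq_window += (int)wc.wr_id.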

@chenBright
Contributor Author

chenBright commented Nov 18, 2025

Maybe we need another variable that tracks IBV_WC_SEND events: it is guaranteed that SQ entries are released once we have polled an IBV_WC_SEND work completion.

[E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

In this case, it seems we could not poll any IBV_WC_SEND work completion for a long time. With solicited_only=1, if the device-layer ACKs arrive after all the application-layer ACKs, we may be unable to poll IBV_WC_SEND after polling IBV_WC_RECV, because no new CQ events are generated.

Maybe we need another variable that tracks IBV_WC_SEND events

How can we check for IBV_WC_SEND events?

The number of unsignaled SQ WRs could be carried in wr_id.

This solution for updating the SQ window is better.

@randomkang

LGTM, @randomkang Please try the latest PR again; we haven't encountered the problem you mentioned.

The brpc version I used is 1.14.1; does it matter?

@yanglimingcn
Contributor

I don't think it matters; just focus on the modifications to the RDMA section.

@randomkang

randomkang commented Dec 8, 2025

I ran three tasks; they lasted 1742, 1551, and 1852 minutes, and the error "Fail to ibv_post_send: Cannot allocate memory" did not happen again. @yanglimingcn @chenBright

@yanglimingcn
Contributor

LGTM

@chenBright
Contributor Author

I ran three tasks; they lasted 1742, 1551, and 1852 minutes, and the error "Fail to ibv_post_send: Cannot allocate memory" did not happen again. @yanglimingcn @chenBright

@randomkang Is this using the latest commit?

@randomkang

Yes, I used commit 0794a476ee88e19f91cb9ce43166b4d2f90077b3.

@yanglimingcn
Contributor

In fact, I still don't understand the specific function of the 5/4 window in the latest PR.

@chenBright
Contributor Author

chenBright commented Dec 14, 2025

In fact, I still don't understand the specific function of the 5/4 window in the latest PR.

I think it would be more reasonable to add a sliding window to limit the number of IMM.

@randomkang Please try the latest PR again

@yanglimingcn
Contributor

So how should this window be sized reasonably? It seems the number of imm messages generated is not entirely predictable?

@chenBright
Contributor Author

So how should this window be sized reasonably? It seems the number of imm messages generated is not entirely predictable?

The window size is related to SendAck.

When _remote_window_capacity is divisible by 2, at most 2 imm messages are required.

When _remote_window_capacity is not divisible by 2, at most 3 imm messages are required.

Therefore, RESERVED_WR_NUM is a reasonable value.

ep->_remote_window_capacity =
    std::min(ep->_rq_size, remote_msg.sq_size) - RESERVED_WR_NUM,

int RdmaEndpoint::SendAck(int num) {
    if (_new_rq_wrs.fetch_add(num, butil::memory_order_relaxed) > _remote_window_capacity / 2) {
        return SendImm(_new_rq_wrs.exchange(0, butil::memory_order_relaxed));
    }
    return 0;
}
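
Reading the snippet above: fetch_add returns the counter's previous value, so an imm message is sent only once the number of newly posted RQ WRs exceeds half of _remote_window_capacity; exchange(0, ...) then resets the counter and returns the accumulated count, which is carried to the peer in the imm data.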

@yanglimingcn
Contributor

Since RESERVED_WR_NUM is already present, is the 5/4 no longer needed?

@chenBright
Contributor Author

Since RESERVED_WR_NUM is already present, is the 5/4 no longer needed?

Yes. The capacity reserved by RESERVED_WR_NUM is sufficient; with the imm window limiting the number of imm messages, the 5/4 factor is no longer needed.

@randomkang

randomkang commented Dec 17, 2025

In fact, I still don't understand the specific function of the 5/4 window in the latest PR.

I think it would be more reasonable to add a sliding window to limit the number of IMM.

@randomkang Please try the latest PR again

I ran 2 tasks, and the error "Fail to ibv_post_send: Cannot allocate memory" did not happen again.

One lasted 1790 minutes and failed due to "[wk-15] Thread E1217 16:32:30.900705 106575 12820477405150 external/brpc/src/brpc/rdma/rdma_helper.cpp:220 BlockAllocate] Fail to get block from memory pool". This error is not related to communication.

The other lasted 1787 minutes and failed due to "[tf-ps-7] W1217 04:18:39.676858 24875 78292958891373 tensorflow/contrib/pms/rpc/brpc_remote_worker.cc:383 operator()] Fail to recv tensorflow.RecvTensorResponse from 10.28.132.10:3642 : Aborted: [E10010][10.28.132.10:3642][E10010]Step 60851325530873895". I don't see a brpc error message; maybe this error is caused by TensorFlow. @chenBright

@randomkang

I ran three tasks; they lasted 1742, 1551, and 1852 minutes, and the error "Fail to ibv_post_send: Cannot allocate memory" did not happen again. @yanglimingcn @chenBright

By the way, these three tasks also failed in the end.

The first task failed due to "[wk-8] E1204 21:01:41.101131 72310 62642098032795 external/brpc/src/brpc/rdma/block_pool.cpp:362 AllocBlockFrom] Fail to extend new region. You can set the size of memory pool larger. Refer to the help message of these flags: rdma_memory_pool_initial_size_mb, rdma_memory_pool_increase_size_mb, rdma_memory_pool_max_regions." This error is not related to communication.

The second task failed, but I don't see any error message. Maybe it was killed by a machine failure.

The third task failed due to "tf op is stuck".

@chenBright
Contributor Author

@randomkang In summary, the error of "Fail to ibv_post_send: Cannot allocate memory" has been resolved. Right?

@randomkang

@randomkang In summary, the error of "Fail to ibv_post_send: Cannot allocate memory" has been resolved. Right?

Yes, I think so.

@chenBright
Contributor Author

@yanglimingcn Please review it again.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/brpc/rdma/rdma_endpoint.cpp:1520

  • After setting notified = true and requesting notifications for both CQs at lines 1497-1500, the code continues with the next iteration of the loop. However, at line 1520, notified is unconditionally reset to false when any completion events are found (cnt > 0). This means that if events arrive after notification but are found in the first poll, the notification state is incorrectly reset. This could lead to missing the re-notification step in subsequent iterations when no events are found.
                notified = true;
                continue;
            }
            if (!m->MoreReadEvents(&progress)) {
                break;
            }

            if (0 != ep->GetAndAckEvents(s)) {
                return;
            }

            // Restart polling from `recv_cq'.
            send = false;
            cq = ep->_resource->recv_cq;
            notified = false;
            continue;
        }
        notified = false;
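
For context, the loop above follows the usual ibverbs arming idiom: drain the CQ, arm it with ibv_req_notify_cq, then poll once more, because completions that arrive between the last empty poll and the arming call raise no event. A minimal sketch of that idiom (illustrative, not the actual code in rdma_endpoint.cpp):

    #include <infiniband/verbs.h>

    int drain_and_arm(ibv_cq* cq, int solicited_only) {
        ibv_wc wc;
        for (;;) {
            int n;
            while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
                // ... handle wc ...
            }
            if (n < 0) {
                return -1;  // polling error
            }
            if (ibv_req_notify_cq(cq, solicited_only) != 0) {
                return -1;
            }
            // Close the race window: if nothing slipped in while arming,
            // it is now safe to wait on the completion channel.
            if (ibv_poll_cq(cq, 1, &wc) == 0) {
                return 0;
            }
            // ... handle wc, then drain and re-arm again ...
        }
    }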


@yanglimingcn
Contributor

LGTM

@chenBright merged commit c7ae57a into apache:master Dec 22, 2025
17 checks passed
@chenBright deleted the fix_rdma_sq_window branch December 22, 2025 11:00


Development

Successfully merging this pull request may close these issues.

Sliding window bug: Fail to ibv_post_send: Cannot allocate memory, window=3, sq_current=98
