[CB] [Major] Asynchronous batching #43960

Open
remi-or wants to merge 35 commits into main from cb-async-lfg

@remi-or remi-or commented Feb 12, 2026

Summary

This PR adds asynchronous batching to continuous batching (CB). By using more VRAM together with CUDA streams and events, asynchronous batching greatly reduces the CPU overhead of preparing and updating batches by hiding it behind GPU compute. In practice, the GPU runs uninterrupted on the workloads we tested the feature on (8B model, many generated tokens).
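To illustrate the overlap idea only, here is a minimal CPU-only sketch, not the PR's implementation. A single worker thread stands in for the GPU compute stream, and waiting on a future stands in for synchronizing on a CUDA event. The names `prepare_batch`, `run_forward` and `generate_async` are hypothetical, and the sketch ignores the fact that a real next batch depends on the tokens produced by the current one.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_batch(step):
    # Stand-in for CPU-side work: scheduling requests, building indices.
    return {"step": step, "tokens": list(range(step, step + 4))}


def run_forward(batch):
    # Stand-in for the GPU forward pass on an already prepared batch.
    return sum(batch["tokens"])


def generate_async(num_steps):
    """Overlap preparation of batch i+1 with compute on batch i."""
    outputs = []
    # One worker plays the role of the compute stream (an assumption
    # made so the sketch runs without a GPU).
    with ThreadPoolExecutor(max_workers=1) as compute:
        batch = prepare_batch(0)
        for step in range(num_steps):
            # Launch compute on the current batch without blocking.
            in_flight = compute.submit(run_forward, batch)
            if step + 1 < num_steps:
                # Prepare the next batch in a second buffer while the
                # current one is still being computed.
                batch = prepare_batch(step + 1)
            # Waiting here is the analogue of syncing on an event.
            outputs.append(in_flight.result())
    return outputs


if __name__ == "__main__":
    print(generate_async(3))
```

Holding two batches alive at once is what costs the extra memory: the buffer being computed on cannot be overwritten while the next one is being filled.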
The PR also:

  • reduces the number of request statuses, thereby simplifying the request life cycle
  • greatly optimizes the generation of read and write indices for full attention groups, which benefits both synchronous and asynchronous workflows
  • caps the number of CUDA graphs instantiated at the same time
  • modifies the way batches are padded: the user can now define the size of the padding interval for both the Q and KV dimensions, rather than the total number of intervals. This still lets the user balance the amount of padding against how often graphs are recorded, without penalizing users who choose a large num_blocks or max_batch_tokens
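The last two points can be sketched together, as an illustration rather than the PR's code. Sizes are rounded up to a multiple of a user-chosen interval, and the padded (Q, KV) pair is used as the key of a bounded graph cache. The names `pad_to_interval`, `graph_key` and `CappedGraphCache` are hypothetical, and least-recently-used eviction is an assumption: the PR only states that the number of graphs is capped.

```python
from collections import OrderedDict


def pad_to_interval(size, interval):
    """Round size up to the next multiple of interval."""
    return -(-size // interval) * interval


def graph_key(num_q_tokens, num_kv_tokens, q_interval, kv_interval):
    """Padded (Q, KV) sizes; batches sharing a key can share a graph."""
    return (
        pad_to_interval(num_q_tokens, q_interval),
        pad_to_interval(num_kv_tokens, kv_interval),
    )


class CappedGraphCache:
    """Bounded cache standing in for a capped set of CUDA graphs.

    LRU eviction is an assumption made for this sketch.
    """

    def __init__(self, max_graphs):
        self.max_graphs = max_graphs
        self._graphs = OrderedDict()

    def get_or_record(self, key, record_fn):
        if key in self._graphs:
            # Mark as most recently used and reuse the recorded graph.
            self._graphs.move_to_end(key)
            return self._graphs[key]
        graph = record_fn(key)
        self._graphs[key] = graph
        if len(self._graphs) > self.max_graphs:
            # Drop the least recently used entry to respect the cap.
            self._graphs.popitem(last=False)
        return graph

    def __contains__(self, key):
        return key in self._graphs

    def __len__(self):
        return len(self._graphs)


if __name__ == "__main__":
    cache = CappedGraphCache(max_graphs=2)
    for q, kv in [(70, 1000), (60, 900), (130, 1000), (200, 3000)]:
        key = graph_key(q, kv, q_interval=64, kv_interval=256)
        cache.get_or_record(key, record_fn=lambda k: f"graph{k}")
    print(len(cache))
```

With a fixed interval size, the padding added to a batch is bounded by the interval itself and does not grow with the maximum batch size. A smaller interval means less padding but more distinct keys, hence more graph recordings.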

This makes for a beefy PR, but it was needed in order to reach the intended performance for async workflows.

Performance

| Arguments | Throughput on main | Throughput with PR | Delta |
|---|---|---|---|
| --samples 10 | 548.31 | 552.18 | +3.87 |
| --samples 20 --num-blocks 20 | 131.16 | 135.98 | +4.82 |
| --samples 50 | 1416.88 | 1449.68 | +32.80 |
| --samples 100 | 2361.93 | 2429.70 | +67.77 |
| --samples 100 --attn flash_attention_2 | 2028.53 | 2080.67 | +52.14 |
| --samples 100 --attn sdpa | 834.30 | 846.19 | +11.89 |
| --samples 500 | 5396.69 | 5763.99 | +367.30 |
| --samples 500 --use-async | N/A | 6331.60 | +934.91* |
| --samples 500 --add-prefix --compile | 6934.67 | 6802.36 | -132.31^ |
| --samples 50 --num-return-sequences 8 --do-sample | 650.73 | 694.97 | +44.24 |
| --samples 100 --num-return-sequences 4 --do-sample | 1218.37 | 1291.48 | +73.11 |

\* compared with main without async (5396.69 for --samples 500)
^ this is within the margin of error for a run with compile. I think it's better we address compile reproducibility and optimization in a downstream PR.

Tests

Tests pass, and we added new tests for async and for read/write indices.

Sanity checks

Generation looks ok.

@remi-or remi-or requested a review from ArthurZucker February 12, 2026 17:20
@remi-or remi-or self-assigned this Feb 12, 2026