[CB] [Major] Asynchronous batching by remi-or · Pull Request #43960 · huggingface/transformers

remi-or · 2026-02-12T17:20:38Z

Summary

This PR adds asynchronous batching to continuous batching (CB). At the cost of extra VRAM, asynchronous batching uses CUDA streams and events to hide the CPU overhead of preparing and updating batches behind GPU compute. In practice, the GPU runs uninterrupted on the workloads we tested (8B model, many generated tokens).
The PR also:

reduces the number of request statuses, which simplifies the request life cycle
greatly optimizes the generation of read and write indices for full attention groups, which benefits both synchronous and asynchronous workflows
caps the number of CUDA graphs instantiated at the same time
changes how batches are padded: the user now sets the size of the padding interval for the Q and KV dimensions rather than the total number of intervals. This still lets the user balance the amount of padding against how often graphs are recorded, without penalizing users who choose a large num_blocks or max_batch_tokens
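
The padding and graph-cap items above can be illustrated with a small pure-Python sketch. It is not the code from this PR: `pad_to_interval` and `CappedGraphCache` are hypothetical names, the "graph" is just a string, and least-recently-used eviction is an assumption about how a cap could be enforced. The point it shows is that padding Q and KV lengths up to a fixed interval size maps many nearby batch shapes to the same key, so fewer graphs need to be recorded, and the number of shapes per interval does not grow with num_blocks or max_batch_tokens.

```python
from collections import OrderedDict


def pad_to_interval(size: int, interval: int) -> int:
    """Round size up to the next multiple of interval (hypothetical helper)."""
    if size <= 0 or interval <= 0:
        raise ValueError("size and interval must be positive")
    return -(-size // interval) * interval  # ceiling division, then scale back


class CappedGraphCache:
    """Stand-in for a pool of recorded CUDA graphs with an upper bound.

    Assumption: the least recently used graph is evicted when the cap is hit.
    """

    def __init__(self, max_graphs: int) -> None:
        if max_graphs < 1:
            raise ValueError("max_graphs must be at least 1")
        self.max_graphs = max_graphs
        self._graphs = OrderedDict()  # (padded_q, padded_kv) -> graph
        self.recordings = 0  # how many times a graph had to be recorded

    def __len__(self) -> int:
        return len(self._graphs)

    def get(self, q_len: int, kv_len: int, q_interval: int, kv_interval: int) -> str:
        key = (pad_to_interval(q_len, q_interval), pad_to_interval(kv_len, kv_interval))
        if key in self._graphs:
            self._graphs.move_to_end(key)  # mark as most recently used
            return self._graphs[key]
        graph = f"graph{key}"  # placeholder for an actual graph recording
        self.recordings += 1
        self._graphs[key] = graph
        if len(self._graphs) > self.max_graphs:
            self._graphs.popitem(last=False)  # evict the least recently used
        return graph
```

With a Q interval of 8 and a KV interval of 128, batches of shape (5, 100), (7, 120) and (8, 128) all pad to (8, 128) and share one recording, while a larger interval trades more padding for fewer recordings.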

This makes for a beefy PR, but it was needed to reach the intended performance for async workflows.
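
To make the idea of hiding CPU overhead behind GPU compute concrete, here is a toy timing model in plain Python. It does not use CUDA and does not reproduce the PR's scheduler: the function names are hypothetical, and it assumes the CPU can work at most one batch ahead (double buffering) and starts preparing the next batch as soon as the current one is launched.

```python
def sync_total(prep, compute):
    """Total time when each batch is prepared on the CPU, then run on the GPU."""
    return sum(p + c for p, c in zip(prep, compute))


def async_total(prep, compute):
    """Total time when the CPU prepares batch i+1 while the GPU runs batch i.

    Assumption: the CPU works at most one batch ahead (double buffering).
    """
    cpu_start = 0  # time at which the CPU starts preparing the current batch
    gpu_end = 0    # time at which the GPU finishes the previous batch
    for p, c in zip(prep, compute):
        ready = cpu_start + p          # inputs for this batch are ready
        start = max(ready, gpu_end)    # GPU must also be done with the previous batch
        gpu_end = start + c
        cpu_start = start              # next preparation overlaps this compute
    return gpu_end


# Three batches, 1 time unit of CPU preparation and 4 of GPU compute each.
prep_times = [1, 1, 1]
compute_times = [4, 4, 4]
print(sync_total(prep_times, compute_times))   # 15
print(async_total(prep_times, compute_times))  # 13
```

In this model, when preparation is shorter than compute, only the first preparation is exposed and the GPU never idles afterwards, which matches the "GPU runs uninterrupted" behavior described in the summary.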

Performance

Arguments	Throughput on main	Throughput with PR	Delta
--samples 10	548.31	552.18	+3.87
--samples 20 --num-blocks 20	131.16	135.98	+4.82
--samples 50	1416.88	1449.68	+32.80
--samples 100	2361.93	2429.70	+67.77
--samples 100 --attn flash_attention_2	2028.53	2080.67	+52.14
--samples 100 --attn sdpa	834.30	846.19	+11.89
--samples 500	5396.69	5763.99	+367.30
--samples 500 --use-async	N/A	6331.60	+934.91*
--samples 500 --add-prefix --compile	6934.67	6802.36	-132.31^
--samples 50 --num-return-sequences 8 --do-sample	650.73	694.97	+44.24
--samples 100 --num-return-sequences 4 --do-sample	1218.37	1291.48	+73.11

* compared with main, without async
^ this is within the margin of error for a run with compile. I think it is better to address compile reproducibility and optimization in a follow-up PR.
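
For readers who prefer relative numbers, the short snippet below converts the `--samples 500` rows of the table into percentage gains. The throughput values are copied from the table; the helper name `relative_gain` is only for illustration, and the unit is whatever the benchmark reports.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Percentage change of new relative to baseline."""
    return (new - baseline) / baseline * 100.0


main_500 = 5396.69       # --samples 500 on main
pr_sync_500 = 5763.99    # --samples 500 with this PR, without async
pr_async_500 = 6331.60   # --samples 500 --use-async with this PR

print(f"PR vs main (sync):   {relative_gain(main_500, pr_sync_500):.2f}%")
print(f"PR async vs main:    {relative_gain(main_500, pr_async_500):.2f}%")
print(f"PR async vs PR sync: {relative_gain(pr_sync_500, pr_async_500):.2f}%")
```

This works out to roughly 7% for the synchronous path, 17% for the asynchronous path against main, and about 10% from enabling async on top of the other changes.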

Tests

Tests pass, and we added new tests for async batching and for the read / write indices.

Sanity checks

Generation looks ok.

HuggingFaceDocBuilderDev · 2026-02-12T17:34:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

remi-or added 30 commits February 6, 2026 15:56

Cleanup: batch is more self contained

334afd3

Created utils.py file

c8d7753

Moved pad to utils

8aee274

Pin memory for input and outputs

3bc0316

Consolidate inputs into a bulk tensor

1bcc526

Consolidated read and write indices

9a84bde

Add the transfer_inputs fn

92947aa

Renames and getters

a258ba1

Remove useless sync

c5877bc

Move graphs to the IOs

a75124c

Async done except for carry_in_ids

b4c13fd

Add carry over (scheduler not picking up tho)

da98e66

Remodeled scheduling

b736a13

Fix carry over

462ff76

Fix stream

810b55f

Bumped _upper_bound_num_blocks

8ead282

Faster compute for physical read indices

e4c429f

Final actual changes

6e68685

Adress some todos

ed7d437

Rename input_outputs

9004387

Modify the behavior of async

188ee86

Fix bugs

8f728f9

Added async tests

b33a6ca

Fix test

d48c2bd

Remodel example

353c261

Fix offload test

9b0dc90

Fix real cause of offload fail

cce892a

Nits

6cf6b97

Propagate use_async

f4a6a67

Performance fixes 1

7ef9f54

remi-or added 4 commits February 12, 2026 11:25

More flexibility for cuda graphs

e47d0ea

Remodeled the read and write indices

2c171fc

Review compliance

0c70eb3

More doc and beautifull ascii

df57dae

remi-or requested a review from ArthurZucker February 12, 2026 17:20

remi-or self-assigned this Feb 12, 2026

Style

66e05bc

remi-or wants to merge 35 commits into main from cb-async-lfg
