
Conversation

@shuyixiong
Contributor

@shuyixiong shuyixiong commented Jan 29, 2026

Closes #501
As title: add stateless group support so that we can remove the vLLM dependency from the train backend.

Test Result
Tested on llama3.1-8b-instruct and dsv3. Only the non-colocated case was tested, since the change only affects that path.

| Train backend | vLLM nodes | vLLM setting | Result |
| --- | --- | --- | --- |
| DTensor | 1 or 2 | TP2PP1 | (image) |
| DTensor | 1 or 2 | TP1PP2 | (image) |
| DTensor | 2 | TP2PP2 | (image) |
| Megatron | 16 | TP32 | (images) |

Summary by CodeRabbit

  • New Features

    • Introduced stateless process group for distributed NCCL-based communication, enabling flexible rank coordination.
  • Chores

    • Added dependencies: nccl4py, cuda-bindings, and pybase64.
    • Removed vllm from optional dependency groups.
    • Updated distributed communication initialization across multiple components.


Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong shuyixiong force-pushed the shuyix/decouple_stateless_group branch from e5fdfd1 to 76a01cf Compare January 29, 2026 09:29
@yuki-97
Contributor

yuki-97 commented Jan 29, 2026

@shuyixiong thanks so much for supporting this! I'll take care of the rest.

@yuki-97 yuki-97 changed the title Shuyix/decouple stateless group feat: suppport stateless group and decouple vLLM in train backend Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jan 29, 2026
@yuki-97 yuki-97 changed the title feat: suppport stateless group and decouple vLLM in train backend feat: support stateless group and decouple vLLM in train backend Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 30, 2026
@yuki-97 yuki-97 marked this pull request as ready for review January 30, 2026 05:57
@yuki-97 yuki-97 requested review from a team as code owners January 30, 2026 05:57
@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

This change introduces a new StatelessProcessGroup class for NCCL-based distributed communication using TCPStore for unique ID coordination, and updates dependent code to use this new class instead of PyNcclCommunicator. Dependencies are updated to include nccl4py and cuda-bindings while removing vllm from optional dependency groups.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **StatelessProcessGroup Implementation**<br>`nemo_rl/distributed/stateless_process_group.py` | New class enabling stateless NCCL communication with TCPStore-based unique ID coordination. Rank 0 generates and stores the ID while other ranks retrieve it, followed by NCCL communicator initialization, warmup broadcast verification, and synchronization. |
| **Model Initialization Integration**<br>`nemo_rl/models/generation/vllm/vllm_backend.py`, `nemo_rl/models/policy/workers/base_policy_worker.py` | Updated to replace PyNcclCommunicator with direct StatelessProcessGroup instantiation, adding an explicit `init_nccl_communicator(device=...)` call after construction (see the sketch after this table). |
| **Dependency Management**<br>`pyproject.toml` | Added nccl4py, cuda-bindings, and pybase64 dependencies. Removed vllm==0.11.2 from the fsdp, automodel, and mcore optional dependency groups. |
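As a rough illustration of the call-site pattern described in the Model Initialization Integration row, a sketch follows; the address, port, and rank values are hypothetical, while the constructor and method names come from this PR.

```python
# Hypothetical call site: construct the group, then initialize the NCCL
# communicator explicitly on this worker's GPU, per the summary table above.
import torch

from nemo_rl.distributed.stateless_process_group import StatelessProcessGroup

rank, world_size, local_rank = 0, 2, 0  # example values for a 2-rank group

group = StatelessProcessGroup(
    master_address="10.0.0.1",  # hypothetical coordinator address
    port=29500,                 # hypothetical TCPStore port
    rank=rank,
    world_size=world_size,
)
group.init_nccl_communicator(device=torch.device("cuda", local_rank))
```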

Sequence Diagram(s)

sequenceDiagram
    participant R0 as Rank 0
    participant R1 as Other Ranks
    participant TCPStore as TCPStore
    participant NCCL as NCCL Communicator

    R0->>TCPStore: Generate unique_id
    R0->>TCPStore: Store unique_id
    par
        R0->>NCCL: Init with unique_id
        R1->>TCPStore: Retrieve unique_id
        R1->>NCCL: Init with unique_id
    end
    R0->>NCCL: Warmup broadcast (verify connectivity)
    R1->>NCCL: Receive warmup broadcast
    NCCL->>NCCL: Synchronize all ranks
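To make the handshake concrete, here is a minimal sketch of the coordination logic the diagram describes. It is an illustration, not the merged implementation: the class and method names follow the PR, `UNIQUE_ID_KEY` is an assumed store key, and the nccl4py communicator construction is elided to comments since its exact API is not shown in this thread.

```python
# Minimal sketch of the TCPStore handshake above (illustration, not the PR code).
import base64  # the PR adds pybase64, a faster drop-in for this encoding step
from datetime import timedelta

import torch
from torch.distributed import TCPStore

UNIQUE_ID_KEY = "nccl_unique_id"  # assumed store key name


class StatelessProcessGroup:
    def __init__(self, master_address: str, port: int, rank: int, world_size: int):
        self.master_address = master_address
        self.port = port
        self.rank = rank
        self.world_size = world_size
        # Rank 0 hosts the store; every other rank connects to it as a client.
        self.tcp_store = TCPStore(
            master_address,
            port,
            world_size,
            is_master=(rank == 0),
            timeout=timedelta(seconds=300),
        )
        self.nccl_communicator = None

    def init_nccl_communicator(self, device: torch.device) -> None:
        if self.rank == 0:
            # In the PR, rank 0 obtains the NCCL unique id via nccl4py
            # (the ncclGetUniqueId equivalent) and publishes it to the store.
            raw_id = b"\x00" * 128  # placeholder for a real NCCL unique id
            self.tcp_store.set(UNIQUE_ID_KEY, base64.b64encode(raw_id))
        else:
            # get() blocks until rank 0 has published the key.
            raw_id = base64.b64decode(self.tcp_store.get(UNIQUE_ID_KEY))
        # All ranks then build the NCCL communicator from (raw_id, self.rank,
        # self.world_size) and verify it with a warmup broadcast of
        # torch.ones(1, device=device) from rank 0, followed by
        # torch.cuda.current_stream().synchronize().
```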

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: introducing stateless group support and decoupling vLLM from the training backend, which aligns with the primary objectives of the PR. |
| Test Results For Major Changes | ✅ Passed | The PR includes comprehensive test results with multiple configurations tested (DTensor with TP2PP1, TP1PP2, TP2PP2; Megatron with TP32), environment specifications (llama3.1-8b-instruct, dsv3), and training metric dashboards showing stable performance across runs. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/stateless_process_group.py`:
- Around line 35-63: The warmup verification in init_nccl_communicator uses
assert which can be disabled; replace it with an explicit check that raises a
clear exception (e.g., RuntimeError) if the broadcast didn't produce the
expected result: after torch.cuda.current_stream().synchronize(), verify
torch.allclose(data, torch.ones(1, device=device)) and if false raise an error
mentioning nccl communicator initialization failure, the rank, and
UNIQUE_ID_KEY/unique_id context so failure causes a hard stop and useful logs
rather than being skipped under -O.
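A sketch of that suggested fix, assuming (per the comment) the warmup tensor is named `data` and this fragment sits inside `init_nccl_communicator`:

```python
# Explicit check that survives `python -O`, replacing the disable-able assert.
torch.cuda.current_stream().synchronize()
if not torch.allclose(data, torch.ones(1, device=device)):
    raise RuntimeError(
        f"NCCL communicator initialization failed on rank {self.rank}: "
        f"warmup broadcast did not return the expected tensor "
        f"(unique id store key: {UNIQUE_ID_KEY})."
    )
```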
🧹 Nitpick comments (1)
nemo_rl/distributed/stateless_process_group.py (1)

22-34: Add class and method docstrings for public API.

This class is part of a public interface used by other modules. Adding Google-style docstrings would improve maintainability and enable proper Sphinx documentation generation.

📝 Suggested docstring addition
 class StatelessProcessGroup:
+    """Stateless NCCL-based distributed communication group.
+
+    Uses TCPStore for rank coordination and NCCL unique ID exchange,
+    enabling distributed collective operations without persistent
+    process group state.
+
+    Attributes:
+        master_address: Host address for TCPStore coordination.
+        port: Port number for TCPStore.
+        rank: This process's rank in the group.
+        world_size: Total number of processes in the group.
+        tcp_store: TCPStore instance for coordination.
+        nccl_communicator: NCCL communicator (initialized via init_nccl_communicator).
+    """
+
     def __init__(self, master_address: str, port: int, rank: int, world_size: int):
+        """Initialize the stateless process group.
+
+        Args:
+            master_address: Host address for the TCPStore.
+            port: Port number for the TCPStore.
+            rank: Rank of this process (0-indexed).
+            world_size: Total number of processes in the group.
+        """
         self.master_address = master_address

Contributor

@terrykong terrykong left a comment


thanks @shuyixiong for cleaning this up! this will definitely help us maintain other backends more easily by decoupling vllm from the training backend!

could you update the megatron TP16 plot's logprob error by clipping it? I just wanted to confirm that it's not steadily increasing

@youngeunkwon0405
Contributor

Hi @shuyixiong, I have a quick question. Is this change using nccl4py, which was developed recently?

@yuki-97
Contributor

yuki-97 commented Jan 31, 2026

> could you update the megatron TP16 plot's logprob error by clipping it? I just wanted to confirm that it's not steadily increasing

@terrykong added. Both main and this PR show high logprob error in dsv3 TP32, so this change doesn't cause the logprob error to rise. And from train/token_mult_prob_error_plot_sample we can see that only one or two tokens have an obviously different logprob error, which should be fine.

@yuki-97
Contributor

yuki-97 commented Jan 31, 2026

> Hi @shuyixiong, I have a quick question. Is this change using nccl4py, which was developed recently?

Yes, it is using it:
https://github.com/NVIDIA-NeMo/RL/pull/1842/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R53

@yuki-97 yuki-97 merged commit d5aca5a into main Jan 31, 2026
50 of 52 checks passed
@yuki-97 yuki-97 deleted the shuyix/decouple_stateless_group branch January 31, 2026 02:26

Labels

CI:L1 Run doctests, unit tests, and functional tests


Development

Successfully merging this pull request may close these issues.

Use native collective instead of vllm's in non-colocated

5 participants