
Conversation

@shuyixiong
Contributor

@shuyixiong shuyixiong commented Jan 29, 2026

Closes #501
As title: add stateless group support so that we can remove the vLLM dependency from the train backend.

Test Result
Tested on llama3.1-8b-instruct and dsv3. Only the non-colocated case was tested, since the change only affects that path.

| Train backend | vLLM nodes | vLLM setting | Result |
| --- | --- | --- | --- |
| DTensor | 1 or 2 | TP2PP1 | (image) |
| DTensor | 1 or 2 | TP1PP2 | (image) |
| DTensor | 2 | TP2PP2 | (image) |
| Megatron | 16 | TP32 | (images) |

Summary by CodeRabbit

  • New Features

    • Introduced stateless process group for distributed NCCL-based communication, enabling flexible rank coordination.
  • Chores

    • Added dependencies: nccl4py, cuda-bindings, and pybase64.
    • Removed vllm from optional dependency groups.
    • Updated distributed communication initialization across multiple components.


Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong shuyixiong force-pushed the shuyix/decouple_stateless_group branch from e5fdfd1 to 76a01cf Compare January 29, 2026 09:29
@yuki-97
Contributor

yuki-97 commented Jan 29, 2026

@shuyixiong thanks so much for supporting this! I'll take care of the rest.

@yuki-97 yuki-97 changed the title Shuyix/decouple stateless group feat: suppport stateless group and decouple vLLM in train backend Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jan 29, 2026
@yuki-97 yuki-97 changed the title feat: suppport stateless group and decouple vLLM in train backend feat: support stateless group and decouple vLLM in train backend Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 29, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 30, 2026
@yuki-97 yuki-97 marked this pull request as ready for review January 30, 2026 05:57
@yuki-97 yuki-97 requested review from a team as code owners January 30, 2026 05:57
@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

This change introduces a new StatelessProcessGroup class for NCCL-based distributed communication using TCPStore for unique ID coordination, and updates dependent code to use this new class instead of PyNcclCommunicator. Dependencies are updated to include nccl4py and cuda-bindings while removing vllm from optional dependency groups.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **StatelessProcessGroup Implementation**<br>`nemo_rl/distributed/stateless_process_group.py` | New class enabling stateless NCCL communication with TCPStore-based unique ID coordination. Rank 0 generates and stores the ID while other ranks retrieve it, followed by NCCL communicator initialization, warmup broadcast verification, and synchronization. |
| **Model Initialization Integration**<br>`nemo_rl/models/generation/vllm/vllm_backend.py`, `nemo_rl/models/policy/workers/base_policy_worker.py` | Updated to replace PyNcclCommunicator with direct StatelessProcessGroup instantiation, adding an explicit `init_nccl_communicator(device=...)` call after construction (see the sketch after this table). |
| **Dependency Management**<br>`pyproject.toml` | Added nccl4py, cuda-bindings, and pybase64 dependencies. Removed vllm==0.11.2 from the fsdp, automodel, and mcore optional dependency groups. |
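As a rough illustration of the call-site pattern described in the Model Initialization Integration row, a sketch follows; the address, port, and rank values are hypothetical, while the constructor and method names come from this PR.

```python
# Hypothetical call site: construct the group, then initialize the NCCL
# communicator explicitly on this worker's GPU, per the summary table above.
import torch

from nemo_rl.distributed.stateless_process_group import StatelessProcessGroup

rank, world_size, local_rank = 0, 2, 0  # example values for a 2-rank group

group = StatelessProcessGroup(
    master_address="10.0.0.1",  # hypothetical coordinator address
    port=29500,                 # hypothetical TCPStore port
    rank=rank,
    world_size=world_size,
)
group.init_nccl_communicator(device=torch.device("cuda", local_rank))
```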

Sequence Diagram(s)

sequenceDiagram
    participant R0 as Rank 0
    participant R1 as Other Ranks
    participant TCPStore as TCPStore
    participant NCCL as NCCL Communicator

    R0->>TCPStore: Generate unique_id
    R0->>TCPStore: Store unique_id
    par
        R0->>NCCL: Init with unique_id
        R1->>TCPStore: Retrieve unique_id
        R1->>NCCL: Init with unique_id
    end
    R0->>NCCL: Warmup broadcast (verify connectivity)
    R1->>NCCL: Receive warmup broadcast
    NCCL->>NCCL: Synchronize all ranks
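To make the handshake concrete, here is a minimal sketch of the coordination logic the diagram describes. It is an illustration, not the merged implementation: the class and method names follow the PR, `UNIQUE_ID_KEY` is an assumed store key, and the nccl4py communicator construction is elided to comments since its exact API is not shown in this thread.

```python
# Minimal sketch of the TCPStore handshake above (illustration, not the PR code).
import base64  # the PR adds pybase64, a faster drop-in for this encoding step
from datetime import timedelta

import torch
from torch.distributed import TCPStore

UNIQUE_ID_KEY = "nccl_unique_id"  # assumed store key name


class StatelessProcessGroup:
    def __init__(self, master_address: str, port: int, rank: int, world_size: int):
        self.master_address = master_address
        self.port = port
        self.rank = rank
        self.world_size = world_size
        # Rank 0 hosts the store; every other rank connects to it as a client.
        self.tcp_store = TCPStore(
            master_address,
            port,
            world_size,
            is_master=(rank == 0),
            timeout=timedelta(seconds=300),
        )
        self.nccl_communicator = None

    def init_nccl_communicator(self, device: torch.device) -> None:
        if self.rank == 0:
            # In the PR, rank 0 obtains the NCCL unique id via nccl4py
            # (the ncclGetUniqueId equivalent) and publishes it to the store.
            raw_id = b"\x00" * 128  # placeholder for a real NCCL unique id
            self.tcp_store.set(UNIQUE_ID_KEY, base64.b64encode(raw_id))
        else:
            # get() blocks until rank 0 has published the key.
            raw_id = base64.b64decode(self.tcp_store.get(UNIQUE_ID_KEY))
        # All ranks then build the NCCL communicator from (raw_id, self.rank,
        # self.world_size) and verify it with a warmup broadcast of
        # torch.ones(1, device=device) from rank 0, followed by
        # torch.cuda.current_stream().synchronize().
```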

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: introducing stateless group support and decoupling vLLM from the training backend, which aligns with the primary objectives of the PR. |
| Test Results For Major Changes | ✅ Passed | The PR includes comprehensive test results with multiple configurations tested (DTensor with TP2PP1, TP1PP2, TP2PP2; Megatron with TP32), environment specifications (llama3.1-8b-instruct, dsv3), and training metric dashboards showing stable performance across runs. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/stateless_process_group.py`:
- Around line 35-63: The warmup verification in init_nccl_communicator uses
assert which can be disabled; replace it with an explicit check that raises a
clear exception (e.g., RuntimeError) if the broadcast didn't produce the
expected result: after torch.cuda.current_stream().synchronize(), verify
torch.allclose(data, torch.ones(1, device=device)) and if false raise an error
mentioning nccl communicator initialization failure, the rank, and
UNIQUE_ID_KEY/unique_id context so failure causes a hard stop and useful logs
rather than being skipped under -O.
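A sketch of that suggested fix, assuming (per the comment) the warmup tensor is named `data` and this fragment sits inside `init_nccl_communicator`:

```python
# Explicit check that survives `python -O`, replacing the disable-able assert.
torch.cuda.current_stream().synchronize()
if not torch.allclose(data, torch.ones(1, device=device)):
    raise RuntimeError(
        f"NCCL communicator initialization failed on rank {self.rank}: "
        f"warmup broadcast did not return the expected tensor "
        f"(unique id store key: {UNIQUE_ID_KEY})."
    )
```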
🧹 Nitpick comments (1)
nemo_rl/distributed/stateless_process_group.py (1)

22-34: Add class and method docstrings for public API.

This class is part of a public interface used by other modules. Adding Google-style docstrings would improve maintainability and enable proper Sphinx documentation generation.

📝 Suggested docstring addition
 class StatelessProcessGroup:
+    """Stateless NCCL-based distributed communication group.
+
+    Uses TCPStore for rank coordination and NCCL unique ID exchange,
+    enabling distributed collective operations without persistent
+    process group state.
+
+    Attributes:
+        master_address: Host address for TCPStore coordination.
+        port: Port number for TCPStore.
+        rank: This process's rank in the group.
+        world_size: Total number of processes in the group.
+        tcp_store: TCPStore instance for coordination.
+        nccl_communicator: NCCL communicator (initialized via init_nccl_communicator).
+    """
+
     def __init__(self, master_address: str, port: int, rank: int, world_size: int):
+        """Initialize the stateless process group.
+
+        Args:
+            master_address: Host address for the TCPStore.
+            port: Port number for the TCPStore.
+            rank: Rank of this process (0-indexed).
+            world_size: Total number of processes in the group.
+        """
         self.master_address = master_address

Contributor

@terrykong terrykong left a comment


thanks @shuyixiong for cleaning this up! this will definitely help us maintain other backends more easily by decoupling vllm from the training backend!

could you update the megatron TP16 plot's logprob error by clipping it? I just wanted to confirm that it's not steadily increasing

@youngeunkwon0405
Contributor

Hi @shuyixiong, I have a quick question. Is this change using nccl4py, which was developed recently?

@yuki-97
Contributor

yuki-97 commented Jan 31, 2026

> could you update the megatron TP16 plot's logprob error by clipping it? I just wanted to confirm that it's not steadily increasing

@terrykong added. Both main and this PR show high logprob error in dsv3 TP32, so this change doesn't cause the logprob error to rise. And from train/token_mult_prob_error_plot_sample we can see that only one or two tokens have an obviously different logprob error, which should be fine.

@yuki-97
Contributor

yuki-97 commented Jan 31, 2026

> Hi @shuyixiong, I have a quick question. Is this change using nccl4py, which was developed recently?

Yes, it is using it:
https://github.com/NVIDIA-NeMo/RL/pull/1842/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R53

@yuki-97 yuki-97 merged commit d5aca5a into main Jan 31, 2026
50 of 52 checks passed
@yuki-97 yuki-97 deleted the shuyix/decouple_stateless_group branch January 31, 2026 02:26

Labels

CI:L1 Run doctests, unit tests, and functional tests


Development

Successfully merging this pull request may close these issues.

Use native collective instead of vllm's in non-colocated

5 participants