
Implement pass rate-based curriculum learning with weighted sampling #153

Open

jb3618columbia wants to merge 1 commit into verl-latest-cispo from feature/pass-rate-curriculum-learning

Conversation

@jb3618columbia (Collaborator) commented Jan 24, 2026

Summary

Implements curriculum learning using pass rate-based weighted sampling for GRPO training

Changes

  • Add a PassRateTracker class to track attempt and success counts for each prompt. This tracker can be reused by multiple curriculum samplers
  • Add a PassRateWeightedSampler class that adjusts each prompt's probability of being sampled into a batch based on its historical pass rate (optionally smoothed with an exponential moving average)
  • Update DAPOTrainer to update the pass rate tracker during training and log curriculum metrics
  • Minor edits to make the integration work
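A minimal sketch of what such a tracker could look like (class and method names are illustrative, not necessarily the PR's exact implementation; the EMA handling in particular is an assumption based on the description above):

```python
from collections import defaultdict

class PassRateTracker:
    """Tracks per-prompt attempt/success counts; optionally keeps an EMA pass rate."""

    def __init__(self, ema_alpha=None):
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)
        self.ema = {}
        self.ema_alpha = ema_alpha  # e.g. 0.1; None disables the EMA

    def update(self, idx, num_attempts, num_successes):
        self.attempts[idx] += num_attempts
        self.successes[idx] += num_successes
        if self.ema_alpha is not None:
            batch_rate = num_successes / max(num_attempts, 1)
            prev = self.ema.get(idx, batch_rate)
            self.ema[idx] = (1 - self.ema_alpha) * prev + self.ema_alpha * batch_rate

    def pass_rate(self, idx, default=0.5):
        if self.ema_alpha is not None and idx in self.ema:
            return self.ema[idx]
        if self.attempts[idx] == 0:
            return default  # unseen prompts get a neutral prior
        return self.successes[idx] / self.attempts[idx]
```

The neutral prior for unseen prompts matters: without it, prompts that have never been sampled would look either maximally easy or maximally hard and distort the weights.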

Testing

  • Tested with local single-node runs

  • Tested with multi-node SLURM runs (2 nodes, 8 GPUs each)

  • Logs curriculum metrics: hardest_10pct/25pct/50pct/75pct pass rates, batch-level statistics

  • See curriculum learning runs: https://wandb.ai/mbzuai-llm/Reasoning360/runs/qab27nv0?nw=nwuserjalajbhandari

  • Example run: Curriculum-1435219-qwen2.5-32b-base-fsdp-temp_0.5_data_mixtures_round2_train_prompt_bsz_32

  1. Effective batch size increases with training, decreases when hard samples get sampled, and then increases again as the model learns to solve hard problems
(screenshot: effective batch size over training)
  2. Pass rates of hard examples increase with training: the model focuses on harder problems and starts to solve them (tracking the percentile of hard problems based on historical pass rates)
(screenshots: hardest-percentile pass rates over training)
  3. The attempt-count distribution is right-skewed, showing that some prompts are attempted only a few times (easy prompts) while others are attempted many times (hard prompts)
(screenshot: per-prompt attempt-count distribution)
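The hardest-10pct/25pct/50pct/75pct metrics mentioned above could be computed along these lines (a sketch; the function name is illustrative):

```python
import numpy as np

def hardest_percentile_pass_rate(pass_rates, pct):
    """Mean pass rate of the hardest pct% of prompts (lowest historical pass rates)."""
    rates = np.sort(np.asarray(pass_rates, dtype=float))  # ascending: hardest first
    k = max(1, int(len(rates) * pct / 100))
    return rates[:k].mean()

rates = [0.0, 0.1, 0.4, 0.8, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0]
m10 = hardest_percentile_pass_rate(rates, 10)  # hardest 10% of prompts
m50 = hardest_percentile_pass_rate(rates, 50)  # hardest 50% of prompts
```

Watching these percentile means rise over training is what indicates the model is making progress on the hard tail rather than only on easy prompts.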

Copilot AI left a comment

Pull request overview

This PR implements pass rate-based curriculum learning for GRPO training by introducing weighted sampling that prioritizes harder samples (those with lower historical success rates).

Changes:

  • Added PassRateTracker class to track attempt counts and success rates for each prompt in the dataset
  • Added PassRateWeightedSampler class that implements curriculum learning through dynamic weighted sampling based on historical pass rates
  • Integrated curriculum learning into the DAPO trainer with pass rate tracking and curriculum-specific metrics logging
  • Updated configuration files and training scripts with curriculum learning examples

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
verl/utils/pass_rate_tracker.py Core tracker for maintaining historical pass rates and attempt counts per sample
verl/utils/pass_rate_weighted_sampler.py Weighted sampler that adjusts sampling probabilities based on pass rates
verl/utils/dataset/rl_dataset.py Added dataset_index field to enable sample tracking
verl/trainer/ppo/ray_trainer.py Added comment clarifying sampler creation
verl/trainer/ppo/metric_utils.py Added reward standard deviation metric
verl/trainer/config/data/legacy_data.yaml Added curriculum sampler configuration parameters
recipe/dapo/dapo_ray_trainer.py Integrated pass rate tracking and curriculum metrics logging into training loop
scripts/* Added example training scripts demonstrating curriculum learning usage


@nightlessbaron (Collaborator) left a comment

Hey @jb3618columbia , this is a good start. However, there are lots of things that we can improve on:

  1. Can we add customization to it such that the user can add a custom strategy in the weight sampler?
  2. Also add customization such that the user can define how many steps to wait before updating the weights.
  3. Please add some tests or optionally you can add them to RL360.
  4. Please add a short documentation as a quick start guide for people to use.
  5. Address all of my comments as well as those from codex :)

Good job overall :D
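Point 1 above (a user-supplied weighting strategy) could be sketched like this; `make_sampler_weights` and the `weight_fn` hook are hypothetical names, not part of the PR:

```python
import numpy as np

def make_sampler_weights(pass_rates, weight_fn=None):
    """Compute normalized sampling weights from pass rates via a pluggable strategy.

    weight_fn maps an array of pass rates to unnormalized weights; the default
    favors low-pass-rate (hard) prompts linearly.
    """
    rates = np.asarray(pass_rates, dtype=float)
    if weight_fn is None:
        weight_fn = lambda r: 1.0 - r           # default linear strategy
    w = np.maximum(weight_fn(rates), 1e-6)      # keep every prompt sampleable
    return w / w.sum()

# user-defined strategy: quadratic weighting sharpens the focus on hard prompts
probs = make_sampler_weights([0.9, 0.5, 0.1], weight_fn=lambda r: (1.0 - r) ** 2)
```

The floor on the weights is a deliberate design choice: even fully-solved prompts retain a tiny sampling probability, so the tracker can detect if the model later regresses on them.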

Collaborator

remove this file from your changes

Collaborator (Author)

Do we want to keep the changes we made to the cluster environment setup so that people can use the example script in the future? If not, I can remove all the above changes.

One thought is to update/clean up these files as we go along so people can refer to/use them out of the box.

Collaborator

as discussed, let's remove them for now

Collaborator

remove this file from your changes

# num dataloader workers
dataloader_num_workers: 8
# NOTE: Must be 0 when using curriculum learning samplers (e.g., PassRateWeightedSampler)
# to prevent data caching before batches are reordered.
Collaborator

Not sure I understand this.

Collaborator (Author)

My understanding is that with dataloader_num_workers > 0, batches may not be sampled from the latest set of weights because of prefetching, but perhaps I am incorrect here.

Collaborator

It just means the DataLoader will use multiple worker subprocesses to fetch and preprocess batches in parallel with your training loop. If you set it to 0, data loading would happen in the main training process.
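The staleness concern being discussed can be illustrated with a toy prefetch buffer (pure Python, not verl's actual DataLoader): indices a worker prefetches ahead of time are fixed under the weights that were current at prefetch time, even if the trainer updates the weights before those batches are consumed.

```python
from itertools import islice

def sampler(weights):
    # toy sampler: always yields the index with the currently highest weight
    while True:
        yield max(range(len(weights)), key=weights.__getitem__)

weights = [1.0, 0.0, 0.0]
it = sampler(weights)
prefetched = list(islice(it, 2))  # a worker prefetches 2 indices ahead
weights[1] = 5.0                  # trainer updates weights mid-epoch
next_draws = list(islice(it, 2))  # fresh draws see the update...
# ...but the already-prefetched indices were chosen under the old weights
```

This is why setting dataloader_num_workers to 0 (or otherwise bounding prefetch) keeps the sampled batches consistent with the latest curriculum weights, at the cost of losing parallel data loading.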

Collaborator (Author)

Hmm, would it lead to "off-policyness" in terms of the weighting distribution? That was my concern.

@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch 8 times, most recently from f24ef65 to ef85821 Compare January 30, 2026 00:54

"${CONDA_BIN_PATH}python" -m recipe.dapo.main_dapo \
--config-path=config \
--config-name="dapo_fsdp_config_with_resampling.yaml" \
Collaborator (Author)

@nightlessbaron this config is now used for pass rate-based weighted sampling

@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch 2 times, most recently from 83d9600 to ec3726c Compare January 30, 2026 00:59
@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch from ec3726c to ecc8b8c Compare January 30, 2026 00:59
