
Implement pass rate-based curriculum learning with weighted sampling #153

Open

jb3618columbia wants to merge 1 commit into verl-latest-cispo from feature/pass-rate-curriculum-learning

Conversation

@jb3618columbia (Collaborator) commented Jan 24, 2026

Summary

Implements curriculum learning using pass rate-based weighted sampling for GRPO training

Changes

  • Add a PassRateTracker class to track attempt and success counts for each prompt. This tracker can be reused by multiple curriculum samplers
  • Add a PassRateWeightedSampler class that adjusts each prompt's probability of being sampled into a batch based on its historical pass rate (optionally smoothed with an exponential moving average)
  • Update DAPOTrainer to update the pass rate tracker during training and log curriculum metrics
  • Minor edits to make the integration work
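A minimal sketch of what such a tracker could look like (class and method names are illustrative, not necessarily the PR's exact implementation; the EMA handling in particular is an assumption based on the description above):

```python
from collections import defaultdict

class PassRateTracker:
    """Tracks per-prompt attempt/success counts; optionally keeps an EMA pass rate."""

    def __init__(self, ema_alpha=None):
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)
        self.ema = {}
        self.ema_alpha = ema_alpha  # e.g. 0.1; None disables the EMA

    def update(self, idx, num_attempts, num_successes):
        self.attempts[idx] += num_attempts
        self.successes[idx] += num_successes
        if self.ema_alpha is not None:
            batch_rate = num_successes / max(num_attempts, 1)
            prev = self.ema.get(idx, batch_rate)
            self.ema[idx] = (1 - self.ema_alpha) * prev + self.ema_alpha * batch_rate

    def pass_rate(self, idx, default=0.5):
        if self.ema_alpha is not None and idx in self.ema:
            return self.ema[idx]
        if self.attempts[idx] == 0:
            return default  # unseen prompts get a neutral prior
        return self.successes[idx] / self.attempts[idx]
```

The neutral prior for unseen prompts matters: without it, prompts that have never been sampled would look either maximally easy or maximally hard and distort the weights.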

Testing

  • Tested with local single-node runs

  • Tested with multi-node SLURM runs (2 nodes, 8 GPUs each)

  • Logs curriculum metrics: hardest_10pct/25pct/50pct/75pct pass rates, batch-level statistics

  • See curriculum learning runs: https://wandb.ai/mbzuai-llm/Reasoning360/runs/qab27nv0?nw=nwuserjalajbhandari

  • Example run: Curriculum-1435219-qwen2.5-32b-base-fsdp-temp_0.5_data_mixtures_round2_train_prompt_bsz_32

  1. Effective batch size increases with training, decreases when hard samples get sampled, and then increases again as the model learns to solve hard problems
(screenshot: effective batch size over training)
  2. Pass rates of hard examples increase with training: the model focuses on harder problems and starts to solve them (tracking the percentile of hard problems based on historical pass rates)
(screenshots: hardest-percentile pass rates over training)
  3. The attempt-count distribution is right-skewed, showing that some prompts are attempted only a few times (easy prompts) while others are attempted many times (hard prompts)
(screenshot: per-prompt attempt-count distribution)
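The hardest-10pct/25pct/50pct/75pct metrics mentioned above could be computed along these lines (a sketch; the function name is illustrative):

```python
import numpy as np

def hardest_percentile_pass_rate(pass_rates, pct):
    """Mean pass rate of the hardest pct% of prompts (lowest historical pass rates)."""
    rates = np.sort(np.asarray(pass_rates, dtype=float))  # ascending: hardest first
    k = max(1, int(len(rates) * pct / 100))
    return rates[:k].mean()

rates = [0.0, 0.1, 0.4, 0.8, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0]
m10 = hardest_percentile_pass_rate(rates, 10)  # hardest 10% of prompts
m50 = hardest_percentile_pass_rate(rates, 50)  # hardest 50% of prompts
```

Watching these percentile means rise over training is what indicates the model is making progress on the hard tail rather than only on easy prompts.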

Copilot AI left a comment

Pull request overview

This PR implements pass rate-based curriculum learning for GRPO training by introducing weighted sampling that prioritizes harder samples (those with lower historical success rates).

Changes:

  • Added PassRateTracker class to track attempt counts and success rates for each prompt in the dataset
  • Added PassRateWeightedSampler class that implements curriculum learning through dynamic weighted sampling based on historical pass rates
  • Integrated curriculum learning into the DAPO trainer with pass rate tracking and curriculum-specific metrics logging
  • Updated configuration files and training scripts with curriculum learning examples

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
verl/utils/pass_rate_tracker.py Core tracker for maintaining historical pass rates and attempt counts per sample
verl/utils/pass_rate_weighted_sampler.py Weighted sampler that adjusts sampling probabilities based on pass rates
verl/utils/dataset/rl_dataset.py Added dataset_index field to enable sample tracking
verl/trainer/ppo/ray_trainer.py Added comment clarifying sampler creation
verl/trainer/ppo/metric_utils.py Added reward standard deviation metric
verl/trainer/config/data/legacy_data.yaml Added curriculum sampler configuration parameters
recipe/dapo/dapo_ray_trainer.py Integrated pass rate tracking and curriculum metrics logging into training loop
scripts/* Added example training scripts demonstrating curriculum learning usage


@nightlessbaron (Collaborator) left a comment

Hey @jb3618columbia , this is a good start. However, there are lots of things that we can improve on:

  1. Can we add customization to it such that the user can add a custom strategy in the weight sampler?
  2. Also add customization such that the user can define how many steps to wait before updating the weights.
  3. Please add some tests or optionally you can add them to RL360.
  4. Please add a short documentation as a quick start guide for people to use.
  5. Address all of my comments as well as those from codex :)

Good job overall :D
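Point 1 above (a user-supplied weighting strategy) could be sketched like this; `make_sampler_weights` and the `weight_fn` hook are hypothetical names, not part of the PR:

```python
import numpy as np

def make_sampler_weights(pass_rates, weight_fn=None):
    """Compute normalized sampling weights from pass rates via a pluggable strategy.

    weight_fn maps an array of pass rates to unnormalized weights; the default
    favors low-pass-rate (hard) prompts linearly.
    """
    rates = np.asarray(pass_rates, dtype=float)
    if weight_fn is None:
        weight_fn = lambda r: 1.0 - r           # default linear strategy
    w = np.maximum(weight_fn(rates), 1e-6)      # keep every prompt sampleable
    return w / w.sum()

# user-defined strategy: quadratic weighting sharpens the focus on hard prompts
probs = make_sampler_weights([0.9, 0.5, 0.1], weight_fn=lambda r: (1.0 - r) ** 2)
```

The floor on the weights is a deliberate design choice: even fully-solved prompts retain a tiny sampling probability, so the tracker can detect if the model later regresses on them.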

Collaborator

remove this file from your changes

Collaborator (Author)

Do we want to keep the changes we made to the cluster environment setup so that people can use the example script in the future? If not, I can remove all the above changes.

One thought is to update/clean up these files as we go along so people can refer to/use them out of the box.

Collaborator

as discussed, let's remove them for now

Collaborator

remove this file from your changes

# num dataloader workers
dataloader_num_workers: 8
# NOTE: Must be 0 when using curriculum learning samplers (e.g., PassRateWeightedSampler)
# to prevent data caching before batches are reordered.
Collaborator

Not sure I understand this.

Collaborator (Author)

My understanding is that with dataloader_num_workers > 0, batches may not be sampled from the latest set of weights because of prefetching, but perhaps I am incorrect here.

Collaborator

It just means the DataLoader will use multiple worker subprocesses to fetch and preprocess batches in parallel with your training loop. If you set it to 0, data loading would happen in the main training process.
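The staleness concern being discussed can be illustrated with a toy prefetch buffer (pure Python, not verl's actual DataLoader): indices a worker prefetches ahead of time are fixed under the weights that were current at prefetch time, even if the trainer updates the weights before those batches are consumed.

```python
from itertools import islice

def sampler(weights):
    # toy sampler: always yields the index with the currently highest weight
    while True:
        yield max(range(len(weights)), key=weights.__getitem__)

weights = [1.0, 0.0, 0.0]
it = sampler(weights)
prefetched = list(islice(it, 2))  # a worker prefetches 2 indices ahead
weights[1] = 5.0                  # trainer updates weights mid-epoch
next_draws = list(islice(it, 2))  # fresh draws see the update...
# ...but the already-prefetched indices were chosen under the old weights
```

This is why setting dataloader_num_workers to 0 (or otherwise bounding prefetch) keeps the sampled batches consistent with the latest curriculum weights, at the cost of losing parallel data loading.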

Collaborator (Author)

Hmm, would it lead to "off-policyness" in terms of the weighting distribution? That was my concern.

@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch 8 times, most recently from f24ef65 to ef85821 Compare January 30, 2026 00:54

"${CONDA_BIN_PATH}python" -m recipe.dapo.main_dapo \
--config-path=config \
--config-name="dapo_fsdp_config_with_resampling.yaml" \
Collaborator (Author)

@nightlessbaron this config is now used for pass rate-based weighted sampling

@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch 2 times, most recently from 83d9600 to ec3726c Compare January 30, 2026 00:59
@jb3618columbia jb3618columbia force-pushed the feature/pass-rate-curriculum-learning branch from ec3726c to ecc8b8c Compare January 30, 2026 00:59
