[algo] Adding CISPO policy loss by twkillian · Pull Request #150 · LLM360/Reasoning360

twkillian · 2025-11-06T17:59:25Z

What does this PR do?

This PR adds the CISPO policy loss to core_algos.py as a first step toward a full implementation of ScaleRL.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

CISPO is a sampled policy-gradient loss that borrows heavily from the REINFORCE family of algorithms. It differs from GRPO and other PPO derivatives in that the importance-sampling (IS) ratio is clipped directly, rather than clipping the full policy objective as PPO-style losses do. This introduces two new hyperparameters, cispo_clip_ratio_high and cispo_clip_ratio_low, to control the clipping. Each defaults to 0.2.
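To make the difference concrete, here is a minimal sketch of a CISPO-style loss in plain Python. It is not the code added to core_algos.py: the function name, the flat per-token list inputs, and the token-mean aggregation are assumptions made for illustration. It also clips the IS ratio on both sides using the two hyperparameters, whereas a later commit in this PR mentions making the clipping one-sided, so the actual implementation may differ.

```python
import math


def cispo_loss_sketch(log_prob, old_log_prob, advantages, mask,
                      clip_ratio_low=0.2, clip_ratio_high=0.2):
    """Illustrative CISPO-style loss over flat per-token lists (hypothetical helper).

    The clipped IS weight is treated as a constant. In an autodiff framework
    it would be detached (stop-gradient), so gradients flow only through
    log_prob, as in REINFORCE.
    """
    total, count = 0.0, 0
    for lp, old_lp, adv, m in zip(log_prob, old_log_prob, advantages, mask):
        if not m:
            continue  # skip padding / non-response tokens
        ratio = math.exp(lp - old_lp)  # importance-sampling ratio
        # Clip the IS ratio itself, rather than clipping the surrogate objective.
        weight = min(max(ratio, 1.0 - clip_ratio_low), 1.0 + clip_ratio_high)
        total += -weight * adv * lp
        count += 1
    return total / max(count, 1)


# On-policy tokens have ratio 1, so the loss reduces to plain REINFORCE.
on_policy = cispo_loss_sketch([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0], [1, 1])

# A token whose ratio is e (about 2.72) is weighted by the upper bound 1.2.
clipped = cispo_loss_sketch([-0.5], [-1.5], [1.0], [1])
```

Because the weight is a constant, a token whose ratio falls outside the clip range still contributes a gradient through its log-probability, which is the main behavioral difference from PPO-style clipping.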

We've also introduced a new policy loss function for CISPO, which is selected by setting loss_mode in the run configuration. Altogether, this looks like:

actor_rollout_ref.actor.policy_loss.loss_mode=cispo \
actor_rollout_ref.actor.policy_loss.cispo_clip_ratio_high=0.2 \
actor_rollout_ref.actor.policy_loss.cispo_clip_ratio_low=0.2 \
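For readers unfamiliar with how a loss_mode string can select a loss function, the following is a rough sketch of name-based dispatch. Everything in it (the registry dict, the decorator, the placeholder loss, and the config dict shape) is hypothetical and written for illustration; it is not the lookup mechanism used in this repository.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a loss_mode string to a loss function.
POLICY_LOSSES: Dict[str, Callable[..., float]] = {}


def register_loss(name: str):
    """Decorator that stores a loss function under the given loss_mode name."""
    def decorator(fn):
        POLICY_LOSSES[name] = fn
        return fn
    return decorator


@register_loss("cispo")
def cispo_placeholder(**kwargs) -> float:
    # Placeholder body; a real loss would compute the CISPO objective here.
    return 0.0


def select_policy_loss(policy_loss_cfg: dict) -> Callable[..., float]:
    """Return the loss function named by policy_loss_cfg['loss_mode']."""
    mode = policy_loss_cfg["loss_mode"]
    if mode not in POLICY_LOSSES:
        raise ValueError(f"Unknown loss_mode: {mode!r}")
    return POLICY_LOSSES[mode]


# Mirrors the three overrides shown above, expressed as a plain dict.
cfg = {"loss_mode": "cispo", "cispo_clip_ratio_high": 0.2, "cispo_clip_ratio_low": 0.2}
loss_fn = select_policy_loss(cfg)
```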

nightlessbaron · 2025-12-10T07:00:14Z

verl/utils/reward_score/naive_dapo.py

+            # else:
+            #     is_correct = are_equal_under_sympy(ground_truth_elem, given_elem)


do we want to remove this?

nightlessbaron · 2025-12-10T07:02:59Z

verl/utils/reward_score/prime_math/__init__.py

    expr = expr.replace("\\dfrac", "\\frac")
    expr = expr.replace("\\frac", " \\frac")  # Play nice with mixed numbers.
-    expr = latex2text.LatexNodes2Text().latex_to_text(expr)
+    # expr = latex2text.LatexNodes2Text().latex_to_text(expr)


please also add the following comment # Added by Reasoning360

nightlessbaron · 2025-12-10T07:03:59Z

recipe/dapo/dapo_ray_trainer.py

imo, it would be ideal to create another directory called cispo instead of adding modifications in dapo

twkillian added 3 commits November 5, 2025 19:34

Added CISPO loss function

4d704a0

Configurations and small debugging to get CISPO running

5c448ad

Completing verl-expected policy loss signature

015a6b8

twkillian requested a review from nightlessbaron November 6, 2025 17:59

twkillian added 6 commits November 8, 2025 00:25

Update to rull full on-policy

154135e

Fixing CISPO IS clipping to be one-sided as intended

fd9fd42

Sync

73d2b95

sync

99c7c65

Sync

9019074

Sync with Latex and sympy commented out from naive_dapo.py

312ed82

nightlessbaron reviewed Dec 10, 2025


twkillian and others added 10 commits December 11, 2025 05:11

Sync

61cb565

Updating rewards to better handle OmniMath scoring

dc631bd

Sync

b405647

Updated reward functions from the async branch

235af1a

sync

68c4492

Synch

e132514

fix reward functions

dc4300c

delete 7B script

8e671a0

add parsing timeout

b69d7e7

fix critical bugs in reward functions

47b4ccc

twkillian wants to merge 19 commits into verl-latest from verl-latest-cispo
