Comment on lines 428 to 429
# else:
# is_correct = are_equal_under_sympy(ground_truth_elem, given_elem)
Collaborator
do we want to remove this?
expr = expr.replace("\\dfrac", "\\frac")
expr = expr.replace("\\frac", " \\frac")  # Play nice with mixed numbers.
expr = latex2text.LatexNodes2Text().latex_to_text(expr)
# expr = latex2text.LatexNodes2Text().latex_to_text(expr)
Collaborator
please also add the following comment: `# Added by Reasoning360`
Collaborator
imo, it would be ideal to create another directory called cispo instead of adding modifications in dapo
What does this PR do?
This PR adds CISPO to `core_algos.py` to start integrations toward a full implementation of scaleRL.
Checklist Before Starting
- PR title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- Add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`
API and Usage Example
CISPO is a sampled policy gradient loss that borrows heavily from the REINFORCE family of algorithms. It differs from GRPO and other PPO derivatives in that the importance sampling (IS) ratio itself is clipped, rather than the full policy update being clipped. This introduces two new hyperparameters, `cispo_clip_ratio_high` and `cispo_clip_ratio_low`, to control this clipping. Each defaults to 0.2.
Also, we've introduced a new policy loss function for CISPO, which is used when `loss_mode` is adjusted in the run configuration. Altogether this looks like:
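Separately from the run configuration, the clipping rule described above can be illustrated with a small, framework-free sketch. This is not the code added to `core_algos.py`. The function name `cispo_loss_sketch`, the token-mean reduction, and the reading of the two hyperparameters as bounds `1 - low` and `1 + high` on the IS ratio are assumptions made for illustration. In an autodiff framework the clipped weight would be held constant (for example with `.detach()` in PyTorch), so that gradients flow only through the log-probability term.

```python
import math


def cispo_loss_sketch(log_probs, old_log_probs, advantages,
                      clip_ratio_low=0.2, clip_ratio_high=0.2):
    """Hypothetical CISPO-style loss over a flat list of tokens.

    Returns (loss, weights), where weights are the clipped IS ratios.
    Assumes the clip bounds are [1 - clip_ratio_low, 1 + clip_ratio_high].
    """
    lower = 1.0 - clip_ratio_low
    upper = 1.0 + clip_ratio_high
    weights = []
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        # Importance sampling ratio between the current and the old policy.
        ratio = math.exp(lp - old_lp)
        # Clip the ratio itself; in autodiff code this weight would be
        # wrapped in a stop-gradient and treated as a constant.
        weight = min(max(ratio, lower), upper)
        weights.append(weight)
        # REINFORCE-style term: weighted advantage times log-probability.
        total += -weight * adv * lp
    # Token-mean reduction (an assumption; other reductions are possible).
    return total / len(weights), weights


if __name__ == "__main__":
    new_lp = [math.log(0.5), math.log(0.9), math.log(0.2)]
    old_lp = [math.log(0.5), math.log(0.5), math.log(0.4)]
    adv = [1.0, 1.0, -1.0]
    loss, weights = cispo_loss_sketch(new_lp, old_lp, adv)
    print("clipped IS weights:", weights)
    print("loss:", loss)
```

Unlike PPO-style clipping, no token is dropped from the objective here: a token whose ratio falls outside the bounds still contributes, only with a capped weight.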