This repository implements the code and experiments for Pareto Optimal Code Generation. The system uses outcome reward models (ORMs) with staged verification to shift the Pareto frontier of code generation, achieving higher throughput at some cost in accuracy relative to full test-suite verification. Key features include:
- Training and evaluating code verification models
- Multiple scoring methods (binary logit, classification, reward modeling); a minimal scoring sketch follows this list
- Comprehensive evaluation across multiple benchmark datasets
- Efficient pruning strategies for scalable verification
- Support for various transformer architectures
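To make the scoring modes concrete, here is a minimal, illustrative sketch of binary-logit verifier scoring: the ORM assigns each candidate solution a probability of passing, and the highest-scoring candidate is kept. The checkpoint path and prompt are hypothetical placeholders; the repository's actual implementations live in `src/scoring.py` and `src/modeling.py`.

```python
# Illustrative sketch only -- not the repository's scoring code.
# Assumes a sequence-classification checkpoint with a single scalar head;
# the checkpoint path is a hypothetical placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/orm-checkpoint"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def score(question: str, solution: str) -> float:
    """Return P(solution passes) under the verifier (binary-logit scoring)."""
    inputs = tokenizer(question, solution, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze(-1)  # one scalar per sequence
    return torch.sigmoid(logit).item()

candidates = ["def add(a, b): return a + b", "def add(a, b): return a - b"]
scores = [score("Write add(a, b).", c) for c in candidates]
best = candidates[max(range(len(candidates)), key=scores.__getitem__)]
```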
```
.
├── configs/                 # Configuration files for experiments and evaluation
│   ├── evaluation/          # Evaluation configs for running base models
│   ├── experiments/         # Full experiment configs
│   ├── model/               # Configs for different architectures
│   ├── preprocessing/       # Prompting configs
│   ├── scoring/             # Configs for different scoring methods
│   ├── suite/               # Suite configurations for evaluation
│   └── trainer/             # Training configs
├── scripts/
│   ├── data/                # Data processing and generation
│   └── exec_trials/         # Execution trial implementations
├── src/
│   ├── evaluation/          # Evaluation suite and benchmarks
│   ├── modeling.py          # Model architectures
│   ├── preprocessing.py     # Data preparation
│   ├── scoring.py           # Solution scoring
│   └── training/            # Training pipeline
└── figs/                    # Project figures and diagrams
```
For detailed information about specific components, see the documentation referenced in the sections below.

To get started, clone the repository:

```bash
git clone https://github.com/SprocketLab/orm-code-verifier.git
cd orm-code-verifier
```

The dependencies for training and evaluation can be installed with:

```bash
pip install -r requirements.txt
```

Additional commands to run:
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git scratch/bigcode --depth=1
cd scratch/bigcode
pip install -e .
cd ../..
pip install flash-attn --no-build-isolation
```

To build the training data, run:

```bash
python scripts/make_train_data.py \
    --num_proc=4 \
    --black_format \
    --require_pf
```

This will format the training data and save it to disk so it can be loaded faster.
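Optionally, you can sanity-check the cached data before training. This sketch assumes the script writes a Hugging Face `datasets` directory with a single split; the path below is a hypothetical placeholder for wherever the script reports saving.

```python
# Illustrative sanity check; path and single-split layout are assumptions.
from datasets import load_from_disk

ds = load_from_disk("data/processed/train")  # hypothetical path
print(ds)      # column names and row count
print(ds[0])   # one formatted example
```

Then you can run: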
```bash
bash scripts/experiment.sh rm_qsol qwen25-coder-1_5b {DEVICE} {SEED} \
    --precision=bf16 \
    --num_workers=4 \
    --real_batch_size=64 \
    --overwrite \
    --batch_size=2 \
    --val_batch_tokens=12000 \
    --gradient_checkpointing=True \
    --eval_batch_tokens=200000
```

Notes:
- `rm_qsol` is the experiment to run; look at the other experiment configs for different setups.
- `qsol` is just the formatting setup for the sequences, located in the preprocessing config directory.
- We use seeds 1, 1999, and 2024 for the experiments in the paper.
The system supports three types of execution trials for comprehensive evaluation:
- Execution Timing: Measure performance and resource usage
- Syntax Validation: Check that generated code parses (a minimal sketch follows this list)
- Linting Checks: Ensure code quality
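As a rough illustration of the syntax-validation trial (the real implementations live under `scripts/exec_trials/`), a candidate that does not even parse can be pruned without running any tests:

```python
# Illustrative sketch only -- not the repository's trial code.
import ast

def passes_syntax(code: str) -> bool:
    """A program that does not parse can be pruned before any test runs."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(passes_syntax("def f(x): return x + 1"))   # True
print(passes_syntax("def f(x) return x + 1"))    # False
```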
To run the strongest verifier:
```bash
bash scripts/exec_trials/trial.sh code_contests qc-inst-7b t1.0_n128 32 outputs/ftp32_code_contests 5
```

Key configuration parameters (a sketch of the timeout and test-cap logic follows this list):
- Temperature and sample size (e.g., t1.0_n128 = temperature 1.0, 128 samples)
- Number of parallel workers
- Test execution timeouts
- Maximum tests per problem
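The following sketch shows how a per-test timeout and a per-problem test cap might interact. It is illustrative only; the constants and function names are hypothetical rather than the repository's actual harness.

```python
# Illustrative sketch of timeout + test-cap enforcement; all names hypothetical.
import subprocess
import sys

TIMEOUT_S = 5    # per-test execution timeout
MAX_TESTS = 10   # cap on tests per problem

def run_test(solution_path: str, test_input: str, expected: str) -> bool:
    """Run one stdin/stdout test; a hang or crash counts as a failure."""
    try:
        out = subprocess.run(
            [sys.executable, solution_path],
            input=test_input, capture_output=True, text=True, timeout=TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False
    return out.stdout.strip() == expected.strip()

def verify(solution_path: str, tests: list[tuple[str, str]]) -> bool:
    """A solution passes only if it passes every test, up to the cap."""
    return all(run_test(solution_path, i, o) for i, o in tests[:MAX_TESTS])
```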
For detailed configuration options and security considerations, see the Execution Trials Documentation.
The system provides multiple evaluation configurations, each serving different verification purposes:
- Base (zero_shot): Basic verification without additional checks
- Syntax (zero_shot_syntax): Focuses on syntactic correctness
- Lint (zero_shot_lint): Enforces code style and quality
- N Test: Variants such as `zero_shot_3s10t` (used in the example below) that limit the number of tests
To run evaluation with a specific configuration:
```bash
accelerate launch \
    --gpu_ids 0 \
    --mixed_precision=bf16 \
    --config_file=configs/accelerate.yaml \
    evaluate_model.py \
    --precision=bf16 \
    --device=0 \
    --group={WANDB_GROUP_NAME} \
    --overwrite \
    --max_tokens_per_batch=6000 \
    --seed={SEED} \
    --num_workers=16 \
    qc-inst-7b \
    t1.0_n128 \
    checkpoint \
    {CHECKPOINT_PATH} \
    zero_shot_3s10t
```
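Once scores are written out, they are typically consumed by picking the highest-scoring candidate per problem (best-of-n). A minimal sketch, assuming a JSONL output with hypothetical `problem_id`, `score`, and `passed` fields:

```python
# Illustrative best-of-n selection; the file path and field names are
# hypothetical stand-ins for whatever the evaluation run writes out.
import json
from collections import defaultdict

by_problem = defaultdict(list)
with open("outputs/scores.jsonl") as f:  # hypothetical output file
    for line in f:
        rec = json.loads(line)
        by_problem[rec["problem_id"]].append(rec)

# Keep only the highest-scoring candidate per problem, then measure
# how often that single pick actually passes its tests.
picks = [max(cands, key=lambda r: r["score"]) for cands in by_problem.values()]
accuracy = sum(p["passed"] for p in picks) / len(picks)
print(f"best-of-n accuracy: {accuracy:.3f}")
```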