███████╗██╗ ██╗██╗██╗ ██╗ ██╗███████╗███████╗██╗ ██╗███████╗███████╗
██╔════╝██║ ██╔╝██║██║ ██║ ██║██╔════╝██╔════╝██║ ██║██╔════╝██╔════╝
███████╗█████╔╝ ██║██║ ██║ ██║███████╗███████╗██║ ██║█████╗ ███████╗
╚════██║██╔═██╗ ██║██║ ██║ ██║╚════██║╚════██║██║ ██║██╔══╝ ╚════██║
███████║██║ ██╗██║███████╗███████╗ ██║███████║███████║╚██████╔╝███████╗███████║
╚══════╝╚═╝ ╚═╝╚═╝╚══════╝╚══════╝ ╚═╝╚══════╝╚══════╝ ╚═════╝ ╚══════╝╚══════╝
DSL AI Code Generation Evaluation Framework
This repository contains two main components:
- Skills - Reusable knowledge packages that improve AI code generation quality
- Eval Harness - A multi-stage evaluation system for measuring and improving AI-generated code
The step-loop is our primary evaluation tool. It breaks complex coding tasks into incremental steps, validates each step, and produces production-quality code.
```bash
./skill-issues run cairo-trapping-rain-water-01
```

That's it. The CLI infers paths, applies default skills, and runs the full evaluation.
With options:
```bash
./skill-issues run cairo-trapping-rain-water-01 \
  -m claude-opus-4-20250514 \
  --clean \
  -v
```

Other commands:
```bash
./skill-issues list                                 # Show available prompts
./skill-issues status cairo-trapping-rain-water-01  # Check run status
./skill-issues clean cairo-trapping-rain-water-01   # Remove generated files
```

The `run` command:
- Reads a 6-step prompt (brute force → DP → two-pointer optimization)
- Generates code incrementally, validating each step with `scarb build`
- Runs tests with `snforge test` at completion
- Produces a modular multi-file project structure
- Applies the `cairo-quirks` and `cairo-quality` skills for better output
```
eval/work/cairo-trapping-rain-water-01/
├── Scarb.toml
├── src/
│   ├── lib.cairo        # Module exports
│   └── solution.cairo   # Implementation (3 algorithms)
└── tests/
    └── test_lib.cairo   # 17+ integration tests
```
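Per the tree comment, `lib.cairo` holds only the module exports; a minimal sketch (hypothetical, assuming the layout above):

```cairo
// src/lib.cairo -- hypothetical module exports matching the tree above
pub mod solution;
```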
The generated `solution.cairo` includes:

- `trap_brute_force()` - O(n²) time, O(1) space
- `trap_dp()` - O(n) time, O(n) space
- `trap()` - O(n) time, O(1) space (optimal two-pointer; sketched below)
- Full documentation with complexity analysis
- Comprehensive test coverage
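For illustration, the two-pointer variant might look like this (a sketch assuming Cairo 2.x with `while` loops, not a verbatim excerpt of generated output):

```cairo
/// Computes trapped rain water with two converging pointers.
/// Time: O(n), Space: O(1).
fn trap(height: @Array<u32>) -> u32 {
    if height.len() == 0 {
        return 0;
    }
    let mut left: u32 = 0;
    let mut right: u32 = height.len() - 1;
    let mut left_max: u32 = 0;
    let mut right_max: u32 = 0;
    let mut water: u32 = 0;
    while left < right {
        if *height.at(left) < *height.at(right) {
            // Water over `left` is bounded by the tallest bar seen so far on the left.
            let h = *height.at(left);
            if h >= left_max {
                left_max = h;
            } else {
                water += left_max - h;
            }
            left += 1;
        } else {
            // Symmetric case: bounded by the tallest bar seen so far on the right.
            let h = *height.at(right);
            if h >= right_max {
                right_max = h;
            } else {
                water += right_max - h;
            }
            right -= 1;
        }
    }
    water
}
```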
AI code generators often produce code that:
- Compiles but has subtle bugs
- Uses suboptimal algorithms
- Has poor structure (everything in one file)
- Lacks documentation and tests
- Contains unused imports and lint warnings
This system addresses these issues through:
- Incremental validation - Each step must compile before proceeding
- Skills - Domain knowledge injected into prompts
- Multi-file structure - Proper separation of concerns
- Quality skills - Guidelines for DRY, complexity, documentation
```
skill-issues/
├── skills/                # Reusable skill packages
│   ├── cairo-quirks/      # Cairo language patterns
│   └── cairo-quality/     # Code quality guidelines
├── eval/
│   ├── prompts/           # Task definitions (one per file)
│   ├── rubrics/           # Pass/fail criteria
│   ├── work/              # Generated projects (gitignored)
│   └── ralph/
│       ├── step-loop.sh   # Main evaluation runner
│       └── .state/        # Execution state (gitignored)
└── dist/                  # Packaged .skill files
```
Option A — User-scoped (available in all repos)

```bash
mkdir -p ~/.codex/skills
cp -R ./skills/cairo-* ~/.codex/skills/
```

Option B — Repo-scoped (checked into this repo)

```bash
mkdir -p ./.codex/skills
cp -R ./skills/cairo-* ./.codex/skills/
```

Using packaged .skill files:

```bash
mkdir -p ~/.codex/skills
unzip ./dist/cairo-*.skill -d ~/.codex/skills
```

Prerequisites:

- Scarb - Cairo package manager
- snforge - Starknet testing framework
- `claude` CLI or `codex` CLI for AI backends
Documentation:

- Eval Harness Overview - Full evaluation system docs
- Step Loop Guide - Detailed step-loop documentation
- Prompts Guide - How to write prompts
- Rubrics Guide - How to write rubrics (also see `eval/rubrics/`)
How the step-loop works:

```
┌──────────────────────────────────────────────────────────┐
│                       step-loop.sh                       │
├──────────────────────────────────────────────────────────┤
│ 1. Parse prompt into steps                               │
│ 2. Scaffold project (scarb new)                          │
│ 3. For each step:                                        │
│    a. Build prompt with accumulated code + skills        │
│    b. Call LLM backend (claude/codex)                    │
│    c. Extract code from response                         │
│    d. Write to project files                             │
│    e. Validate (scarb check → scarb build)               │
│    f. On failure: retry with error feedback (up to 3x)   │
│    g. Record metrics                                     │
│ 4. Run tests (snforge test)                              │
│ 5. Run linter (scarb lint)                               │
│ 6. Output final metrics                                  │
└──────────────────────────────────────────────────────────┘
```
Skills are markdown files that provide domain-specific knowledge to improve code generation.
The `cairo-quirks` skill covers Cairo language patterns and common pitfalls:
- Array immutability and ownership
- Felt252 vs u256 usage
- Storage patterns for Starknet
- Common compiler errors and fixes
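One such pitfall, as an illustrative example (assumed content, not a verbatim excerpt from the skill): `felt252` arithmetic wraps modulo the field prime, while the fixed-width unsigned types panic on overflow.

```cairo
// felt252 addition wraps modulo the field prime P = 2^251 + 17*2^192 + 1,
// so overflow is silent.
fn wraps_silently(a: felt252, b: felt252) -> felt252 {
    a + b // never panics: the result is reduced mod P
}

// u256 addition panics on overflow instead, which is usually what you want
// for arithmetic on user-supplied values.
fn checked(a: u256, b: u256) -> u256 {
    a + b // panics if the sum exceeds 2^256 - 1
}
```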
The `cairo-quality` skill covers code quality guidelines:
- Algorithm documentation (time/space complexity)
- DRY principles
- Unused import prevention
- Naming conventions
- Test quality standards
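A hypothetical sample meeting these standards (assuming snforge-style `#[test]` functions and the `assert_eq!` macro):

```cairo
/// Sums an array of u32 values.
///
/// Time: O(n), Space: O(1).
fn sum(values: @Array<u32>) -> u32 {
    let mut total: u32 = 0;
    let mut i: u32 = 0;
    while i < values.len() {
        total += *values.at(i);
        i += 1;
    }
    total
}

#[test]
fn test_sum_handles_empty_array() {
    // Edge case gets its own descriptively named test.
    let values: Array<u32> = array![];
    assert_eq!(sum(@values), 0);
}
```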
Each run produces metrics at `.state/<project>/metrics.json`:

```json
{
  "prompt_id": "cairo-trapping-rain-water-01",
  "total_steps": 6,
  "steps_completed": 6,
  "total_iterations": 6,
  "lint_warnings": 0,
  "tests_passed": 17,
  "tests_failed": 0,
  "status": "completed"
}
```

To extend the framework:

- Add a prompt: Create `eval/prompts/<id>.md` with step-by-step tasks
- Add a rubric: Create `eval/rubrics/<id>.md` with pass/fail criteria
- Run an evaluation: Use the step-loop to test generation quality
- Improve skills: Add patterns that fix common failures
MIT