feat: AI-powered headless Doctorow Gate #6
base: main
Conversation
Previously, all four verify.md sections (Pre-Verification Checklist, Smoke Test Results, Browser Verification, API Verification) required substantive content, forcing users to bypass the gate with --force for CLI-only features that have no browser or API surface. Now sections containing "N/A", "Not applicable", "Not required", or "CLI only" are accepted as valid. Section headings must still exist. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
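A minimal sketch of what such a check could look like (the `NA_MARKERS` list and `sectionIsSatisfied` helper are illustrative names, not the actual implementation):

```ts
// Illustrative only: markers that let a verify.md section pass without substantive content.
const NA_MARKERS = ["n/a", "not applicable", "not required", "cli only"];

// Hypothetical helper: a section is satisfied if it carries real content
// or contains one of the accepted "not applicable" markers.
function sectionIsSatisfied(body: string): boolean {
  const text = body.trim().toLowerCase();
  if (text.length === 0) return false; // heading exists but the section is empty
  if (NA_MARKERS.some((marker) => text.includes(marker))) return true;
  return text.length > 10; // crude stand-in for "substantive content"
}
```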
Adds automatic AI evaluation of Doctorow Gate checks when running in non-TTY environments (CI/CD, agent pipelines). Uses claude -p with Haiku for fast, cheap evaluation. Falls back to pass-by-default on AI failure to avoid blocking pipelines. Closes #5 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Default to Sonnet (claude-sonnet-4-20250514) for better reasoning on quality checks. Override via SPECFLOW_DOCTOROW_MODEL env var. Supported models:
- claude-haiku-4-5-20251001 (fast/cheap)
- claude-sonnet-4-20250514 (balanced, default)
- claude-opus-4-5-20251101 (deep reasoning)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --output-format json to claude -p invocation to ensure parseable output in environments with CLAUDE.md hooks/skills configured
- Change default model from Sonnet to Opus for deeper quality reasoning
- Model remains configurable via SPECFLOW_DOCTOROW_MODEL env var

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
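Pieced together from these commits, the headless invocation presumably looks something like the sketch below; the argument order, prompt wiring, and stream handling are assumptions, only the flags and env var come from the commits:

```ts
// Sketch of the headless claude -p invocation described above (not the exact PR code).
const model = process.env.SPECFLOW_DOCTOROW_MODEL ?? "claude-opus-4-5-20251101";
const prompt = "Evaluate this Doctorow check against the gathered artifacts..."; // built elsewhere

const proc = Bun.spawn(
  ["claude", "-p", prompt, "--model", model, "--output-format", "json"],
  { stdout: "pipe", stderr: "pipe" },
);

// --output-format json yields a JSON envelope; extraction/parsing happens downstream.
const raw = await new Response(proc.stdout).text();
```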
jcfischer left a comment
Council Review: PR #6 — AI-powered headless Doctorow Gate
Verdict: MERGE WITH CHANGES | Confidence: HIGH
Council: Engineer, Architect, Security, Researcher (4 agents, 1 round each)
Must-Fix (blocking merge)
1. Fail-closed as default, not fail-open
A quality gate that auto-passes on failure is not a gate. The current design:
```ts
// On any AI failure, pass by default
return { checkId: check.id, confirmed: true, ... };
```

This means any API outage, timeout, or malformed response silently passes the check. For a gate whose purpose is catching issues, this undermines the design.
Required change:
- Default: fail-closed (AI failure = gate failure, a human must intervene); see the sketch after this list
- Opt-in: `SPECFLOW_DOCTOROW_FAILOPEN=true` for pipelines that accept the risk
- Differentiate tags in verify.md: `[AI-evaluated]` vs `[AI-unavailable — passed by policy]`
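A sketch of what that could look like (the result shape is taken from the snippet above; the helper name is illustrative):

```ts
// Illustrative fail-closed handling: on AI failure the check only passes when
// SPECFLOW_DOCTOROW_FAILOPEN=true is set, and the evaluator tag records which path ran.
const failOpen = process.env.SPECFLOW_DOCTOROW_FAILOPEN === "true";

function onAiFailure(check: { id: string }) {
  if (failOpen) {
    // Opt-in: the pipeline accepts the risk; the tag makes the policy visible in verify.md.
    return { checkId: check.id, confirmed: true, evaluator: "AI-unavailable — passed by policy" };
  }
  // Default: fail closed; a human must re-run the gate or confirm the check manually.
  return { checkId: check.id, confirmed: false, evaluator: "AI-unavailable" };
}
```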
2. Process zombie on timeout
The timeout handler calls proc.kill() but never awaits proc.exited:
```ts
setTimeout(() => {
  proc.kill();
  resolve(null); // process may still be running
}, 30000);
```

In CI, accumulated zombies are a real problem. Fix:
```ts
setTimeout(async () => {
  proc.kill();
  await proc.exited;
  resolve(null);
}, 30000);
```

Should-Fix (recommended)
3. Default model → Sonnet, not Opus
The model default was changed across three successive commits (Haiku → Sonnet → Opus). At Opus pricing, 4 Doctorow checks cost ~$0.60-1.20 per completion; at Sonnet, ~$0.12-0.24. The system prompt is too thin for Opus to add meaningful value over Sonnet:
"You are a code quality reviewer... Return ONLY valid JSON"
This gives the AI no framework for what "pass" means, no evaluation criteria, no evidence thresholds. Recommendation: default Sonnet, with Opus opt-in via SPECFLOW_DOCTOROW_MODEL.
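If that recommendation is adopted, model resolution reduces to something like this (model IDs are from the commits; the map and function names are illustrative):

```ts
// Sonnet as the recommended default; Haiku and Opus as explicit opt-ins via env var.
const SUPPORTED_MODELS = {
  haiku: "claude-haiku-4-5-20251001",  // fast/cheap
  sonnet: "claude-sonnet-4-20250514",  // balanced, recommended default
  opus: "claude-opus-4-5-20251101",    // deep reasoning, roughly 5x Sonnet cost
} as const;

function resolveDoctorowModel(): string {
  return process.env.SPECFLOW_DOCTOROW_MODEL ?? SUPPORTED_MODELS.sonnet;
}
```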
4. Enrich the system prompt
Include: what constitutes sufficient evidence for each check type, when to fail (not just when to pass), that absence of evidence is not evidence of passing.
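One possible shape for a richer prompt, covering those three gaps (the wording is illustrative, not from the PR):

```ts
// Illustrative system prompt: evidence requirements, explicit fail conditions,
// and the "absence of evidence" rule the review asks for.
const SYSTEM_PROMPT = `You are a code quality reviewer evaluating one Doctorow Gate check.
Decide pass/fail using only the provided artifacts (spec.md, plan.md, tasks.md, verify.md, src listing).
A check passes only if the artifacts contain concrete evidence for it: named files, commands run, recorded results.
Fail the check if the evidence is missing, vague, or contradicted elsewhere in the artifacts.
Absence of evidence is NOT evidence of passing.
Return ONLY valid JSON: {"confirmed": boolean, "reasoning": string}.`;
```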
5. Integration test with mocked Bun.spawn
The test file explicitly states "Does NOT test actual claude -p calls." At minimum, one test should mock Bun.spawn to verify command construction, timeout behavior, and fail-open/closed path.
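A sketch of such a test, assuming bun:test's spyOn can wrap Bun.spawn, that evaluateCheckWithAI takes a check object, and that the code under test only reads stdout, exited, and kill from the subprocess handle:

```ts
import { expect, spyOn, test } from "bun:test";
import { evaluateCheckWithAI } from "./doctorow";

test("headless evaluation shells out to claude -p with JSON output", async () => {
  // Fake subprocess: a JSON envelope on stdout, an already-resolved exit, a no-op kill.
  const fakeProc = {
    stdout: new Response('{"result": "{\\"confirmed\\": true, \\"reasoning\\": \\"ok\\"}"}').body,
    exited: Promise.resolve(0),
    kill() {},
  };
  const spawnSpy = spyOn(Bun, "spawn").mockImplementation(
    (() => fakeProc) as unknown as typeof Bun.spawn,
  );

  await evaluateCheckWithAI({ id: "check-1", question: "Is the feature verified?" } as any);

  // Verify command construction without ever invoking the real CLI.
  const argv = spawnSpy.mock.calls[0][0] as string[];
  expect(argv[0]).toBe("claude");
  expect(argv).toContain("-p");
  expect(argv).toContain("--output-format");

  spawnSpy.mockRestore();
});
```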
Nice-to-Have
- Batch all 4 checks into a single AI call (75% cost reduction, ~4x faster); see the sketch after this list
- Filter environment variables passed to subprocess (least privilege)
- Double-write bug in `appendToVerifyMd` — `content +=` and `appendFileSync` both call `formatVerifyEntry`, but the in-memory string is never used
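A rough sketch of the batching idea (the response contract and helper name are assumptions, not the PR's current behavior):

```ts
// Illustrative batched evaluation: one claude -p call covering all four checks.
interface CheckResult {
  checkId: string;
  confirmed: boolean;
  reasoning: string;
}

async function evaluateAllChecksWithAI(
  checks: { id: string; question: string }[],
  artifacts: string,
): Promise<CheckResult[]> {
  const prompt = [
    "Evaluate each Doctorow Gate check against the project artifacts.",
    'Return ONLY a JSON array of {"checkId", "confirmed", "reasoning"} objects, one per check.',
    "Checks:",
    ...checks.map((c) => `- ${c.id}: ${c.question}`),
    "Artifacts:",
    artifacts,
  ].join("\n");

  const proc = Bun.spawn(["claude", "-p", prompt, "--output-format", "json"], { stdout: "pipe" });
  const raw = await new Response(proc.stdout).text();
  // The real implementation would route this through extractJsonFromResponse();
  // here we assume the --output-format json envelope exposes the answer under "result".
  const text = JSON.parse(raw).result as string;
  return JSON.parse(text) as CheckResult[];
}
```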
What's Solid
- `extractJsonFromResponse()` is well-designed with a layered fallback (7 tests); see the sketch after this list
- `gatherArtifacts()` handles missing files gracefully
- TTY detection routing is clean and additive (interactive mode untouched)
- `formatVerifyEntry` evaluator tag is backward compatible
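For readers who have not opened the diff, a layered fallback of this kind typically reads roughly as follows (an illustrative reconstruction, not the PR's code, hence the different function name):

```ts
// Layered JSON extraction: envelope field, then fenced json block, then embedded object.
function extractJsonLayered(response: string): string | null {
  // 1. claude -p --output-format json wraps the answer in an envelope with a "result" field.
  try {
    const envelope = JSON.parse(response);
    if (typeof envelope.result === "string") return extractJsonLayered(envelope.result);
  } catch {}

  // 2. Markdown code fence (with or without a json language tag).
  const fenced = response.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return fenced[1].trim();

  // 3. First embedded JSON object anywhere in the text.
  const embedded = response.match(/\{[\s\S]*\}/);
  if (embedded) return embedded[0];

  // 4. Nothing recoverable.
  return null;
}
```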
Summary
- Detects non-TTY environments (or the `SPECFLOW_HEADLESS=true` env var) and routes to AI evaluation
- Uses `claude -p` with the Haiku model for fast, cheap per-check evaluation with a 30s timeout

Details
New exported functions in `doctorow.ts`:
- `extractJsonFromResponse()` — robust JSON extraction from LLM responses (handles Claude wrapper, markdown blocks, embedded JSON)
- `gatherArtifacts()` — collects spec.md, plan.md, tasks.md, verify.md plus a src/ file listing for AI context
- `evaluateCheckWithAI()` — evaluates a single Doctorow check via `Bun.spawn(["claude", "-p", ...])`
- `runDoctorowGateHeadless()` — orchestrates all 4 checks and appends `[AI-evaluated]` results to verify.md

Modified:
- `runDoctorowGate()` — added TTY detection routing before readline creation (see the sketch after this list)
- `formatVerifyEntry()` — accepts an optional `evaluator` tag for AI-evaluated entries
- `appendToVerifyMd()` — passes the evaluator tag through

Existing interactive mode is completely unchanged.
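The routing change in `runDoctorowGate()` presumably amounts to something like this (`runDoctorowGateHeadless` is exported by the PR, but its signature and the interactive helper name here are assumptions):

```ts
import { runDoctorowGateHeadless } from "./doctorow";

// Stand-in declaration for the existing interactive flow, which the PR leaves untouched.
declare function runInteractiveDoctorowGate(specDir: string): Promise<boolean>;

export async function runDoctorowGate(specDir: string): Promise<boolean> {
  // Route before any readline interface is created, so CI and agent pipelines never block on a prompt.
  const headless = !process.stdin.isTTY || process.env.SPECFLOW_HEADLESS === "true";
  if (headless) {
    return runDoctorowGateHeadless(specDir); // AI-evaluated path added by this PR
  }
  return runInteractiveDoctorowGate(specDir); // existing interactive mode, unchanged
}
```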
Closes #5
Test plan
- Unit tests cover `extractJsonFromResponse`, `gatherArtifacts`, `formatVerifyEntry` with evaluator, and headless routing detection

Generated with Claude Code