feat: AI-powered headless Doctorow Gate #6
base: main
Conversation
Previously, all four verify.md sections (Pre-Verification Checklist, Smoke Test Results, Browser Verification, API Verification) required substantive content, forcing users to bypass the gate with --force for CLI-only features that have no browser or API surface. Now sections containing "N/A", "Not applicable", "Not required", or "CLI only" are accepted as valid. Section headings must still exist. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
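A minimal sketch of what such a check could look like (the `NA_MARKERS` list and `sectionIsSatisfied` helper are illustrative names, not the actual implementation):

```ts
// Illustrative only: markers that let a verify.md section pass without substantive content.
const NA_MARKERS = ["n/a", "not applicable", "not required", "cli only"];

// Hypothetical helper: a section is satisfied if it carries real content
// or contains one of the accepted "not applicable" markers.
function sectionIsSatisfied(body: string): boolean {
  const text = body.trim().toLowerCase();
  if (text.length === 0) return false; // heading exists but the section is empty
  if (NA_MARKERS.some((marker) => text.includes(marker))) return true;
  return text.length > 10; // crude stand-in for "substantive content"
}
```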
Adds automatic AI evaluation of Doctorow Gate checks when running in non-TTY environments (CI/CD, agent pipelines). Uses claude -p with Haiku for fast, cheap evaluation. Falls back to pass-by-default on AI failure to avoid blocking pipelines. Closes #5 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Default to Sonnet (claude-sonnet-4-20250514) for better reasoning on quality checks. Override via SPECFLOW_DOCTOROW_MODEL env var. Supported models:
- claude-haiku-4-5-20251001 (fast/cheap)
- claude-sonnet-4-20250514 (balanced, default)
- claude-opus-4-5-20251101 (deep reasoning)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --output-format json to claude -p invocation to ensure parseable output in environments with CLAUDE.md hooks/skills configured
- Change default model from Sonnet to Opus for deeper quality reasoning
- Model remains configurable via SPECFLOW_DOCTOROW_MODEL env var

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
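Pieced together from these commits, the headless invocation presumably looks something like the sketch below; the argument order, prompt wiring, and stream handling are assumptions, only the flags and env var come from the commits:

```ts
// Sketch of the headless claude -p invocation described above (not the exact PR code).
const model = process.env.SPECFLOW_DOCTOROW_MODEL ?? "claude-opus-4-5-20251101";
const prompt = "Evaluate this Doctorow check against the gathered artifacts..."; // built elsewhere

const proc = Bun.spawn(
  ["claude", "-p", prompt, "--model", model, "--output-format", "json"],
  { stdout: "pipe", stderr: "pipe" },
);

// --output-format json yields a JSON envelope; extraction/parsing happens downstream.
const raw = await new Response(proc.stdout).text();
```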
jcfischer left a comment
Council Review: PR #6 — AI-powered headless Doctorow Gate
Verdict: MERGE WITH CHANGES | Confidence: HIGH
Council: Engineer, Architect, Security, Researcher (4 agents, 1 round each)
Must-Fix (blocking merge)
1. Fail-closed as default, not fail-open
A quality gate that auto-passes on failure is not a gate. The current design:
```ts
// On any AI failure, pass by default
return { checkId: check.id, confirmed: true, ... };
```

This means any API outage, timeout, or malformed response silently passes the check. For a gate whose purpose is catching issues, this undermines the design.
Required change:
- Default: fail-closed (AI failure = gate failure, a human must intervene); see the sketch after this list
- Opt-in: `SPECFLOW_DOCTOROW_FAILOPEN=true` for pipelines that accept the risk
- Differentiate tags in verify.md: `[AI-evaluated]` vs `[AI-unavailable — passed by policy]`
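A sketch of what that could look like (the result shape is taken from the snippet above; the helper name is illustrative):

```ts
// Illustrative fail-closed handling: on AI failure the check only passes when
// SPECFLOW_DOCTOROW_FAILOPEN=true is set, and the evaluator tag records which path ran.
const failOpen = process.env.SPECFLOW_DOCTOROW_FAILOPEN === "true";

function onAiFailure(check: { id: string }) {
  if (failOpen) {
    // Opt-in: the pipeline accepts the risk; the tag makes the policy visible in verify.md.
    return { checkId: check.id, confirmed: true, evaluator: "AI-unavailable — passed by policy" };
  }
  // Default: fail closed; a human must re-run the gate or confirm the check manually.
  return { checkId: check.id, confirmed: false, evaluator: "AI-unavailable" };
}
```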
2. Process zombie on timeout
The timeout handler calls proc.kill() but never awaits proc.exited:
```ts
setTimeout(() => {
  proc.kill();
  resolve(null); // process may still be running
}, 30000);
```

In CI, accumulated zombies are a real problem. Fix:
```ts
setTimeout(async () => {
  proc.kill();
  await proc.exited;
  resolve(null);
}, 30000);
```

Should-Fix (recommended)
3. Default model → Sonnet, not Opus
The model default was changed across three successive commits (Haiku → Sonnet → Opus). At Opus pricing, 4 Doctorow checks cost ~$0.60-1.20 per completion; at Sonnet, ~$0.12-0.24. The system prompt is too thin for Opus to add meaningful value over Sonnet:
"You are a code quality reviewer... Return ONLY valid JSON"
This gives the AI no framework for what "pass" means, no evaluation criteria, no evidence thresholds. Recommendation: default Sonnet, with Opus opt-in via SPECFLOW_DOCTOROW_MODEL.
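If that recommendation is adopted, model resolution reduces to something like this (model IDs are from the commits; the map and function names are illustrative):

```ts
// Sonnet as the recommended default; Haiku and Opus as explicit opt-ins via env var.
const SUPPORTED_MODELS = {
  haiku: "claude-haiku-4-5-20251001",  // fast/cheap
  sonnet: "claude-sonnet-4-20250514",  // balanced, recommended default
  opus: "claude-opus-4-5-20251101",    // deep reasoning, roughly 5x Sonnet cost
} as const;

function resolveDoctorowModel(): string {
  return process.env.SPECFLOW_DOCTOROW_MODEL ?? SUPPORTED_MODELS.sonnet;
}
```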
4. Enrich the system prompt
Include: what constitutes sufficient evidence for each check type, when to fail (not just when to pass), that absence of evidence is not evidence of passing.
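One possible shape for a richer prompt, covering those three gaps (the wording is illustrative, not from the PR):

```ts
// Illustrative system prompt: evidence requirements, explicit fail conditions,
// and the "absence of evidence" rule the review asks for.
const SYSTEM_PROMPT = `You are a code quality reviewer evaluating one Doctorow Gate check.
Decide pass/fail using only the provided artifacts (spec.md, plan.md, tasks.md, verify.md, src listing).
A check passes only if the artifacts contain concrete evidence for it: named files, commands run, recorded results.
Fail the check if the evidence is missing, vague, or contradicted elsewhere in the artifacts.
Absence of evidence is NOT evidence of passing.
Return ONLY valid JSON: {"confirmed": boolean, "reasoning": string}.`;
```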
5. Integration test with mocked Bun.spawn
The test file explicitly states "Does NOT test actual claude -p calls." At minimum, one test should mock Bun.spawn to verify command construction, timeout behavior, and fail-open/closed path.
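A sketch of such a test, assuming bun:test's spyOn can wrap Bun.spawn, that evaluateCheckWithAI takes a check object, and that the code under test only reads stdout, exited, and kill from the subprocess handle:

```ts
import { expect, spyOn, test } from "bun:test";
import { evaluateCheckWithAI } from "./doctorow";

test("headless evaluation shells out to claude -p with JSON output", async () => {
  // Fake subprocess: a JSON envelope on stdout, an already-resolved exit, a no-op kill.
  const fakeProc = {
    stdout: new Response('{"result": "{\\"confirmed\\": true, \\"reasoning\\": \\"ok\\"}"}').body,
    exited: Promise.resolve(0),
    kill() {},
  };
  const spawnSpy = spyOn(Bun, "spawn").mockImplementation(
    (() => fakeProc) as unknown as typeof Bun.spawn,
  );

  await evaluateCheckWithAI({ id: "check-1", question: "Is the feature verified?" } as any);

  // Verify command construction without ever invoking the real CLI.
  const argv = spawnSpy.mock.calls[0][0] as string[];
  expect(argv[0]).toBe("claude");
  expect(argv).toContain("-p");
  expect(argv).toContain("--output-format");

  spawnSpy.mockRestore();
});
```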
Nice-to-Have
- Batch all 4 checks into a single AI call (75% cost reduction, ~4x faster); see the sketch after this list
- Filter environment variables passed to subprocess (least privilege)
- Double-write bug in `appendToVerifyMd` — `content +=` and `appendFileSync` both call `formatVerifyEntry`, but the in-memory string is never used
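A rough sketch of the batching idea (the response contract and helper name are assumptions, not the PR's current behavior):

```ts
// Illustrative batched evaluation: one claude -p call covering all four checks.
interface CheckResult {
  checkId: string;
  confirmed: boolean;
  reasoning: string;
}

async function evaluateAllChecksWithAI(
  checks: { id: string; question: string }[],
  artifacts: string,
): Promise<CheckResult[]> {
  const prompt = [
    "Evaluate each Doctorow Gate check against the project artifacts.",
    'Return ONLY a JSON array of {"checkId", "confirmed", "reasoning"} objects, one per check.',
    "Checks:",
    ...checks.map((c) => `- ${c.id}: ${c.question}`),
    "Artifacts:",
    artifacts,
  ].join("\n");

  const proc = Bun.spawn(["claude", "-p", prompt, "--output-format", "json"], { stdout: "pipe" });
  const raw = await new Response(proc.stdout).text();
  // The real implementation would route this through extractJsonFromResponse();
  // here we assume the --output-format json envelope exposes the answer under "result".
  const text = JSON.parse(raw).result as string;
  return JSON.parse(text) as CheckResult[];
}
```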
What's Solid
- `extractJsonFromResponse()` is well-designed with a layered fallback (7 tests); see the sketch after this list
- `gatherArtifacts()` handles missing files gracefully
- TTY detection routing is clean and additive (interactive mode untouched)
- `formatVerifyEntry` evaluator tag is backward compatible
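For readers who have not opened the diff, a layered fallback of this kind typically reads roughly as follows (an illustrative reconstruction, not the PR's code, hence the different function name):

```ts
// Layered JSON extraction: envelope field, then fenced json block, then embedded object.
function extractJsonLayered(response: string): string | null {
  // 1. claude -p --output-format json wraps the answer in an envelope with a "result" field.
  try {
    const envelope = JSON.parse(response);
    if (typeof envelope.result === "string") return extractJsonLayered(envelope.result);
  } catch {}

  // 2. Markdown code fence (with or without a json language tag).
  const fenced = response.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return fenced[1].trim();

  // 3. First embedded JSON object anywhere in the text.
  const embedded = response.match(/\{[\s\S]*\}/);
  if (embedded) return embedded[0];

  // 4. Nothing recoverable.
  return null;
}
```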
Summary
- Detects non-TTY environments (or the `SPECFLOW_HEADLESS=true` env var) and routes to AI evaluation
- Uses `claude -p` with the Haiku model for fast, cheap per-check evaluation with a 30s timeout

Details
New exported functions in `doctorow.ts`:
- `extractJsonFromResponse()` — robust JSON extraction from LLM responses (handles Claude wrapper, markdown blocks, embedded JSON)
- `gatherArtifacts()` — collects spec.md, plan.md, tasks.md, verify.md plus a src/ file listing for AI context
- `evaluateCheckWithAI()` — evaluates a single Doctorow check via `Bun.spawn(["claude", "-p", ...])`
- `runDoctorowGateHeadless()` — orchestrates all 4 checks and appends `[AI-evaluated]` results to verify.md

Modified:
- `runDoctorowGate()` — added TTY detection routing before readline creation (see the sketch after this list)
- `formatVerifyEntry()` — accepts an optional `evaluator` tag for AI-evaluated entries
- `appendToVerifyMd()` — passes the evaluator tag through

Existing interactive mode is completely unchanged.
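The routing change in `runDoctorowGate()` presumably amounts to something like this (`runDoctorowGateHeadless` is exported by the PR, but its signature and the interactive helper name here are assumptions):

```ts
import { runDoctorowGateHeadless } from "./doctorow";

// Stand-in declaration for the existing interactive flow, which the PR leaves untouched.
declare function runInteractiveDoctorowGate(specDir: string): Promise<boolean>;

export async function runDoctorowGate(specDir: string): Promise<boolean> {
  // Route before any readline interface is created, so CI and agent pipelines never block on a prompt.
  const headless = !process.stdin.isTTY || process.env.SPECFLOW_HEADLESS === "true";
  if (headless) {
    return runDoctorowGateHeadless(specDir); // AI-evaluated path added by this PR
  }
  return runInteractiveDoctorowGate(specDir); // existing interactive mode, unchanged
}
```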
Closes #5
Test plan
- Unit tests cover `extractJsonFromResponse`, `gatherArtifacts`, `formatVerifyEntry` with evaluator, and headless routing detection

Generated with Claude Code