@mellanon (Contributor) commented Feb 2, 2026

Summary

  • Adds AI-powered headless mode for the Doctorow Gate, enabling non-interactive evaluation in CI/CD and agent pipelines
  • Auto-detects non-TTY environments or SPECFLOW_HEADLESS=true env var and routes to AI evaluation
  • Uses claude -p with Haiku model for fast, cheap per-check evaluation with 30s timeout
  • Falls back to pass-by-default on AI failure to never block pipelines
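
The routing rule in the second bullet can be sketched as a small pure function. This is a minimal illustration, not the code from the PR: `shouldRunHeadless` and its parameters are hypothetical names, and the sketch assumes the env var is matched against the literal string `true`.

```typescript
// Hypothetical helper: the PR does not show its actual detection code.
type Env = Record<string, string | undefined>;

function shouldRunHeadless(env: Env, stdinIsTTY: boolean): boolean {
  // An explicit opt-in via the env var wins regardless of terminal state.
  if (env.SPECFLOW_HEADLESS === "true") return true;
  // Otherwise route to AI evaluation only when no interactive terminal is attached.
  return !stdinIsTTY;
}
```

A caller would presumably pass `process.env` and `Boolean(process.stdin.isTTY)`. Keeping the decision pure makes the routing testable without a real terminal.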

Details

New exported functions in doctorow.ts:

  • extractJsonFromResponse() — robust JSON extraction from LLM responses (handles Claude wrapper, markdown blocks, embedded JSON)
  • gatherArtifacts() — collects spec.md, plan.md, tasks.md, verify.md + src/ file listing for AI context
  • evaluateCheckWithAI() — evaluates a single Doctorow check via Bun.spawn(["claude", "-p", ...])
  • runDoctorowGateHeadless() — orchestrates all 4 checks and appends [AI-evaluated] results to verify.md
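
For readers unfamiliar with layered JSON extraction, here is a rough sketch of the three fallbacks named for `extractJsonFromResponse()`. It is not the PR's implementation: `extractJson` and `tryParse` are hypothetical names, and the sketch assumes the Claude wrapper is a JSON object carrying the model text in a string field called `result`.

```typescript
// Build the fence marker programmatically so this example nests cleanly in markdown.
const FENCE = "`".repeat(3);

function tryParse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined; // signal "not valid JSON" without throwing
  }
}

// Hypothetical stand-in for extractJsonFromResponse(); returns undefined when nothing parses.
function extractJson(raw: string): unknown {
  const text = raw.trim();

  // Layer 1: the whole response is JSON, possibly a wrapper object.
  const direct = tryParse(text);
  if (direct !== undefined) {
    const wrapped = (direct as { result?: unknown } | null)?.result;
    // Assumption: the wrapper stores the model's text in a string `result` field.
    return typeof wrapped === "string" ? extractJson(wrapped) : direct;
  }

  // Layer 2: JSON inside a fenced markdown block.
  const fenced = text.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  const inner = fenced?.[1];
  if (inner !== undefined) {
    const parsed = tryParse(inner.trim());
    if (parsed !== undefined) return parsed;
  }

  // Layer 3: the outermost {...} span embedded in surrounding prose.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  return start !== -1 && end > start ? tryParse(text.slice(start, end + 1)) : undefined;
}
```

Each layer is tried only after the stricter one fails, so well-formed output is never reinterpreted by the looser heuristics.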

Modified:

  • runDoctorowGate() — added TTY detection routing before readline creation
  • formatVerifyEntry() — accepts optional evaluator tag for AI-evaluated entries
  • appendToVerifyMd() — passes evaluator tag through

Existing interactive mode is completely unchanged.

Closes #5

Test plan

  • 16 new tests for extractJsonFromResponse, gatherArtifacts, formatVerifyEntry with evaluator, and headless routing detection
  • All 36 existing doctorow tests pass unchanged
  • Binary rebuilt and installed to ~/bin/specflow

Generated with Claude Code

mellanon and others added 4 commits February 2, 2026 12:44
Previously, all four verify.md sections (Pre-Verification Checklist,
Smoke Test Results, Browser Verification, API Verification) required
substantive content, forcing users to --force bypass for CLI-only
features with no browser or API. Now sections containing "N/A",
"Not applicable", "Not required", or "CLI only" are accepted as valid.
Section headings must still exist.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Adds automatic AI evaluation of Doctorow Gate checks when running in
non-TTY environments (CI/CD, agent pipelines). Uses claude -p with
Haiku for fast, cheap evaluation. Falls back to pass-by-default on
AI failure to avoid blocking pipelines.

Closes #5

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Default to Sonnet (claude-sonnet-4-20250514) for better reasoning on
quality checks. Override via SPECFLOW_DOCTOROW_MODEL env var.

Supported models:
- claude-haiku-4-5-20251001 (fast/cheap)
- claude-sonnet-4-20250514 (balanced, default)
- claude-opus-4-5-20251101 (deep reasoning)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add --output-format json to claude -p invocation to ensure parseable
  output in environments with CLAUDE.md hooks/skills configured
- Change default model from Sonnet to Opus for deeper quality reasoning
- Model remains configurable via SPECFLOW_DOCTOROW_MODEL env var

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@jcfischer (Owner) left a comment

Council Review: PR #6 — AI-powered headless Doctorow Gate

Verdict: MERGE WITH CHANGES | Confidence: HIGH
Council: Engineer, Architect, Security, Researcher (4 agents, 1 round each)

Must-Fix (blocking merge)

1. Fail-closed as default, not fail-open

A quality gate that auto-passes on failure is not a gate. The current design:

// On any AI failure, pass by default
return { checkId: check.id, confirmed: true, ... };

This means any API outage, timeout, or malformed response silently passes the check. For a gate whose purpose is catching issues, this undermines the design.

Required change:

  • Default: fail-closed (AI failure = gate failure, human must intervene)
  • Opt-in: SPECFLOW_DOCTOROW_FAILOPEN=true for pipelines that accept the risk
  • Differentiate tags in verify.md: [AI-evaluated] vs [AI-unavailable — passed by policy]
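
One possible shape for the required change, as a sketch only. `resolveCheck` and `CheckResult` are hypothetical names, `aiVerdict` is `null` whenever the AI call failed, timed out, or returned unparseable output, and the plain `[AI-unavailable]` tag for the fail-closed case is an assumption that does not appear in the PR.

```typescript
type Env = Record<string, string | undefined>;

interface CheckResult {
  checkId: string;
  confirmed: boolean;
  tag: string; // written next to the entry in verify.md
}

// Hypothetical policy function: maps an AI verdict (or its absence) to a gate result.
function resolveCheck(checkId: string, aiVerdict: boolean | null, env: Env): CheckResult {
  if (aiVerdict !== null) {
    // The AI answered: record its verdict as-is.
    return { checkId, confirmed: aiVerdict, tag: "[AI-evaluated]" };
  }
  if (env.SPECFLOW_DOCTOROW_FAILOPEN === "true") {
    // Explicit opt-in: pass, but make the policy decision visible in verify.md.
    return { checkId, confirmed: true, tag: "[AI-unavailable — passed by policy]" };
  }
  // Default: fail closed so a human has to intervene.
  return { checkId, confirmed: false, tag: "[AI-unavailable]" }; // tag name is an assumption
}
```

With this split, an outage can never be mistaken for a real evaluation when someone reads verify.md later.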

2. Process zombie on timeout

The timeout handler calls proc.kill() but never awaits proc.exited:

setTimeout(() => {
  proc.kill();
  resolve(null);  // process may still be running
}, 30000);

In CI, accumulated zombies are a real problem. Fix:

setTimeout(async () => {
  proc.kill();
  await proc.exited;
  resolve(null);
}, 30000);
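
A slightly fuller sketch of the same idea, which also clears the timer when the process exits on its own. `waitWithTimeout` and `ChildLike` are hypothetical names. The interface only assumes a `kill()` method and an `exited` promise, which is the subset of the Bun subprocess object used in the snippets above.

```typescript
// Minimal structural type so the helper can be exercised with a fake process.
interface ChildLike {
  kill(): void;
  exited: Promise<number>;
}

// Resolves to the exit code, or null if the process had to be killed after timeoutMs.
async function waitWithTimeout(proc: ChildLike, timeoutMs: number): Promise<number | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  const outcome = await Promise.race([proc.exited, timedOut]);
  if (timer !== undefined) clearTimeout(timer); // avoid a dangling timer on normal exit

  if (outcome === "timeout") {
    proc.kill();
    await proc.exited; // wait until the child is actually gone before reporting
    return null;
  }
  return outcome;
}
```

Racing the two promises keeps the kill-and-reap sequence in ordinary async code, so a failure inside it surfaces to the caller instead of being lost in a timer callback.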

Should-Fix (recommended)

3. Default model → Sonnet, not Opus

The default model changed twice across commits (Haiku → Sonnet → Opus). At Opus pricing, 4 Doctorow checks cost ~$0.60-1.20 per completion; at Sonnet pricing, ~$0.12-0.24. The system prompt is too thin for Opus to add meaningful value over Sonnet:

"You are a code quality reviewer... Return ONLY valid JSON"

This gives the AI no framework for what "pass" means, no evaluation criteria, no evidence thresholds. Recommendation: default Sonnet, with Opus opt-in via SPECFLOW_DOCTOROW_MODEL.
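
The recommended default could be expressed as below. This is a sketch under assumptions: `resolveModel` and `DEFAULT_MODEL` are hypothetical names, the model ID is the Sonnet ID listed in the commit message, and treating an empty or whitespace-only override as unset is a choice made for this example.

```typescript
type Env = Record<string, string | undefined>;

// Sonnet ID as listed in the commit message above.
const DEFAULT_MODEL = "claude-sonnet-4-20250514";

// Hypothetical helper: picks the override if one is set, otherwise the default.
function resolveModel(env: Env): string {
  const override = env.SPECFLOW_DOCTOROW_MODEL?.trim();
  // An empty or whitespace-only value is treated as "not set".
  return override ? override : DEFAULT_MODEL;
}
```

Opus then becomes a deliberate choice, for example by setting `SPECFLOW_DOCTOROW_MODEL=claude-opus-4-5-20251101` in the pipeline environment.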

4. Enrich the system prompt

Include: what constitutes sufficient evidence for each check type, when to fail (not just when to pass), that absence of evidence is not evidence of passing.
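
As an illustration of what a richer prompt might look like, here is a hypothetical builder. The wording, the `buildSystemPrompt` name, and the `reasoning` field in the JSON shape are all assumptions. Only the `confirmed` field and the "Return ONLY valid JSON" instruction come from the PR.

```typescript
// Hypothetical prompt builder: criteria are supplied per check type by the caller.
function buildSystemPrompt(checkName: string, passCriteria: string[]): string {
  const criteria = passCriteria.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    `You are reviewing the "${checkName}" check of a quality gate.`,
    "Pass only if the artifacts contain concrete evidence for every criterion:",
    criteria,
    "Fail if any criterion lacks evidence. Absence of evidence is not evidence of passing.",
    // Response shape is an assumption apart from the `confirmed` field.
    'Return ONLY valid JSON: {"confirmed": boolean, "reasoning": string}',
  ].join("\n");
}
```

Stating the fail conditions explicitly gives the model a threshold to reason against, which matters more than the choice of model tier.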

5. Integration test with mocked Bun.spawn

The test file explicitly states "Does NOT test actual claude -p calls." At minimum, one test should mock Bun.spawn to verify command construction, timeout behavior, and fail-open/closed path.
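
One way to make this testable is to inject the spawn function, so a test can pass a fake instead of patching `Bun.spawn` globally. The sketch below is an assumption about structure, not the PR's code: `SpawnFn`, `buildClaudeCommand`, and `runCheck` are hypothetical names, and the argument order is illustrative.

```typescript
// Narrow view of a spawn function: takes argv, returns something with an exit promise.
type SpawnFn = (cmd: string[]) => { exited: Promise<number> };

// Hypothetical command builder using the flags mentioned in this PR.
function buildClaudeCommand(prompt: string, model: string): string[] {
  return ["claude", "-p", prompt, "--model", model, "--output-format", "json"];
}

async function runCheck(spawn: SpawnFn, prompt: string, model: string): Promise<number> {
  const proc = spawn(buildClaudeCommand(prompt, model));
  return proc.exited;
}

// In a test, a fake records the command instead of launching a process.
const recordedCalls: string[][] = [];
const fakeSpawn: SpawnFn = (cmd) => {
  recordedCalls.push(cmd);
  return { exited: Promise.resolve(0) };
};
```

The same fake can return a never-resolving `exited` promise to exercise the timeout path, or a non-zero exit code to exercise the fail-open and fail-closed branches.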

Nice-to-Have

  • Batch all 4 checks into a single AI call (75% cost reduction, ~4x faster)
  • Filter environment variables passed to subprocess (least privilege)
  • Double-write bug in appendToVerifyMd: content += and appendFileSync both call formatVerifyEntry, but the in-memory string is never used
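
The environment-filtering suggestion could look roughly like this allowlist sketch. `filterEnv` and the contents of `ALLOWED_KEYS` are assumptions. Which variables the `claude` CLI actually needs would have to be confirmed before adopting a list like this.

```typescript
type Env = Record<string, string | undefined>;

// Illustrative allowlist only; the real set of required variables is not stated in the PR.
const ALLOWED_KEYS: readonly string[] = ["PATH", "HOME", "ANTHROPIC_API_KEY"];

// Copies only allowlisted, defined variables into a fresh object for the subprocess.
function filterEnv(env: Env, allowed: readonly string[] = ALLOWED_KEYS): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of allowed) {
    const value = env[key];
    if (value !== undefined) out[key] = value;
  }
  return out;
}
```

The result would be passed as the `env` option when spawning the subprocess, so unrelated CI secrets never reach it.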

What's Solid

  • extractJsonFromResponse() is well-designed with layered fallback (7 tests)
  • gatherArtifacts() handles missing files gracefully
  • TTY detection routing is clean and additive (interactive mode untouched)
  • formatVerifyEntry evaluator tag is backward compatible

Development

Successfully merging this pull request may close these issues.

Feature: Headless Doctorow Gate for CI/automation environments

2 participants