
Add chat evaluation reports and evaluator implementation#387

Open
serefyarar wants to merge 1 commit into dev from test-maniac

Conversation

@serefyarar serefyarar commented Feb 3, 2026

Added new chat evaluation reports and results in the protocol directory, including markdown and JSON files for multiple scenarios. Introduced chat evaluator implementation and corresponding tests under src/lib/protocol/graphs/chat, providing the core logic for running and testing chat evaluation flows.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive Chat Agent evaluation reports documenting performance across multiple interaction scenarios, including success/failure metrics, detailed analysis, and identified issues with recommendations for improvement.
    • Added detailed conversation logs with representative interactions and evaluation narratives.
  • Chores

    • Added eval:chat script to enable running Chat Agent evaluations.

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

A comprehensive chat agent evaluation framework is introduced with scenario generation, need fulfillment assessment, simulated user interaction, and test orchestration that produces structured reports and aggregated metrics from multiple evaluation runs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Evaluation Framework**<br>`protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts` | Core framework defining the user-need, persona, and journey taxonomies; implements `ScenarioGenerator` (LLM-driven), `NeedFulfillmentEvaluator`, `SimulatedUser`, and test orchestration infrastructure (`runNeedFulfillmentTest`, `runTestSuite`) with public interfaces for external integration (sketched below). |
| **Test Harness & Spec**<br>`protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts`, `protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts` | Execution harness with mocked `ChatGraphCompositeDatabase`, embedder, scraper, and `ChatAgentInterface` adapter; includes manual scenario definitions (INTENT_EXPRESSION, DISCOVERY, COMBINED), result aggregation, and report/log generation. The spec file provides a comprehensive test suite with a stateful mock database and tool tracking via the exported `createChatAgentAdapter`. |
| **Evaluation Reports**<br>`protocol/chat-eval-report-2026-02-03T01-42-49-355Z.md`, `protocol/chat-eval-report-2026-02-03T01-56-27.md`, `protocol/chat-eval-report-2026-02-03T02-01-12.md`, `protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md`, `protocol/eval-reports/chat-eval-conversations-2026-02-03T02-06-25.md` | Markdown evaluation reports documenting executive metrics (success/partial/failure counts), per-category results, tool usage patterns, key issues, per-scenario verdicts with conversation excerpts, and recommendations for agent improvements. |
| **Evaluation Results Data**<br>`protocol/chat-eval-results-2026-02-03T01-42-49-355Z.json`, `protocol/eval-reports/chat-eval-results-2026-02-03T02-06-25.json` | Structured JSON datasets capturing 24 evaluation scenarios with metadata (category, verdict, score, tools used, duration), conversation history, and signal analysis (`successSignals`, `failureSignals`) for post-hoc analytics. |
| **Configuration**<br>`protocol/package.json` | Adds the npm script `eval:chat`, which runs the evaluator harness (`bun ./src/lib/protocol/graphs/chat/chat.evaluator.run.ts`). |
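
A rough sketch of how the evaluator's public surface fits together, based on the symbols and call sites quoted later in this review. The import path, option fields, and result fields shown are inferred from the spec excerpts below and are assumptions, not verified against the full diff:

```ts
// Sketch only: import path and exported names are assumed from the PR layout
// and the spec excerpts quoted further down in this review.
import {
  ScenarioGenerator,
  runNeedFulfillmentTest,
  runTestSuite,
  type ChatAgentInterface,
  type GeneratedScenario,
} from "./chat.evaluator";

async function exampleEvaluation(chatAgent: ChatAgentInterface): Promise<void> {
  const generator = new ScenarioGenerator();

  // Generate LLM-driven scenarios for a single user need.
  const scenarios: GeneratedScenario[] = await generator.generateScenariosForNeed("EXPRESS_WANT", 3);

  // Run one scenario and inspect the structured verdict.
  const single = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
    verbose: true,
    maxTurns: 3,
    timeoutMs: 90000,
  });
  console.log(single.evaluation.overallVerdict, single.metadata.toolsUsed);

  // Or run the whole batch and read the aggregated summary.
  const { results, summary } = await runTestSuite(scenarios, chatAgent, { verbose: false });
  console.log(`success=${summary.success} partial=${summary.partial} failure=${summary.failure}`, results.length);
}
```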

Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as Evaluation Main
    participant ScenarioGen as ScenarioGenerator
    participant ChatAgent as ChatAgentAdapter
    participant SimUser as SimulatedUser
    participant Evaluator as NeedFulfillmentEvaluator
    participant DB as Mock Database
    participant Reporter as ReportGenerator

    Main->>ScenarioGen: generateMessage(need, persona)
    ScenarioGen-->>Main: generatedMessage

    Main->>SimUser: new SimulatedUser(scenario)
    SimUser-->>Main: initialized

    Main->>ChatAgent: reset()
    ChatAgent->>DB: initialize state

    loop Conversation Turns
        Main->>SimUser: getInitialMessage()
        SimUser-->>Main: userMessage

        Main->>ChatAgent: chat(userMessage)
        ChatAgent->>DB: execute graph operations
        DB-->>ChatAgent: response + toolsUsed
        ChatAgent-->>Main: response, toolsUsed

        Main->>SimUser: respond(assistantMessage)
        SimUser-->>Main: decision to continue
    end

    Main->>Evaluator: evaluate(scenario, conversation, toolsUsed)
    Evaluator-->>Main: evaluation result (verdict, score, signals)

    Main->>Reporter: generateReport(results)
    Reporter-->>Main: markdown report

    Main->>Reporter: generateConversationsLog(results)
    Reporter-->>Main: conversation log
```
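
The diagram above corresponds roughly to a turn loop like the following. This is a minimal sketch: the local interfaces only capture what the diagram shows, and the stopping condition (a falsy response from the simulated user) is an assumption, not taken from the implementation:

```ts
// Local interfaces capture only what the sequence diagram shows; the real
// types in chat.evaluator.ts are richer.
interface SimUserLike {
  getInitialMessage(): Promise<string>;
  respond(assistantMessage: string): Promise<string | null>; // null = stop (assumed contract)
}
interface ChatAgentLike {
  chat(userMessage: string): Promise<{ response: string; toolsUsed: string[] }>;
}

type Turn = { role: "user" | "assistant"; content: string };

async function runConversation(simUser: SimUserLike, agent: ChatAgentLike, maxTurns: number): Promise<Turn[]> {
  const conversation: Turn[] = [];
  let userMessage = await simUser.getInitialMessage();

  for (let turn = 0; turn < maxTurns; turn++) {
    conversation.push({ role: "user", content: userMessage });

    const { response } = await agent.chat(userMessage);
    conversation.push({ role: "assistant", content: response });

    // The simulated user decides whether to keep the conversation going.
    const next = await simUser.respond(response);
    if (!next) break;
    userMessage = next;
  }
  return conversation;
}
```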

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

codex

Poem

🐰 A rabbit hops through scenarios so fine,
Needs, personas, journeys intertwined,
LLM-crafted conversations flow,
Evaluations bloom, what secrets they know!
Reports and metrics, all neatly bound,
Testing a chat agent, soundly and round. 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Add chat evaluation reports and evaluator implementation' accurately summarizes the main changes: adding evaluation report files and implementing the evaluator framework. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-01-12.md`:
- Around line 827-833: The recommendations list generator is producing
non-sequential numbering (1,2,4) because the code that assembles or renders the
recommendations list skips an index; locate the evaluator function responsible
for producing the recommendations block (e.g., generateRecommendations,
buildRecommendationsList, or format_recommendations) and fix the indexing logic
so it emits sequential numbers: ensure the counter is initialized once and
incremented for each recommendation, avoid using sparse keys or filtered arrays
that preserve original indices without reindexing, and update the renderer to
enumerate items by their current position rather than original IDs so item 3 is
not omitted.

In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md`:
- Around line 786-792: The numbered list under the "Recommendations" header
currently starts at 2 and skips 1 (see items beginning with "2. **Improve
discovery error handling:**" etc.); update the report formatter or the markdown
content so the recommendations are sequentially numbered beginning at 1 (rename
"2." → "1.", "3." → "2.", "4." → "3."), ensure any cross-references or IDs
linked to these recommendation items are updated accordingly, and run the
formatter that generates this section (the function/component that emits the
"Recommendations" block) to prevent future off-by-one numbering regressions.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts`:
- Around line 658-681: The generated markdown fenced code blocks are missing a
language identifier which triggers MD040; update the places that push the
opening fence (currently lines.push("```")) to use a text identifier
(lines.push("```text")) where the conversation block is built (look for the code
that inspects result.conversation and pushes "```" before and after the loop) so
both the empty-conversation branch and the loop branch emit "```text" as the
opening fence while keeping the closing fence as "```".
- Around line 734-741: The table incorrectly labels stats.toolUsage as "Times
Called" even though stats.toolUsage is built from per-scenario unique
metadata.toolsUsed; update the presentation or the metric: either rename the
column header to something like "Scenarios Used" (change the string literal in
chat.evaluator.run.ts where the header row is built) to accurately reflect the
data, or change the aggregation that populates stats.toolUsage to count every
invocation (modify the code that collects metadata.toolsUsed into
stats.toolUsage to increment per-call rather than per-scenario) and ensure the
variable name and sorting still match the rest of the logic; update any
references to stats.toolUsage accordingly.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts`:
- Around line 191-447: The tests call real LLMs (via ScenarioGenerator,
NeedFulfillmentEvaluator, NeedFulfillmentTest and ChatOpenAI) and will fail in
CI without OPENROUTER_API_KEY; update the spec to gate or mock these calls by
checking process.env.OPENROUTER_API_KEY in each top-level describe (or in a
shared beforeAll) and skip the suite when absent, or inject a mock model/agent
from createChatAgentAdapter/runNeedFulfillmentTest/runTestSuite when the env var
is missing; ensure the gating uses the unique symbols ScenarioGenerator,
runNeedFulfillmentTest, runTestSuite, createChatAgentAdapter, and ChatOpenAI so
tests either use a deterministic mock implementation in CI or are skipped when
the API key is not available.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts`:
- Around line 939-945: The parallel branch uses a shared mutable chatAgent with
runNeedFulfillmentTest which resets/mutates the agent and causes race
conditions; change the API or call site to supply a fresh agent per scenario
(e.g., accept an agentFactory() instead of chatAgent, or clone/initialize a new
agent for each scenario inside the parallel branch) and update the parallel path
to call agentFactory() for each Promise, or alternatively disable/throw when
options?.parallel is true and no factory is provided; refer to
runNeedFulfillmentTest and the local chatAgent variable to locate where to
inject the factory/clone logic or the validation that prevents unsafe parallel
execution.
🧹 Nitpick comments (4)
protocol/chat-eval-results-2026-02-03T01-42-49-355Z.json (1)

1-909: Inconsistent file location for evaluation results.

This JSON results file is placed in protocol/ root, while other evaluation artifacts (like chat-eval-results-2026-02-03T02-06-25.json) are in protocol/eval-reports/. Consider moving this file to protocol/eval-reports/ for consistent organization.

Additionally, consider whether these timestamped evaluation result files should be committed to the repository at all. If they are generated artifacts from running the evaluator, they may be better suited for:

  • A .gitignore entry (if they're local development artifacts)
  • CI artifact storage (if they're needed for historical tracking)
protocol/eval-reports/chat-eval-conversations-2026-02-03T02-06-25.md (1)

29-40: Consider adding language specifier to fenced code blocks.

The conversation transcript code blocks lack a language specifier, triggering markdownlint warnings. While the content renders correctly, adding text or plaintext as the language would satisfy linters and improve consistency.

Example change:

```diff
-```
+```text
 👤 USER:
 I'm looking to hire ML engineers for my startup
 ...
```

This pattern applies to all 24 conversation blocks throughout the file.

protocol/chat-eval-report-2026-02-03T01-42-49-355Z.md (1)

1-804: Inconsistent file location for evaluation report.

This report file is located in protocol/ root, while similar reports (e.g., chat-eval-report-2026-02-03T02-06-25.md) are in protocol/eval-reports/. Move this file to protocol/eval-reports/ for consistent organization.

protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts (1)

27-94: Align mock intent shape with origin fields and tighten typing.
The mock state relies heavily on `any` casts and `as unknown as`, which weakens strict mode, and the created intent lacks origin-tracking fields. This can mask schema-dependent behavior in tests and harness runs. Consider a typed mock state (a sketch follows the suggested adjustment below) and include sourceType/sourceId in created intents.

🔧 Suggested adjustment for intent origin fields
```diff
       const intent = {
         id: `intent-${Date.now()}-${Math.random().toString(36).slice(2)}`,
         payload: data.payload,
         summary: data.payload.slice(0, 100),
         userId: data.userId,
         createdAt: new Date(),
         updatedAt: new Date(),
         deletedAt: null,
         isIncognito: false,
+        sourceType: data.sourceType ?? "test",
+        sourceId: data.sourceId ?? data.userId,
       };
```

As per coding guidelines, use strict TypeScript mode for all code and intents must track their origin via polymorphic sourceType and sourceId fields.
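
For the typed-mock half of this nitpick, a minimal sketch of a strictly typed intent record. The field names mirror the mock above; the `SourceType` union is an assumption, and only `sourceType`/`sourceId` come from the stated coding guideline:

```ts
// Illustrative only: a typed shape for the mock intent so `any` casts can be dropped.
// The SourceType union is an assumption; sourceType/sourceId mirror the
// polymorphic origin-tracking fields mentioned in the coding guidelines.
type SourceType = "chat" | "test" | string;

interface MockIntent {
  id: string;
  payload: string;
  summary: string;
  userId: string;
  createdAt: Date;
  updatedAt: Date;
  deletedAt: Date | null;
  isIncognito: boolean;
  sourceType: SourceType;
  sourceId: string;
}
```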

Comment on lines +827 to +833
```markdown
## Recommendations

1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

⚠️ Potential issue | 🟡 Minor

Recommendations numbering skips item 3.

The recommendations are numbered 1, 2, 4 - missing item 3. This is likely a bug in the report generation logic that should be fixed in the evaluator code.

Proposed fix
```diff
 ## Recommendations

 1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

 2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
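
If the numbering is produced in code rather than hand-written, the usual fix is to number items by their position in the final array rather than by any original ID. A small sketch; the function and variable names here are hypothetical, not the evaluator's actual ones:

```ts
// Hypothetical sketch: enumerate recommendations by position after filtering,
// so dropped items can never leave gaps in the numbering.
function formatRecommendations(recommendations: string[]): string {
  return recommendations
    .filter((rec) => rec.trim().length > 0)      // drop empty entries first
    .map((rec, index) => `${index + 1}. ${rec}`) // then number sequentially
    .join("\n\n");
}
```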

Comment on lines +786 to +792
```markdown
## Recommendations

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

⚠️ Potential issue | 🟡 Minor

Recommendations numbering is incorrect.

The recommendations section starts at item 2 and skips item 1. This appears to be a generation bug in the report formatter.

Proposed fix
```diff
 ## Recommendations

-2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).
+1. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.
+2. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

Comment on lines +658 to +681
lines.push("**Conversation:**");
lines.push("");
lines.push("```");

if (result.conversation.length === 0) {
lines.push("(No conversation recorded)");
} else {
for (const turn of result.conversation) {
const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
lines.push(`${prefix}:`);
// Wrap long lines
const content = turn.content.split("\n").map(line => {
if (line.length > 100) {
return line.match(/.{1,100}/g)?.join("\n ") || line;
}
return line;
}).join("\n");
lines.push(content);
lines.push("");
}
}

lines.push("```");
lines.push("");

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.
Generated reports/logs emit fenced blocks without a language, triggering MD040 warnings. Consider ```text for these snippets.

🔧 Suggested change
```diff
-  lines.push("```");
+  lines.push("```text");
   if (result.conversation.length === 0) {
     lines.push("(No conversation recorded)");
   } else {
     for (const turn of result.conversation) {
       const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
       lines.push(`${prefix}:`);
       ...
     }
   }
-  lines.push("```");
+  lines.push("```");
-        lines.push("```");
+        lines.push("```text");
         for (const turn of result.conversation) {
           const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
           lines.push(`${prefix}:`);
           lines.push(turn.content);
           lines.push("");
         }
-        lines.push("```");
+        lines.push("```");
```

Comment on lines +734 to +741
```ts
  // Tool Usage
  lines.push("## Tool Usage Patterns");
  lines.push("");
  lines.push("| Tool | Times Called |");
  lines.push("|------|--------------|");
  for (const [tool, count] of Object.entries(stats.toolUsage).sort((a, b) => b[1] - a[1])) {
    lines.push(`| ${tool} | ${count} |`);
  }
```

⚠️ Potential issue | 🟡 Minor

Tool usage is deduped per scenario but labeled as “Times Called.”
toolUsage counts unique tools per scenario (from metadata.toolsUsed), not actual call counts. Either track raw call counts or rename the column to avoid misleading metrics.

📝 Minimal label fix
```diff
-  lines.push("| Tool | Times Called |");
-  lines.push("|------|--------------|");
+  lines.push("| Tool | Scenarios Used |");
+  lines.push("|------|----------------|");
```
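
If the "Times Called" label is kept instead, the aggregation would need to count every invocation rather than unique tools per scenario. A sketch under the assumption that results expose a per-call list; `toolCalls` is hypothetical, since the PR only shows the deduplicated `toolsUsed` array:

```ts
// Hypothetical alternative: count every tool invocation across all scenarios.
// `toolCalls` is an assumed per-call list; `toolsUsed` (unique per scenario)
// is the field shown elsewhere in this PR.
function countToolUsage(
  results: { metadata: { toolsUsed: string[]; toolCalls?: string[] } }[],
): Record<string, number> {
  const toolUsage: Record<string, number> = {};
  for (const result of results) {
    // Fall back to the deduplicated list when no per-call data exists.
    for (const tool of result.metadata.toolCalls ?? result.metadata.toolsUsed) {
      toolUsage[tool] = (toolUsage[tool] ?? 0) + 1;
    }
  }
  return toolUsage;
}
```

With `toolsUsed` this counts scenarios that used each tool; with a raw per-call list it genuinely counts calls, matching the existing header.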

Comment on lines +191 to +447
describe("Scenario Generation", () => {
it("should generate diverse messages for EXPRESS_WANT need", async () => {
const generator = new ScenarioGenerator();
const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 3);

expect(scenarios.length).toBe(3);

// Each scenario should have different persona
const personas = new Set(scenarios.map((s) => s.persona.id));
expect(personas.size).toBeGreaterThan(1);

// Messages should be non-empty and different
const messages = scenarios.map((s) => s.generatedMessage);
expect(messages.every((m) => m.length > 5)).toBe(true);

console.log("\nGenerated EXPRESS_WANT scenarios:");
for (const s of scenarios) {
console.log(` [${s.persona.id}]: "${s.generatedMessage}"`);
}
}, 60000);

it("should generate journey scenarios with context progression", async () => {
const generator = new ScenarioGenerator();
const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER");

expect(scenarios.length).toBe(3); // ESTABLISH_PRESENCE, EXPRESS_WANT, FIND_PEOPLE

// Context should evolve
expect(scenarios[0].context.hasProfile).toBe(false);
expect(scenarios[1].context.hasProfile).toBe(true); // After ESTABLISH_PRESENCE

console.log("\nGenerated ONBOARDING_FLOW journey:");
for (const s of scenarios) {
console.log(` [${s.need.id}]: "${s.generatedMessage}"`);
}
}, 90000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// SINGLE NEED FULFILLMENT TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - Single Needs", () => {
let database: ChatGraphCompositeDatabase;
let chatAgent: ChatAgentInterface;
let generator: ScenarioGenerator;

beforeAll(() => {
database = createStatefulMockDatabase();
chatAgent = createChatAgentAdapter(database);
generator = new ScenarioGenerator();
});

it("should fulfill EXPRESS_WANT need", async () => {
const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log("\n=== EXPRESS_WANT Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Score: ${result.evaluation.fulfillmentScore}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);
console.log(`Tools: ${result.metadata.toolsUsed.join(", ")}`);

// We expect success or partial - the agent should at least try
expect(["success", "partial"]).toContain(result.evaluation.overallVerdict);
}, 120000);

it("should fulfill FIND_PEOPLE need", async () => {
const scenarios = await generator.generateScenariosForNeed("FIND_PEOPLE", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log("\n=== FIND_PEOPLE Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);

// Agent should use discovery tool
expect(result.metadata.toolsUsed.length).toBeGreaterThan(0);
}, 120000);

it("should handle UNDERSTAND_SYSTEM need without tools", async () => {
const scenarios = await generator.generateScenariosForNeed("UNDERSTAND_SYSTEM", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});

console.log("\n=== UNDERSTAND_SYSTEM Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);

// Should have a conversation at minimum
expect(result.conversation.length).toBeGreaterThan(0);
}, 90000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// PERSONA VARIATION TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - Persona Variations", () => {
let database: ChatGraphCompositeDatabase;
let chatAgent: ChatAgentInterface;
let generator: ScenarioGenerator;

beforeAll(() => {
database = createStatefulMockDatabase();
chatAgent = createChatAgentAdapter(database);
generator = new ScenarioGenerator();
});

const testPersonas: UserPersonaId[] = ["BUSY_FOUNDER", "VAGUE_USER", "NON_NATIVE_SPEAKER"];

for (const personaId of testPersonas) {
it(`should handle ${personaId} persona for EXPRESS_WANT`, async () => {
const need = USER_NEEDS.EXPRESS_WANT;
const persona = USER_PERSONAS[personaId];

const scenario: GeneratedScenario = {
id: `test-${personaId}`,
need,
persona,
context: { hasProfile: true, hasIntents: false, isIndexOwner: false },
generatedMessage: await generator.generateMessage(need, persona, {
hasProfile: true,
hasIntents: false,
isIndexOwner: false,
}),
evaluationCriteria: {
needFulfilled: need.description,
successSignals: need.successSignals,
failureSignals: need.failureSignals,
qualityFactors: ["Adapted to user's communication style"],
},
};

const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log(`\n=== ${personaId} Result ===`);
console.log(`Message: "${scenario.generatedMessage}"`);
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Quality: ${result.evaluation.qualityScore}`);

// Agent should handle all personas - conversation should happen
expect(result.conversation.length).toBeGreaterThan(0);
}, 120000);
}
});

// ═══════════════════════════════════════════════════════════════════════════════
// JOURNEY TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - User Journeys", () => {
let generator: ScenarioGenerator;

beforeAll(() => {
generator = new ScenarioGenerator();
});

it("should complete ONBOARDING_FLOW journey", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);

const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER");

console.log("\n=== ONBOARDING_FLOW Journey ===");

const results = [];
for (const scenario of scenarios) {
// Don't reset between journey steps - maintain context
const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});
results.push(result);
}

// At least some conversations should happen
const withConversation = results.filter((r) => r.conversation.length > 0).length;
console.log(`\nJourney conversations: ${withConversation}/${results.length}`);

expect(withConversation).toBeGreaterThanOrEqual(2);
}, 240000);

it("should complete INTENT_LIFECYCLE journey", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);

const scenarios = await generator.generateJourneyScenario("INTENT_LIFECYCLE", "POWER_USER");

console.log("\n=== INTENT_LIFECYCLE Journey ===");

const results = [];
for (const scenario of scenarios) {
const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});
results.push(result);
}

const withConversation = results.filter((r) => r.conversation.length > 0).length;
console.log(`\nJourney conversations: ${withConversation}/${results.length}`);

expect(withConversation).toBeGreaterThanOrEqual(2);
}, 240000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// COMPREHENSIVE TEST SUITE
// ═══════════════════════════════════════════════════════════════════════════════

describe("Comprehensive Test Suite", () => {
it("should run full test suite across all needs", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);
const generator = new ScenarioGenerator();

// Only test 2 needs for speed
const testNeeds: UserNeedId[] = ["EXPRESS_WANT", "UNDERSTAND_SYSTEM"];

const allScenarios: GeneratedScenario[] = [];
for (const needId of testNeeds) {
const scenarios = await generator.generateScenariosForNeed(needId, 1);
allScenarios.push(...scenarios);
}

console.log(`\nRunning ${allScenarios.length} scenarios...`);

const { results, summary } = await runTestSuite(allScenarios, chatAgent, { verbose: false });

console.log("\n=== TEST SUITE SUMMARY ===");
console.log(`Total: ${summary.total}`);
console.log(`Success: ${summary.success}`);
console.log(`Partial: ${summary.partial}`);
console.log(`Failure: ${summary.failure}`);
console.log(`Blocked: ${summary.blocked}`);

// At least some conversations should complete
expect(results.some((r) => r.conversation.length > 0)).toBe(true);
}, 300000);
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: indexnetwork/index; command outputs collapsed):

```sh
# Find ScenarioGenerator and NeedFulfillmentEvaluator implementations
fd -t f "\.ts$" | xargs rg "class ScenarioGenerator|class NeedFulfillmentEvaluator" -l

# Look for the spec file and related test utilities
fd -t f "chat\.evaluator\.spec\.ts"

# Check imports in the test file to understand dependencies
head -50 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Search for createStatefulMockDatabase and related mock utilities
rg "createStatefulMockDatabase|createChatAgentAdapter" -l

# Check CI configuration
fd -t f "\.(yml|yaml|json)$" | xargs rg "OPENROUTER_API_KEY" -l 2>/dev/null || echo "No CI files found with OPENROUTER_API_KEY"

# Get line count of chat.evaluator.ts to understand its size
wc -l protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check ScenarioGenerator class definition and key methods
rg "class ScenarioGenerator|async generateScenariosForNeed|async generateMessage" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 10

# Look for LLM-related imports (OpenRouter, ChatOpenAI, etc.) in chat.evaluator.ts
rg "import.*openrouter|import.*ChatOpenAI|import.*LLM|from.*langchain" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check for environment variable usage in chat.evaluator.ts
rg "OPENROUTER_API_KEY|process\.env\." protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check if there's any describe.skip or test gating in the spec file
rg "describe\.skip|it\.skip|process\.env\.|hasOpenRouterKey" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Look for any mock implementations or jest mock setup
rg "jest\.mock|vi\.mock|mock|Mock" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check for .env.development or other env handling at start of spec
head -20 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Verify NeedFulfillmentEvaluator also uses LLM
rg "class NeedFulfillmentEvaluator" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 15

# Check git history or comments about test intent (integration vs unit)
rg "integration|unit test|TODO|FIXME" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts -B 2

# Check if there are any .env files in the repo
find . -name ".env*" -type f 2>/dev/null | head -20

# Look for CI configuration files (GitHub Actions, etc.)
fd -t f "(\.github|\.gitlab-ci|\.circleci|jenkins|travis)" | head -20

# Check if GitHub Actions workflow references these test files
find .github -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null | rg "chat\.evaluator|OPENROUTER" || echo "No GitHub Actions workflows found"

# Verify that env loading happens at top level and will fail if key missing
sed -n '1,20p' protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check how ChatOpenAI is instantiated - will it throw if apiKey is undefined?
rg "new ChatOpenAI" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -B 2 -A 8

# Verify if tests will fail when OPENROUTER_API_KEY is missing
rg "generateMessage|generateScenariosForNeed" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 5 | head -40
```

Gate or mock live LLM calls in tests to avoid CI failures.

These tests make direct external LLM calls via ScenarioGenerator, NeedFulfillmentEvaluator, and NeedFulfillmentTest, which instantiate ChatOpenAI with process.env.OPENROUTER_API_KEY. Without this environment variable, the tests will fail immediately. Since there's no CI configuration providing this key and no gating mechanism, these tests will block CI runs.

Gate these tests to run only when the API key is present, or inject a mock model for CI environments:

🔧 Example gating pattern (apply to all describe blocks)
```diff
+const hasOpenRouterKey = Boolean(process.env.OPENROUTER_API_KEY);
+const describeLLM = hasOpenRouterKey ? describe : describe.skip;

-describe("Scenario Generation", () => {
+describeLLM("Scenario Generation", () => {
   it("should generate diverse messages for EXPRESS_WANT need", async () => {
     ...
   });
 });
```
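
The other option mentioned above, injecting a deterministic mock agent for CI, could look roughly like this. A sketch only: the adapter surface (chat/reset returning a response plus toolsUsed) is inferred from the spec and sequence diagram, not copied from the PR:

```ts
// Sketch of a deterministic agent for CI runs without OPENROUTER_API_KEY.
// The shape shown here is an assumption about ChatAgentInterface.
const mockChatAgent = {
  async chat(userMessage: string) {
    // Canned behavior: pretend every message is captured as an intent.
    return {
      response: `Noted: "${userMessage}". I've recorded this as an intent.`,
      toolsUsed: ["create_intent"],
    };
  },
  async reset() {
    // Nothing to clear in the canned mock.
  },
};

// In the spec, choose the real adapter only when the key is present:
// const chatAgent = process.env.OPENROUTER_API_KEY
//   ? createChatAgentAdapter(createStatefulMockDatabase())
//   : mockChatAgent;
```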

Comment on lines +939 to +945
```ts
  if (options?.parallel) {
    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
    results.push(...(await Promise.all(promises)));
  } else {
    for (const scenario of scenarios) {
      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
    }
```

⚠️ Potential issue | 🟠 Major

Parallel runTestSuite shares a mutable chatAgent and can corrupt results.
runNeedFulfillmentTest resets and mutates the shared agent; running it in Promise.all races state across scenarios. Consider requiring a factory that returns a fresh agent per scenario, or disable parallel mode.

🧯 Minimal safety guard
```diff
-  if (options?.parallel) {
-    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
-    results.push(...(await Promise.all(promises)));
-  } else {
-    for (const scenario of scenarios) {
-      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
-    }
-  }
+  if (options?.parallel) {
+    throw new Error("Parallel execution requires a dedicated ChatAgent per scenario");
+  }
+  for (const scenario of scenarios) {
+    results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
+  }
```
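
The factory-based alternative suggested in the comment could look roughly like this. A sketch under the assumption that `GeneratedScenario`, `ChatAgentInterface`, and `runNeedFulfillmentTest` from the PR are in scope; the `agentFactory` parameter and the helper's name are hypothetical:

```ts
// Hypothetical sketch: give each parallel scenario its own agent via a factory,
// so runNeedFulfillmentTest never resets or mutates shared state across Promises.
async function runTestSuiteParallel(
  scenarios: GeneratedScenario[],
  agentFactory: () => ChatAgentInterface,
  testOptions?: { verbose?: boolean; maxTurns?: number; timeoutMs?: number },
) {
  return Promise.all(
    scenarios.map((scenario) =>
      // A fresh agent per scenario removes the shared-mutable-state race.
      runNeedFulfillmentTest(scenario, agentFactory(), testOptions),
    ),
  );
}

// Example call site: the factory builds an isolated mock DB + adapter per run.
// const results = await runTestSuiteParallel(scenarios, () =>
//   createChatAgentAdapter(createStatefulMockDatabase()),
// );
```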
