
Add chat evaluation reports and evaluator implementation#387

Open
serefyarar wants to merge 1 commit into dev from test-maniac

Conversation

@serefyarar serefyarar commented Feb 3, 2026

Added new chat evaluation reports and results in the protocol directory, including markdown and JSON files for multiple scenarios. Introduced chat evaluator implementation and corresponding tests under src/lib/protocol/graphs/chat, providing the core logic for running and testing chat evaluation flows.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive Chat Agent evaluation reports documenting performance across multiple interaction scenarios, including success/failure metrics, detailed analysis, and identified issues with recommendations for improvement.
    • Added detailed conversation logs with representative interactions and evaluation narratives.
  • Chores

    • Added eval:chat script to enable running Chat Agent evaluations.

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

A comprehensive chat agent evaluation framework is introduced with scenario generation, need fulfillment assessment, simulated user interaction, and test orchestration that produces structured reports and aggregated metrics from multiple evaluation runs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Evaluation Framework**<br>`protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts` | Core framework defining the user-need, persona, and journey taxonomies; implements `ScenarioGenerator` (LLM-driven), `NeedFulfillmentEvaluator`, `SimulatedUser`, and test orchestration infrastructure (`runNeedFulfillmentTest`, `runTestSuite`) with public interfaces for external integration (sketched below). |
| **Test Harness & Spec**<br>`protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts`, `protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts` | Execution harness with mocked `ChatGraphCompositeDatabase`, embedder, scraper, and `ChatAgentInterface` adapter; includes manual scenario definitions (INTENT_EXPRESSION, DISCOVERY, COMBINED), result aggregation, and report/log generation. The spec file provides a comprehensive test suite with a stateful mock database and tool tracking via the exported `createChatAgentAdapter`. |
| **Evaluation Reports**<br>`protocol/chat-eval-report-2026-02-03T01-42-49-355Z.md`, `protocol/chat-eval-report-2026-02-03T01-56-27.md`, `protocol/chat-eval-report-2026-02-03T02-01-12.md`, `protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md`, `protocol/eval-reports/chat-eval-conversations-2026-02-03T02-06-25.md` | Markdown evaluation reports documenting executive metrics (success/partial/failure counts), per-category results, tool usage patterns, key issues, per-scenario verdicts with conversation excerpts, and recommendations for agent improvements. |
| **Evaluation Results Data**<br>`protocol/chat-eval-results-2026-02-03T01-42-49-355Z.json`, `protocol/eval-reports/chat-eval-results-2026-02-03T02-06-25.json` | Structured JSON datasets capturing 24 evaluation scenarios with metadata (category, verdict, score, tools used, duration), conversation history, and signal analysis (`successSignals`, `failureSignals`) for post-hoc analytics. |
| **Configuration**<br>`protocol/package.json` | Adds the npm script `eval:chat`, which runs the evaluator harness (`bun ./src/lib/protocol/graphs/chat/chat.evaluator.run.ts`). |
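
A rough sketch of how the evaluator's public surface fits together, based on the symbols and call sites quoted later in this review. The import path, option fields, and result fields shown are inferred from the spec excerpts below and are assumptions, not verified against the full diff:

```ts
// Sketch only: import path and exported names are assumed from the PR layout
// and the spec excerpts quoted further down in this review.
import {
  ScenarioGenerator,
  runNeedFulfillmentTest,
  runTestSuite,
  type ChatAgentInterface,
  type GeneratedScenario,
} from "./chat.evaluator";

async function exampleEvaluation(chatAgent: ChatAgentInterface): Promise<void> {
  const generator = new ScenarioGenerator();

  // Generate LLM-driven scenarios for a single user need.
  const scenarios: GeneratedScenario[] = await generator.generateScenariosForNeed("EXPRESS_WANT", 3);

  // Run one scenario and inspect the structured verdict.
  const single = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
    verbose: true,
    maxTurns: 3,
    timeoutMs: 90000,
  });
  console.log(single.evaluation.overallVerdict, single.metadata.toolsUsed);

  // Or run the whole batch and read the aggregated summary.
  const { results, summary } = await runTestSuite(scenarios, chatAgent, { verbose: false });
  console.log(`success=${summary.success} partial=${summary.partial} failure=${summary.failure}`, results.length);
}
```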

Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as Evaluation Main
    participant ScenarioGen as ScenarioGenerator
    participant ChatAgent as ChatAgentAdapter
    participant SimUser as SimulatedUser
    participant Evaluator as NeedFulfillmentEvaluator
    participant DB as Mock Database
    participant Reporter as ReportGenerator

    Main->>ScenarioGen: generateMessage(need, persona)
    ScenarioGen-->>Main: generatedMessage

    Main->>SimUser: new SimulatedUser(scenario)
    SimUser-->>Main: initialized

    Main->>ChatAgent: reset()
    ChatAgent->>DB: initialize state

    loop Conversation Turns
        Main->>SimUser: getInitialMessage()
        SimUser-->>Main: userMessage

        Main->>ChatAgent: chat(userMessage)
        ChatAgent->>DB: execute graph operations
        DB-->>ChatAgent: response + toolsUsed
        ChatAgent-->>Main: response, toolsUsed

        Main->>SimUser: respond(assistantMessage)
        SimUser-->>Main: decision to continue
    end

    Main->>Evaluator: evaluate(scenario, conversation, toolsUsed)
    Evaluator-->>Main: evaluation result (verdict, score, signals)

    Main->>Reporter: generateReport(results)
    Reporter-->>Main: markdown report

    Main->>Reporter: generateConversationsLog(results)
    Reporter-->>Main: conversation log
```
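
The diagram above corresponds roughly to a turn loop like the following. This is a minimal sketch: the local interfaces only capture what the diagram shows, and the stopping condition (a falsy response from the simulated user) is an assumption, not taken from the implementation:

```ts
// Local interfaces capture only what the sequence diagram shows; the real
// types in chat.evaluator.ts are richer.
interface SimUserLike {
  getInitialMessage(): Promise<string>;
  respond(assistantMessage: string): Promise<string | null>; // null = stop (assumed contract)
}
interface ChatAgentLike {
  chat(userMessage: string): Promise<{ response: string; toolsUsed: string[] }>;
}

type Turn = { role: "user" | "assistant"; content: string };

async function runConversation(simUser: SimUserLike, agent: ChatAgentLike, maxTurns: number): Promise<Turn[]> {
  const conversation: Turn[] = [];
  let userMessage = await simUser.getInitialMessage();

  for (let turn = 0; turn < maxTurns; turn++) {
    conversation.push({ role: "user", content: userMessage });

    const { response } = await agent.chat(userMessage);
    conversation.push({ role: "assistant", content: response });

    // The simulated user decides whether to keep the conversation going.
    const next = await simUser.respond(response);
    if (!next) break;
    userMessage = next;
  }
  return conversation;
}
```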

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

codex

Poem

🐰 A rabbit hops through scenarios so fine,
Needs, personas, journeys intertwined,
LLM-crafted conversations flow,
Evaluations bloom, what secrets they know!
Reports and metrics, all neatly bound,
Testing a chat agent, soundly and round. 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Add chat evaluation reports and evaluator implementation' accurately summarizes the main changes: adding evaluation report files and implementing the evaluator framework. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-01-12.md`:
- Around line 827-833: The recommendations list generator is producing
non-sequential numbering (1,2,4) because the code that assembles or renders the
recommendations list skips an index; locate the evaluator function responsible
for producing the recommendations block (e.g., generateRecommendations,
buildRecommendationsList, or format_recommendations) and fix the indexing logic
so it emits sequential numbers: ensure the counter is initialized once and
incremented for each recommendation, avoid using sparse keys or filtered arrays
that preserve original indices without reindexing, and update the renderer to
enumerate items by their current position rather than original IDs so item 3 is
not omitted.

In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md`:
- Around line 786-792: The numbered list under the "Recommendations" header
currently starts at 2 and skips 1 (see items beginning with "2. **Improve
discovery error handling:**" etc.); update the report formatter or the markdown
content so the recommendations are sequentially numbered beginning at 1 (rename
"2." → "1.", "3." → "2.", "4." → "3."), ensure any cross-references or IDs
linked to these recommendation items are updated accordingly, and run the
formatter that generates this section (the function/component that emits the
"Recommendations" block) to prevent future off-by-one numbering regressions.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts`:
- Around line 658-681: The generated markdown fenced code blocks are missing a
language identifier which triggers MD040; update the places that push the
opening fence (currently lines.push("```")) to use a text identifier
(lines.push("```text")) where the conversation block is built (look for the code
that inspects result.conversation and pushes "```" before and after the loop) so
both the empty-conversation branch and the loop branch emit "```text" as the
opening fence while keeping the closing fence as "```".
- Around line 734-741: The table incorrectly labels stats.toolUsage as "Times
Called" even though stats.toolUsage is built from per-scenario unique
metadata.toolsUsed; update the presentation or the metric: either rename the
column header to something like "Scenarios Used" (change the string literal in
chat.evaluator.run.ts where the header row is built) to accurately reflect the
data, or change the aggregation that populates stats.toolUsage to count every
invocation (modify the code that collects metadata.toolsUsed into
stats.toolUsage to increment per-call rather than per-scenario) and ensure the
variable name and sorting still match the rest of the logic; update any
references to stats.toolUsage accordingly.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts`:
- Around line 191-447: The tests call real LLMs (via ScenarioGenerator,
NeedFulfillmentEvaluator, NeedFulfillmentTest and ChatOpenAI) and will fail in
CI without OPENROUTER_API_KEY; update the spec to gate or mock these calls by
checking process.env.OPENROUTER_API_KEY in each top-level describe (or in a
shared beforeAll) and skip the suite when absent, or inject a mock model/agent
from createChatAgentAdapter/runNeedFulfillmentTest/runTestSuite when the env var
is missing; ensure the gating uses the unique symbols ScenarioGenerator,
runNeedFulfillmentTest, runTestSuite, createChatAgentAdapter, and ChatOpenAI so
tests either use a deterministic mock implementation in CI or are skipped when
the API key is not available.

In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts`:
- Around line 939-945: The parallel branch uses a shared mutable chatAgent with
runNeedFulfillmentTest which resets/mutates the agent and causes race
conditions; change the API or call site to supply a fresh agent per scenario
(e.g., accept an agentFactory() instead of chatAgent, or clone/initialize a new
agent for each scenario inside the parallel branch) and update the parallel path
to call agentFactory() for each Promise, or alternatively disable/throw when
options?.parallel is true and no factory is provided; refer to
runNeedFulfillmentTest and the local chatAgent variable to locate where to
inject the factory/clone logic or the validation that prevents unsafe parallel
execution.
🧹 Nitpick comments (4)
protocol/chat-eval-results-2026-02-03T01-42-49-355Z.json (1)

1-909: Inconsistent file location for evaluation results.

This JSON results file is placed in protocol/ root, while other evaluation artifacts (like chat-eval-results-2026-02-03T02-06-25.json) are in protocol/eval-reports/. Consider moving this file to protocol/eval-reports/ for consistent organization.

Additionally, consider whether these timestamped evaluation result files should be committed to the repository at all. If they are generated artifacts from running the evaluator, they may be better suited for:

  • A .gitignore entry (if they're local development artifacts)
  • CI artifact storage (if they're needed for historical tracking)
protocol/eval-reports/chat-eval-conversations-2026-02-03T02-06-25.md (1)

29-40: Consider adding language specifier to fenced code blocks.

The conversation transcript code blocks lack a language specifier, triggering markdownlint warnings. While the content renders correctly, adding text or plaintext as the language would satisfy linters and improve consistency.

Example change:

```diff
-```
+```text
 👤 USER:
 I'm looking to hire ML engineers for my startup
 ...
```

This pattern applies to all 24 conversation blocks throughout the file.

protocol/chat-eval-report-2026-02-03T01-42-49-355Z.md (1)

1-804: Inconsistent file location for evaluation report.

This report file is located in protocol/ root, while similar reports (e.g., chat-eval-report-2026-02-03T02-06-25.md) are in protocol/eval-reports/. Move this file to protocol/eval-reports/ for consistent organization.

protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts (1)

27-94: Align mock intent shape with origin fields and tighten typing.
The mock state relies heavily on `any` casts and `as unknown as`, which weakens strict mode, and the created intent lacks origin-tracking fields. This can mask schema-dependent behavior in tests and harness runs. Consider a typed mock state (a sketch follows the suggested adjustment below) and include sourceType/sourceId in created intents.

🔧 Suggested adjustment for intent origin fields
```diff
       const intent = {
         id: `intent-${Date.now()}-${Math.random().toString(36).slice(2)}`,
         payload: data.payload,
         summary: data.payload.slice(0, 100),
         userId: data.userId,
         createdAt: new Date(),
         updatedAt: new Date(),
         deletedAt: null,
         isIncognito: false,
+        sourceType: data.sourceType ?? "test",
+        sourceId: data.sourceId ?? data.userId,
       };
```

As per coding guidelines, use strict TypeScript mode for all code and intents must track their origin via polymorphic sourceType and sourceId fields.
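
For the typed-mock half of this nitpick, a minimal sketch of a strictly typed intent record. The field names mirror the mock above; the `SourceType` union is an assumption, and only `sourceType`/`sourceId` come from the stated coding guideline:

```ts
// Illustrative only: a typed shape for the mock intent so `any` casts can be dropped.
// The SourceType union is an assumption; sourceType/sourceId mirror the
// polymorphic origin-tracking fields mentioned in the coding guidelines.
type SourceType = "chat" | "test" | string;

interface MockIntent {
  id: string;
  payload: string;
  summary: string;
  userId: string;
  createdAt: Date;
  updatedAt: Date;
  deletedAt: Date | null;
  isIncognito: boolean;
  sourceType: SourceType;
  sourceId: string;
}
```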

Comment on lines +827 to +833
```markdown
## Recommendations

1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

⚠️ Potential issue | 🟡 Minor

Recommendations numbering skips item 3.

The recommendations are numbered 1, 2, 4 - missing item 3. This is likely a bug in the report generation logic that should be fixed in the evaluator code.

Proposed fix
```diff
 ## Recommendations

 1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

 2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
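
If the numbering is produced in code rather than hand-written, the usual fix is to number items by their position in the final array rather than by any original ID. A small sketch; the function and variable names here are hypothetical, not the evaluator's actual ones:

```ts
// Hypothetical sketch: enumerate recommendations by position after filtering,
// so dropped items can never leave gaps in the numbering.
function formatRecommendations(recommendations: string[]): string {
  return recommendations
    .filter((rec) => rec.trim().length > 0)      // drop empty entries first
    .map((rec, index) => `${index + 1}. ${rec}`) // then number sequentially
    .join("\n\n");
}
```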

Comment on lines +786 to +792
```markdown
## Recommendations

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

⚠️ Potential issue | 🟡 Minor

Recommendations numbering is incorrect.

The recommendations section starts at item 2 and skips item 1. This appears to be a generation bug in the report formatter.

Proposed fix
```diff
 ## Recommendations

-2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).
+1. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.
+2. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

Comment on lines +658 to +681
lines.push("**Conversation:**");
lines.push("");
lines.push("```");

if (result.conversation.length === 0) {
lines.push("(No conversation recorded)");
} else {
for (const turn of result.conversation) {
const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
lines.push(`${prefix}:`);
// Wrap long lines
const content = turn.content.split("\n").map(line => {
if (line.length > 100) {
return line.match(/.{1,100}/g)?.join("\n ") || line;
}
return line;
}).join("\n");
lines.push(content);
lines.push("");
}
}

lines.push("```");
lines.push("");

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.
Generated reports/logs emit fenced blocks without a language, triggering MD040 warnings. Consider ```text for these snippets.

🔧 Suggested change
```diff
-  lines.push("```");
+  lines.push("```text");
   if (result.conversation.length === 0) {
     lines.push("(No conversation recorded)");
   } else {
     for (const turn of result.conversation) {
       const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
       lines.push(`${prefix}:`);
       ...
     }
   }
-  lines.push("```");
+  lines.push("```");
-        lines.push("```");
+        lines.push("```text");
         for (const turn of result.conversation) {
           const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
           lines.push(`${prefix}:`);
           lines.push(turn.content);
           lines.push("");
         }
-        lines.push("```");
+        lines.push("```");
```

Comment on lines +734 to +741
```ts
  // Tool Usage
  lines.push("## Tool Usage Patterns");
  lines.push("");
  lines.push("| Tool | Times Called |");
  lines.push("|------|--------------|");
  for (const [tool, count] of Object.entries(stats.toolUsage).sort((a, b) => b[1] - a[1])) {
    lines.push(`| ${tool} | ${count} |`);
  }
```

⚠️ Potential issue | 🟡 Minor

Tool usage is deduped per scenario but labeled as “Times Called.”
toolUsage counts unique tools per scenario (from metadata.toolsUsed), not actual call counts. Either track raw call counts or rename the column to avoid misleading metrics.

📝 Minimal label fix
```diff
-  lines.push("| Tool | Times Called |");
-  lines.push("|------|--------------|");
+  lines.push("| Tool | Scenarios Used |");
+  lines.push("|------|----------------|");
```
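
If the "Times Called" label is kept instead, the aggregation would need to count every invocation rather than unique tools per scenario. A sketch under the assumption that results expose a per-call list; `toolCalls` is hypothetical, since the PR only shows the deduplicated `toolsUsed` array:

```ts
// Hypothetical alternative: count every tool invocation across all scenarios.
// `toolCalls` is an assumed per-call list; `toolsUsed` (unique per scenario)
// is the field shown elsewhere in this PR.
function countToolUsage(
  results: { metadata: { toolsUsed: string[]; toolCalls?: string[] } }[],
): Record<string, number> {
  const toolUsage: Record<string, number> = {};
  for (const result of results) {
    // Fall back to the deduplicated list when no per-call data exists.
    for (const tool of result.metadata.toolCalls ?? result.metadata.toolsUsed) {
      toolUsage[tool] = (toolUsage[tool] ?? 0) + 1;
    }
  }
  return toolUsage;
}
```

With `toolsUsed` this counts scenarios that used each tool; with a raw per-call list it genuinely counts calls, matching the existing header.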

Comment on lines +191 to +447
describe("Scenario Generation", () => {
it("should generate diverse messages for EXPRESS_WANT need", async () => {
const generator = new ScenarioGenerator();
const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 3);

expect(scenarios.length).toBe(3);

// Each scenario should have different persona
const personas = new Set(scenarios.map((s) => s.persona.id));
expect(personas.size).toBeGreaterThan(1);

// Messages should be non-empty and different
const messages = scenarios.map((s) => s.generatedMessage);
expect(messages.every((m) => m.length > 5)).toBe(true);

console.log("\nGenerated EXPRESS_WANT scenarios:");
for (const s of scenarios) {
console.log(` [${s.persona.id}]: "${s.generatedMessage}"`);
}
}, 60000);

it("should generate journey scenarios with context progression", async () => {
const generator = new ScenarioGenerator();
const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER");

expect(scenarios.length).toBe(3); // ESTABLISH_PRESENCE, EXPRESS_WANT, FIND_PEOPLE

// Context should evolve
expect(scenarios[0].context.hasProfile).toBe(false);
expect(scenarios[1].context.hasProfile).toBe(true); // After ESTABLISH_PRESENCE

console.log("\nGenerated ONBOARDING_FLOW journey:");
for (const s of scenarios) {
console.log(` [${s.need.id}]: "${s.generatedMessage}"`);
}
}, 90000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// SINGLE NEED FULFILLMENT TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - Single Needs", () => {
let database: ChatGraphCompositeDatabase;
let chatAgent: ChatAgentInterface;
let generator: ScenarioGenerator;

beforeAll(() => {
database = createStatefulMockDatabase();
chatAgent = createChatAgentAdapter(database);
generator = new ScenarioGenerator();
});

it("should fulfill EXPRESS_WANT need", async () => {
const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log("\n=== EXPRESS_WANT Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Score: ${result.evaluation.fulfillmentScore}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);
console.log(`Tools: ${result.metadata.toolsUsed.join(", ")}`);

// We expect success or partial - the agent should at least try
expect(["success", "partial"]).toContain(result.evaluation.overallVerdict);
}, 120000);

it("should fulfill FIND_PEOPLE need", async () => {
const scenarios = await generator.generateScenariosForNeed("FIND_PEOPLE", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log("\n=== FIND_PEOPLE Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);

// Agent should use discovery tool
expect(result.metadata.toolsUsed.length).toBeGreaterThan(0);
}, 120000);

it("should handle UNDERSTAND_SYSTEM need without tools", async () => {
const scenarios = await generator.generateScenariosForNeed("UNDERSTAND_SYSTEM", 1);
const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});

console.log("\n=== UNDERSTAND_SYSTEM Result ===");
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Reasoning: ${result.evaluation.reasoning}`);

// Should have a conversation at minimum
expect(result.conversation.length).toBeGreaterThan(0);
}, 90000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// PERSONA VARIATION TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - Persona Variations", () => {
let database: ChatGraphCompositeDatabase;
let chatAgent: ChatAgentInterface;
let generator: ScenarioGenerator;

beforeAll(() => {
database = createStatefulMockDatabase();
chatAgent = createChatAgentAdapter(database);
generator = new ScenarioGenerator();
});

const testPersonas: UserPersonaId[] = ["BUSY_FOUNDER", "VAGUE_USER", "NON_NATIVE_SPEAKER"];

for (const personaId of testPersonas) {
it(`should handle ${personaId} persona for EXPRESS_WANT`, async () => {
const need = USER_NEEDS.EXPRESS_WANT;
const persona = USER_PERSONAS[personaId];

const scenario: GeneratedScenario = {
id: `test-${personaId}`,
need,
persona,
context: { hasProfile: true, hasIntents: false, isIndexOwner: false },
generatedMessage: await generator.generateMessage(need, persona, {
hasProfile: true,
hasIntents: false,
isIndexOwner: false,
}),
evaluationCriteria: {
needFulfilled: need.description,
successSignals: need.successSignals,
failureSignals: need.failureSignals,
qualityFactors: ["Adapted to user's communication style"],
},
};

const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 3,
timeoutMs: 90000,
});

console.log(`\n=== ${personaId} Result ===`);
console.log(`Message: "${scenario.generatedMessage}"`);
console.log(`Verdict: ${result.evaluation.overallVerdict}`);
console.log(`Quality: ${result.evaluation.qualityScore}`);

// Agent should handle all personas - conversation should happen
expect(result.conversation.length).toBeGreaterThan(0);
}, 120000);
}
});

// ═══════════════════════════════════════════════════════════════════════════════
// JOURNEY TESTS
// ═══════════════════════════════════════════════════════════════════════════════

describe("Need Fulfillment - User Journeys", () => {
let generator: ScenarioGenerator;

beforeAll(() => {
generator = new ScenarioGenerator();
});

it("should complete ONBOARDING_FLOW journey", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);

const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER");

console.log("\n=== ONBOARDING_FLOW Journey ===");

const results = [];
for (const scenario of scenarios) {
// Don't reset between journey steps - maintain context
const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});
results.push(result);
}

// At least some conversations should happen
const withConversation = results.filter((r) => r.conversation.length > 0).length;
console.log(`\nJourney conversations: ${withConversation}/${results.length}`);

expect(withConversation).toBeGreaterThanOrEqual(2);
}, 240000);

it("should complete INTENT_LIFECYCLE journey", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);

const scenarios = await generator.generateJourneyScenario("INTENT_LIFECYCLE", "POWER_USER");

console.log("\n=== INTENT_LIFECYCLE Journey ===");

const results = [];
for (const scenario of scenarios) {
const result = await runNeedFulfillmentTest(scenario, chatAgent, {
verbose: true,
maxTurns: 2,
timeoutMs: 60000,
});
results.push(result);
}

const withConversation = results.filter((r) => r.conversation.length > 0).length;
console.log(`\nJourney conversations: ${withConversation}/${results.length}`);

expect(withConversation).toBeGreaterThanOrEqual(2);
}, 240000);
});

// ═══════════════════════════════════════════════════════════════════════════════
// COMPREHENSIVE TEST SUITE
// ═══════════════════════════════════════════════════════════════════════════════

describe("Comprehensive Test Suite", () => {
it("should run full test suite across all needs", async () => {
const database = createStatefulMockDatabase();
const chatAgent = createChatAgentAdapter(database);
const generator = new ScenarioGenerator();

// Only test 2 needs for speed
const testNeeds: UserNeedId[] = ["EXPRESS_WANT", "UNDERSTAND_SYSTEM"];

const allScenarios: GeneratedScenario[] = [];
for (const needId of testNeeds) {
const scenarios = await generator.generateScenariosForNeed(needId, 1);
allScenarios.push(...scenarios);
}

console.log(`\nRunning ${allScenarios.length} scenarios...`);

const { results, summary } = await runTestSuite(allScenarios, chatAgent, { verbose: false });

console.log("\n=== TEST SUITE SUMMARY ===");
console.log(`Total: ${summary.total}`);
console.log(`Success: ${summary.success}`);
console.log(`Partial: ${summary.partial}`);
console.log(`Failure: ${summary.failure}`);
console.log(`Blocked: ${summary.blocked}`);

// At least some conversations should complete
expect(results.some((r) => r.conversation.length > 0)).toBe(true);
}, 300000);
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: indexnetwork/index; command outputs collapsed):

```sh
# Find ScenarioGenerator and NeedFulfillmentEvaluator implementations
fd -t f "\.ts$" | xargs rg "class ScenarioGenerator|class NeedFulfillmentEvaluator" -l

# Look for the spec file and related test utilities
fd -t f "chat\.evaluator\.spec\.ts"

# Check imports in the test file to understand dependencies
head -50 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Search for createStatefulMockDatabase and related mock utilities
rg "createStatefulMockDatabase|createChatAgentAdapter" -l

# Check CI configuration
fd -t f "\.(yml|yaml|json)$" | xargs rg "OPENROUTER_API_KEY" -l 2>/dev/null || echo "No CI files found with OPENROUTER_API_KEY"

# Get line count of chat.evaluator.ts to understand its size
wc -l protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check ScenarioGenerator class definition and key methods
rg "class ScenarioGenerator|async generateScenariosForNeed|async generateMessage" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 10

# Look for LLM-related imports (OpenRouter, ChatOpenAI, etc.) in chat.evaluator.ts
rg "import.*openrouter|import.*ChatOpenAI|import.*LLM|from.*langchain" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check for environment variable usage in chat.evaluator.ts
rg "OPENROUTER_API_KEY|process\.env\." protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check if there's any describe.skip or test gating in the spec file
rg "describe\.skip|it\.skip|process\.env\.|hasOpenRouterKey" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Look for any mock implementations or jest mock setup
rg "jest\.mock|vi\.mock|mock|Mock" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check for .env.development or other env handling at start of spec
head -20 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Verify NeedFulfillmentEvaluator also uses LLM
rg "class NeedFulfillmentEvaluator" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 15

# Check git history or comments about test intent (integration vs unit)
rg "integration|unit test|TODO|FIXME" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts -B 2

# Check if there are any .env files in the repo
find . -name ".env*" -type f 2>/dev/null | head -20

# Look for CI configuration files (GitHub Actions, etc.)
fd -t f "(\.github|\.gitlab-ci|\.circleci|jenkins|travis)" | head -20

# Check if GitHub Actions workflow references these test files
find .github -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null | rg "chat\.evaluator|OPENROUTER" || echo "No GitHub Actions workflows found"

# Verify that env loading happens at top level and will fail if key missing
sed -n '1,20p' protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check how ChatOpenAI is instantiated - will it throw if apiKey is undefined?
rg "new ChatOpenAI" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -B 2 -A 8

# Verify if tests will fail when OPENROUTER_API_KEY is missing
rg "generateMessage|generateScenariosForNeed" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 5 | head -40
```

Gate or mock live LLM calls in tests to avoid CI failures.

These tests make direct external LLM calls via ScenarioGenerator, NeedFulfillmentEvaluator, and NeedFulfillmentTest, which instantiate ChatOpenAI with process.env.OPENROUTER_API_KEY. Without this environment variable, the tests will fail immediately. Since there's no CI configuration providing this key and no gating mechanism, these tests will block CI runs.

Gate these tests to run only when the API key is present, or inject a mock model for CI environments:

🔧 Example gating pattern (apply to all describe blocks)
```diff
+const hasOpenRouterKey = Boolean(process.env.OPENROUTER_API_KEY);
+const describeLLM = hasOpenRouterKey ? describe : describe.skip;

-describe("Scenario Generation", () => {
+describeLLM("Scenario Generation", () => {
   it("should generate diverse messages for EXPRESS_WANT need", async () => {
     ...
   });
 });
```
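
The other option mentioned above, injecting a deterministic mock agent for CI, could look roughly like this. A sketch only: the adapter surface (chat/reset returning a response plus toolsUsed) is inferred from the spec and sequence diagram, not copied from the PR:

```ts
// Sketch of a deterministic agent for CI runs without OPENROUTER_API_KEY.
// The shape shown here is an assumption about ChatAgentInterface.
const mockChatAgent = {
  async chat(userMessage: string) {
    // Canned behavior: pretend every message is captured as an intent.
    return {
      response: `Noted: "${userMessage}". I've recorded this as an intent.`,
      toolsUsed: ["create_intent"],
    };
  },
  async reset() {
    // Nothing to clear in the canned mock.
  },
};

// In the spec, choose the real adapter only when the key is present:
// const chatAgent = process.env.OPENROUTER_API_KEY
//   ? createChatAgentAdapter(createStatefulMockDatabase())
//   : mockChatAgent;
```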

Comment on lines +939 to +945
```ts
  if (options?.parallel) {
    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
    results.push(...(await Promise.all(promises)));
  } else {
    for (const scenario of scenarios) {
      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
    }
```

⚠️ Potential issue | 🟠 Major

Parallel runTestSuite shares a mutable chatAgent and can corrupt results.
runNeedFulfillmentTest resets and mutates the shared agent; running it in Promise.all races state across scenarios. Consider requiring a factory that returns a fresh agent per scenario, or disable parallel mode.

🧯 Minimal safety guard
```diff
-  if (options?.parallel) {
-    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
-    results.push(...(await Promise.all(promises)));
-  } else {
-    for (const scenario of scenarios) {
-      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
-    }
-  }
+  if (options?.parallel) {
+    throw new Error("Parallel execution requires a dedicated ChatAgent per scenario");
+  }
+  for (const scenario of scenarios) {
+    results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
+  }
```
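
The factory-based alternative suggested in the comment could look roughly like this. A sketch under the assumption that `GeneratedScenario`, `ChatAgentInterface`, and `runNeedFulfillmentTest` from the PR are in scope; the `agentFactory` parameter and the helper's name are hypothetical:

```ts
// Hypothetical sketch: give each parallel scenario its own agent via a factory,
// so runNeedFulfillmentTest never resets or mutates shared state across Promises.
async function runTestSuiteParallel(
  scenarios: GeneratedScenario[],
  agentFactory: () => ChatAgentInterface,
  testOptions?: { verbose?: boolean; maxTurns?: number; timeoutMs?: number },
) {
  return Promise.all(
    scenarios.map((scenario) =>
      // A fresh agent per scenario removes the shared-mutable-state race.
      runNeedFulfillmentTest(scenario, agentFactory(), testOptions),
    ),
  );
}

// Example call site: the factory builds an isolated mock DB + adapter per run.
// const results = await runTestSuiteParallel(scenarios, () =>
//   createChatAgentAdapter(createStatefulMockDatabase()),
// );
```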
