Add chat evaluation reports and evaluator implementation #387
serefyarar wants to merge 1 commit into dev from
Conversation
Added new chat evaluation reports and results in the protocol directory, including markdown and JSON files for multiple scenarios. Introduced chat evaluator implementation and corresponding tests under src/lib/protocol/graphs/chat, providing the core logic for running and testing chat evaluation flows.
📝 Walkthrough

A comprehensive chat agent evaluation framework is introduced with scenario generation, need-fulfillment assessment, simulated user interaction, and test orchestration that produces structured reports and aggregated metrics from multiple evaluation runs.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as Evaluation Main
    participant ScenarioGen as ScenarioGenerator
    participant ChatAgent as ChatAgentAdapter
    participant SimUser as SimulatedUser
    participant Evaluator as NeedFulfillmentEvaluator
    participant DB as Mock Database
    participant Reporter as ReportGenerator

    Main->>ScenarioGen: generateMessage(need, persona)
    ScenarioGen-->>Main: generatedMessage
    Main->>SimUser: new SimulatedUser(scenario)
    SimUser-->>Main: initialized
    Main->>ChatAgent: reset()
    ChatAgent->>DB: initialize state
    loop Conversation Turns
        Main->>SimUser: getInitialMessage()
        SimUser-->>Main: userMessage
        Main->>ChatAgent: chat(userMessage)
        ChatAgent->>DB: execute graph operations
        DB-->>ChatAgent: response + toolsUsed
        ChatAgent-->>Main: response, toolsUsed
        Main->>SimUser: respond(assistantMessage)
        SimUser-->>Main: decision to continue
    end
    Main->>Evaluator: evaluate(scenario, conversation, toolsUsed)
    Evaluator-->>Main: evaluation result (verdict, score, signals)
    Main->>Reporter: generateReport(results)
    Reporter-->>Main: markdown report
    Main->>Reporter: generateConversationsLog(results)
    Reporter-->>Main: conversation log
```
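Read against the diagram, a single run wires these pieces together roughly as follows. This is a minimal sketch based on the names that appear in `chat.evaluator.spec.ts`; the import path and the exact option and result shapes are assumptions, not the evaluator's verbatim API.

```typescript
// Sketch only: assumes the exports exercised by the spec file; verify the real signatures.
import {
  ScenarioGenerator,
  createChatAgentAdapter,
  createStatefulMockDatabase,
  runNeedFulfillmentTest,
} from "./chat.evaluator"; // assumed path

async function runOneScenario() {
  // 1. Generate a scenario (need + persona + message) via the LLM-backed generator
  const generator = new ScenarioGenerator();
  const [scenario] = await generator.generateScenariosForNeed("EXPRESS_WANT", 1);

  // 2. Fresh mock database behind the chat agent adapter, as in the diagram
  const chatAgent = createChatAgentAdapter(createStatefulMockDatabase());

  // 3. Drive the simulated-user loop and the need-fulfillment evaluation
  const result = await runNeedFulfillmentTest(scenario, chatAgent, { maxTurns: 3 });

  console.log(result.evaluation.overallVerdict, result.metadata.toolsUsed);
  return result;
}
```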
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-01-12.md`:
- Around line 827-833: The recommendations list generator is producing
non-sequential numbering (1,2,4) because the code that assembles or renders the
recommendations list skips an index; locate the evaluator function responsible
for producing the recommendations block (e.g., generateRecommendations,
buildRecommendationsList, or format_recommendations) and fix the indexing logic
so it emits sequential numbers: ensure the counter is initialized once and
incremented for each recommendation, avoid using sparse keys or filtered arrays
that preserve original indices without reindexing, and update the renderer to
enumerate items by their current position rather than original IDs so item 3 is
not omitted.
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md`:
- Around line 786-792: The numbered list under the "Recommendations" header
currently starts at 2 and skips 1 (see items beginning with "2. **Improve
discovery error handling:**" etc.); update the report formatter or the markdown
content so the recommendations are sequentially numbered beginning at 1 (rename
"2." → "1.", "3." → "2.", "4." → "3."), ensure any cross-references or IDs
linked to these recommendation items are updated accordingly, and run the
formatter that generates this section (the function/component that emits the
"Recommendations" block) to prevent future off-by-one numbering regressions.
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts`:
- Around line 658-681: The generated markdown fenced code blocks are missing a
language identifier which triggers MD040; update the places that push the
opening fence (currently lines.push("```")) to use a text identifier
(lines.push("```text")) where the conversation block is built (look for the code
that inspects result.conversation and pushes "```" before and after the loop) so
both the empty-conversation branch and the loop branch emit "```text" as the
opening fence while keeping the closing fence as "```".
- Around line 734-741: The table incorrectly labels stats.toolUsage as "Times
Called" even though stats.toolUsage is built from per-scenario unique
metadata.toolsUsed; update the presentation or the metric: either rename the
column header to something like "Scenarios Used" (change the string literal in
chat.evaluator.run.ts where the header row is built) to accurately reflect the
data, or change the aggregation that populates stats.toolUsage to count every
invocation (modify the code that collects metadata.toolsUsed into
stats.toolUsage to increment per-call rather than per-scenario) and ensure the
variable name and sorting still match the rest of the logic; update any
references to stats.toolUsage accordingly.
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts`:
- Around line 191-447: The tests call real LLMs (via ScenarioGenerator,
NeedFulfillmentEvaluator, NeedFulfillmentTest and ChatOpenAI) and will fail in
CI without OPENROUTER_API_KEY; update the spec to gate or mock these calls by
checking process.env.OPENROUTER_API_KEY in each top-level describe (or in a
shared beforeAll) and skip the suite when absent, or inject a mock model/agent
from createChatAgentAdapter/runNeedFulfillmentTest/runTestSuite when the env var
is missing; ensure the gating uses the unique symbols ScenarioGenerator,
runNeedFulfillmentTest, runTestSuite, createChatAgentAdapter, and ChatOpenAI so
tests either use a deterministic mock implementation in CI or are skipped when
the API key is not available.
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts`:
- Around line 939-945: The parallel branch uses a shared mutable chatAgent with
runNeedFulfillmentTest which resets/mutates the agent and causes race
conditions; change the API or call site to supply a fresh agent per scenario
(e.g., accept an agentFactory() instead of chatAgent, or clone/initialize a new
agent for each scenario inside the parallel branch) and update the parallel path
to call agentFactory() for each Promise, or alternatively disable/throw when
options?.parallel is true and no factory is provided; refer to
runNeedFulfillmentTest and the local chatAgent variable to locate where to
inject the factory/clone logic or the validation that prevents unsafe parallel
execution.
🧹 Nitpick comments (4)
protocol/chat-eval-results-2026-02-03T01-42-49-355Z.json (1)
1-909: Inconsistent file location for evaluation results.

This JSON results file is placed in the `protocol/` root, while other evaluation artifacts (like `chat-eval-results-2026-02-03T02-06-25.json`) are in `protocol/eval-reports/`. Consider moving this file to `protocol/eval-reports/` for consistent organization.

Additionally, consider whether these timestamped evaluation result files should be committed to the repository at all. If they are generated artifacts from running the evaluator, they may be better suited for:
- A `.gitignore` entry (if they're local development artifacts)
- CI artifact storage (if they're needed for historical tracking)
protocol/eval-reports/chat-eval-conversations-2026-02-03T02-06-25.md (1)
29-40: Consider adding a language specifier to fenced code blocks.

The conversation transcript code blocks lack a language specifier, triggering markdownlint warnings. While the content renders correctly, adding `text` or `plaintext` as the language would satisfy linters and improve consistency.

Example change:

```diff
-```
+```text
 👤 USER: I'm looking to hire ML engineers for my startup
 ...
```

This pattern applies to all 24 conversation blocks throughout the file.
protocol/chat-eval-report-2026-02-03T01-42-49-355Z.md (1)
1-804: Inconsistent file location for evaluation report.

This report file is located in the `protocol/` root, while similar reports (e.g., `chat-eval-report-2026-02-03T02-06-25.md`) are in `protocol/eval-reports/`. Move this file to `protocol/eval-reports/` for consistent organization.

protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts (1)
27-94: Align mock intent shape with origin fields and tighten typing.

The mock state relies heavily on `any` casts and `as unknown as`, which weakens strict mode, and the created intent lacks origin tracking fields. This can mask schema-dependent behavior in tests/harness runs. Consider a typed mock state and include `sourceType`/`sourceId` in created intents.

🔧 Suggested adjustment for intent origin fields

```diff
 const intent = {
   id: `intent-${Date.now()}-${Math.random().toString(36).slice(2)}`,
   payload: data.payload,
   summary: data.payload.slice(0, 100),
   userId: data.userId,
   createdAt: new Date(),
   updatedAt: new Date(),
   deletedAt: null,
   isIncognito: false,
+  sourceType: data.sourceType ?? "test",
+  sourceId: data.sourceId ?? data.userId,
 };
```

As per coding guidelines, use strict TypeScript mode for all code; intents must track their origin via polymorphic sourceType and sourceId fields.
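For the typed-mock direction, a small factory along these lines could replace the `any`-heavy state. The `MockIntent` and `CreateIntentInput` shapes below are illustrative, not the project's actual types, and would need to be aligned with the real intent schema.

```typescript
// Illustrative types only; align with the real intent schema before adopting.
interface MockIntent {
  id: string;
  payload: string;
  summary: string;
  userId: string;
  createdAt: Date;
  updatedAt: Date;
  deletedAt: Date | null;
  isIncognito: boolean;
  sourceType: string; // origin tracking, per the guideline cited above
  sourceId: string;
}

interface CreateIntentInput {
  payload: string;
  userId: string;
  sourceType?: string;
  sourceId?: string;
}

function createMockIntent(data: CreateIntentInput): MockIntent {
  const now = new Date();
  return {
    id: `intent-${Date.now()}-${Math.random().toString(36).slice(2)}`,
    payload: data.payload,
    summary: data.payload.slice(0, 100),
    userId: data.userId,
    createdAt: now,
    updatedAt: now,
    deletedAt: null,
    isIncognito: false,
    sourceType: data.sourceType ?? "test",
    sourceId: data.sourceId ?? data.userId,
  };
}
```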
```markdown
## Recommendations

1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
Recommendations numbering skips item 3.
The recommendations are numbered 1, 2, 4 - missing item 3. This is likely a bug in the report generation logic that should be fixed in the evaluator code.
Proposed fix

```diff
 ## Recommendations

 1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

 2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
## Recommendations

1. **Rebalance tool selection:** The agent calls `find_opportunities` much more than `create_intent`. When users express a want/need, the agent should first capture it as an intent before searching.

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
🤖 Prompt for AI Agents
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-01-12.md` around lines
827 - 833, The recommendations list generator is producing non-sequential
numbering (1,2,4) because the code that assembles or renders the recommendations
list skips an index; locate the evaluator function responsible for producing the
recommendations block (e.g., generateRecommendations, buildRecommendationsList,
or format_recommendations) and fix the indexing logic so it emits sequential
numbers: ensure the counter is initialized once and incremented for each
recommendation, avoid using sparse keys or filtered arrays that preserve
original indices without reindexing, and update the renderer to enumerate items
by their current position rather than original IDs so item 3 is not omitted.
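One way to make this class of bug structurally impossible is to number recommendations at render time rather than carrying fixed item numbers through filtering. A sketch under that assumption; the `Recommendation` shape and `applies` flag are hypothetical, not the evaluator's actual types:

```typescript
interface Recommendation {
  title: string;
  detail: string;
  applies: boolean; // e.g. only emit when the related failure pattern was observed
}

function renderRecommendations(recommendations: Recommendation[]): string[] {
  const lines: string[] = ["## Recommendations", ""];
  // Filter first, then number by current position so the list is always 1..n
  recommendations
    .filter((r) => r.applies)
    .forEach((r, index) => {
      lines.push(`${index + 1}. **${r.title}:** ${r.detail}`);
      lines.push("");
    });
  return lines;
}
```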
```markdown
## Recommendations

2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
Recommendations numbering is incorrect.
The recommendations section starts at item 2 and skips item 1. This appears to be a generation bug in the report formatter.
Proposed fix

```diff
 ## Recommendations

-2. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).
+1. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

-3. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.
+2. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

-4. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
+3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
## Recommendations

1. **Improve discovery error handling:** Most discovery scenarios fail. The agent should gracefully handle empty results and offer alternatives (e.g., create an intent to be notified when matches appear).

2. **Distinguish offers from wants:** The agent treats user offers (e.g., 'I do freelance work') the same as wants. It should recognize offers and create appropriate intents.

3. **Improve intent verification:** Some valid user intents are being dropped by the semantic verifier. Review the ASSERTIVE classification logic.
```
🤖 Prompt for AI Agents
In `@protocol/eval-reports/chat-eval-report-2026-02-03T02-06-25.md` around lines
786 - 792, The numbered list under the "Recommendations" header currently starts
at 2 and skips 1 (see items beginning with "2. **Improve discovery error
handling:**" etc.); update the report formatter or the markdown content so the
recommendations are sequentially numbered beginning at 1 (rename "2." → "1.",
"3." → "2.", "4." → "3."), ensure any cross-references or IDs linked to these
recommendation items are updated accordingly, and run the formatter that
generates this section (the function/component that emits the "Recommendations"
block) to prevent future off-by-one numbering regressions.
| lines.push("**Conversation:**"); | ||
| lines.push(""); | ||
| lines.push("```"); | ||
|
|
||
| if (result.conversation.length === 0) { | ||
| lines.push("(No conversation recorded)"); | ||
| } else { | ||
| for (const turn of result.conversation) { | ||
| const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT"; | ||
| lines.push(`${prefix}:`); | ||
| // Wrap long lines | ||
| const content = turn.content.split("\n").map(line => { | ||
| if (line.length > 100) { | ||
| return line.match(/.{1,100}/g)?.join("\n ") || line; | ||
| } | ||
| return line; | ||
| }).join("\n"); | ||
| lines.push(content); | ||
| lines.push(""); | ||
| } | ||
| } | ||
|
|
||
| lines.push("```"); | ||
| lines.push(""); |
Add language identifiers to fenced code blocks.
Generated reports/logs emit fenced blocks without a language, triggering MD040 warnings. Consider using `text` as the language identifier for these snippets.
🔧 Suggested change

```diff
-      lines.push("```");
+      lines.push("```text");
       if (result.conversation.length === 0) {
         lines.push("(No conversation recorded)");
       } else {
         for (const turn of result.conversation) {
           const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
           lines.push(`${prefix}:`);
           ...
         }
       }
       lines.push("```");
```

```diff
-      lines.push("```");
+      lines.push("```text");
       for (const turn of result.conversation) {
         const prefix = turn.role === "user" ? "👤 USER" : "🤖 ASSISTANT";
         lines.push(`${prefix}:`);
         lines.push(turn.content);
         lines.push("");
       }
       lines.push("```");
```

🤖 Prompt for AI Agents
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts` around lines 658
- 681, The generated markdown fenced code blocks are missing a language
identifier which triggers MD040; update the places that push the opening fence
(currently lines.push("```")) to use a text identifier (lines.push("```text"))
where the conversation block is built (look for the code that inspects
result.conversation and pushes "```" before and after the loop) so both the
empty-conversation branch and the loop branch emit "```text" as the opening
fence while keeping the closing fence as "```".
```typescript
  // Tool Usage
  lines.push("## Tool Usage Patterns");
  lines.push("");
  lines.push("| Tool | Times Called |");
  lines.push("|------|--------------|");
  for (const [tool, count] of Object.entries(stats.toolUsage).sort((a, b) => b[1] - a[1])) {
    lines.push(`| ${tool} | ${count} |`);
  }
```
Tool usage is deduped per scenario but labeled as “Times Called.”
toolUsage counts unique tools per scenario (from metadata.toolsUsed), not actual call counts. Either track raw call counts or rename the column to avoid misleading metrics.
📝 Minimal label fix

```diff
-  lines.push("| Tool | Times Called |");
-  lines.push("|------|--------------|");
+  lines.push("| Tool | Scenarios Used |");
+  lines.push("|------|----------------|");
```

🤖 Prompt for AI Agents
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.run.ts` around lines 734
- 741, The table incorrectly labels stats.toolUsage as "Times Called" even
though stats.toolUsage is built from per-scenario unique metadata.toolsUsed;
update the presentation or the metric: either rename the column header to
something like "Scenarios Used" (change the string literal in
chat.evaluator.run.ts where the header row is built) to accurately reflect the
data, or change the aggregation that populates stats.toolUsage to count every
invocation (modify the code that collects metadata.toolsUsed into
stats.toolUsage to increment per-call rather than per-scenario) and ensure the
variable name and sorting still match the rest of the logic; update any
references to stats.toolUsage accordingly.
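If the preferred fix is accurate counts rather than a relabel, the aggregation could record every invocation. A sketch, assuming a per-call list such as `metadata.toolCalls` exists or is added (the current metadata only carries the deduplicated `toolsUsed`):

```typescript
// Hypothetical: requires recording every tool invocation, not just unique tool names.
function aggregateToolCalls(
  results: { metadata: { toolCalls: string[] } }[],
): Record<string, number> {
  const toolUsage: Record<string, number> = {};
  for (const result of results) {
    for (const tool of result.metadata.toolCalls) {
      toolUsage[tool] = (toolUsage[tool] ?? 0) + 1; // increment per call
    }
  }
  return toolUsage;
}
```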
| describe("Scenario Generation", () => { | ||
| it("should generate diverse messages for EXPRESS_WANT need", async () => { | ||
| const generator = new ScenarioGenerator(); | ||
| const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 3); | ||
|
|
||
| expect(scenarios.length).toBe(3); | ||
|
|
||
| // Each scenario should have different persona | ||
| const personas = new Set(scenarios.map((s) => s.persona.id)); | ||
| expect(personas.size).toBeGreaterThan(1); | ||
|
|
||
| // Messages should be non-empty and different | ||
| const messages = scenarios.map((s) => s.generatedMessage); | ||
| expect(messages.every((m) => m.length > 5)).toBe(true); | ||
|
|
||
| console.log("\nGenerated EXPRESS_WANT scenarios:"); | ||
| for (const s of scenarios) { | ||
| console.log(` [${s.persona.id}]: "${s.generatedMessage}"`); | ||
| } | ||
| }, 60000); | ||
|
|
||
| it("should generate journey scenarios with context progression", async () => { | ||
| const generator = new ScenarioGenerator(); | ||
| const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER"); | ||
|
|
||
| expect(scenarios.length).toBe(3); // ESTABLISH_PRESENCE, EXPRESS_WANT, FIND_PEOPLE | ||
|
|
||
| // Context should evolve | ||
| expect(scenarios[0].context.hasProfile).toBe(false); | ||
| expect(scenarios[1].context.hasProfile).toBe(true); // After ESTABLISH_PRESENCE | ||
|
|
||
| console.log("\nGenerated ONBOARDING_FLOW journey:"); | ||
| for (const s of scenarios) { | ||
| console.log(` [${s.need.id}]: "${s.generatedMessage}"`); | ||
| } | ||
| }, 90000); | ||
| }); | ||
|
|
||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
| // SINGLE NEED FULFILLMENT TESTS | ||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
|
|
||
| describe("Need Fulfillment - Single Needs", () => { | ||
| let database: ChatGraphCompositeDatabase; | ||
| let chatAgent: ChatAgentInterface; | ||
| let generator: ScenarioGenerator; | ||
|
|
||
| beforeAll(() => { | ||
| database = createStatefulMockDatabase(); | ||
| chatAgent = createChatAgentAdapter(database); | ||
| generator = new ScenarioGenerator(); | ||
| }); | ||
|
|
||
| it("should fulfill EXPRESS_WANT need", async () => { | ||
| const scenarios = await generator.generateScenariosForNeed("EXPRESS_WANT", 1); | ||
| const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 3, | ||
| timeoutMs: 90000, | ||
| }); | ||
|
|
||
| console.log("\n=== EXPRESS_WANT Result ==="); | ||
| console.log(`Verdict: ${result.evaluation.overallVerdict}`); | ||
| console.log(`Score: ${result.evaluation.fulfillmentScore}`); | ||
| console.log(`Reasoning: ${result.evaluation.reasoning}`); | ||
| console.log(`Tools: ${result.metadata.toolsUsed.join(", ")}`); | ||
|
|
||
| // We expect success or partial - the agent should at least try | ||
| expect(["success", "partial"]).toContain(result.evaluation.overallVerdict); | ||
| }, 120000); | ||
|
|
||
| it("should fulfill FIND_PEOPLE need", async () => { | ||
| const scenarios = await generator.generateScenariosForNeed("FIND_PEOPLE", 1); | ||
| const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 3, | ||
| timeoutMs: 90000, | ||
| }); | ||
|
|
||
| console.log("\n=== FIND_PEOPLE Result ==="); | ||
| console.log(`Verdict: ${result.evaluation.overallVerdict}`); | ||
| console.log(`Reasoning: ${result.evaluation.reasoning}`); | ||
|
|
||
| // Agent should use discovery tool | ||
| expect(result.metadata.toolsUsed.length).toBeGreaterThan(0); | ||
| }, 120000); | ||
|
|
||
| it("should handle UNDERSTAND_SYSTEM need without tools", async () => { | ||
| const scenarios = await generator.generateScenariosForNeed("UNDERSTAND_SYSTEM", 1); | ||
| const result = await runNeedFulfillmentTest(scenarios[0], chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 2, | ||
| timeoutMs: 60000, | ||
| }); | ||
|
|
||
| console.log("\n=== UNDERSTAND_SYSTEM Result ==="); | ||
| console.log(`Verdict: ${result.evaluation.overallVerdict}`); | ||
| console.log(`Reasoning: ${result.evaluation.reasoning}`); | ||
|
|
||
| // Should have a conversation at minimum | ||
| expect(result.conversation.length).toBeGreaterThan(0); | ||
| }, 90000); | ||
| }); | ||
|
|
||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
| // PERSONA VARIATION TESTS | ||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
|
|
||
| describe("Need Fulfillment - Persona Variations", () => { | ||
| let database: ChatGraphCompositeDatabase; | ||
| let chatAgent: ChatAgentInterface; | ||
| let generator: ScenarioGenerator; | ||
|
|
||
| beforeAll(() => { | ||
| database = createStatefulMockDatabase(); | ||
| chatAgent = createChatAgentAdapter(database); | ||
| generator = new ScenarioGenerator(); | ||
| }); | ||
|
|
||
| const testPersonas: UserPersonaId[] = ["BUSY_FOUNDER", "VAGUE_USER", "NON_NATIVE_SPEAKER"]; | ||
|
|
||
| for (const personaId of testPersonas) { | ||
| it(`should handle ${personaId} persona for EXPRESS_WANT`, async () => { | ||
| const need = USER_NEEDS.EXPRESS_WANT; | ||
| const persona = USER_PERSONAS[personaId]; | ||
|
|
||
| const scenario: GeneratedScenario = { | ||
| id: `test-${personaId}`, | ||
| need, | ||
| persona, | ||
| context: { hasProfile: true, hasIntents: false, isIndexOwner: false }, | ||
| generatedMessage: await generator.generateMessage(need, persona, { | ||
| hasProfile: true, | ||
| hasIntents: false, | ||
| isIndexOwner: false, | ||
| }), | ||
| evaluationCriteria: { | ||
| needFulfilled: need.description, | ||
| successSignals: need.successSignals, | ||
| failureSignals: need.failureSignals, | ||
| qualityFactors: ["Adapted to user's communication style"], | ||
| }, | ||
| }; | ||
|
|
||
| const result = await runNeedFulfillmentTest(scenario, chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 3, | ||
| timeoutMs: 90000, | ||
| }); | ||
|
|
||
| console.log(`\n=== ${personaId} Result ===`); | ||
| console.log(`Message: "${scenario.generatedMessage}"`); | ||
| console.log(`Verdict: ${result.evaluation.overallVerdict}`); | ||
| console.log(`Quality: ${result.evaluation.qualityScore}`); | ||
|
|
||
| // Agent should handle all personas - conversation should happen | ||
| expect(result.conversation.length).toBeGreaterThan(0); | ||
| }, 120000); | ||
| } | ||
| }); | ||
|
|
||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
| // JOURNEY TESTS | ||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
|
|
||
| describe("Need Fulfillment - User Journeys", () => { | ||
| let generator: ScenarioGenerator; | ||
|
|
||
| beforeAll(() => { | ||
| generator = new ScenarioGenerator(); | ||
| }); | ||
|
|
||
| it("should complete ONBOARDING_FLOW journey", async () => { | ||
| const database = createStatefulMockDatabase(); | ||
| const chatAgent = createChatAgentAdapter(database); | ||
|
|
||
| const scenarios = await generator.generateJourneyScenario("ONBOARDING_FLOW", "NEW_USER"); | ||
|
|
||
| console.log("\n=== ONBOARDING_FLOW Journey ==="); | ||
|
|
||
| const results = []; | ||
| for (const scenario of scenarios) { | ||
| // Don't reset between journey steps - maintain context | ||
| const result = await runNeedFulfillmentTest(scenario, chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 2, | ||
| timeoutMs: 60000, | ||
| }); | ||
| results.push(result); | ||
| } | ||
|
|
||
| // At least some conversations should happen | ||
| const withConversation = results.filter((r) => r.conversation.length > 0).length; | ||
| console.log(`\nJourney conversations: ${withConversation}/${results.length}`); | ||
|
|
||
| expect(withConversation).toBeGreaterThanOrEqual(2); | ||
| }, 240000); | ||
|
|
||
| it("should complete INTENT_LIFECYCLE journey", async () => { | ||
| const database = createStatefulMockDatabase(); | ||
| const chatAgent = createChatAgentAdapter(database); | ||
|
|
||
| const scenarios = await generator.generateJourneyScenario("INTENT_LIFECYCLE", "POWER_USER"); | ||
|
|
||
| console.log("\n=== INTENT_LIFECYCLE Journey ==="); | ||
|
|
||
| const results = []; | ||
| for (const scenario of scenarios) { | ||
| const result = await runNeedFulfillmentTest(scenario, chatAgent, { | ||
| verbose: true, | ||
| maxTurns: 2, | ||
| timeoutMs: 60000, | ||
| }); | ||
| results.push(result); | ||
| } | ||
|
|
||
| const withConversation = results.filter((r) => r.conversation.length > 0).length; | ||
| console.log(`\nJourney conversations: ${withConversation}/${results.length}`); | ||
|
|
||
| expect(withConversation).toBeGreaterThanOrEqual(2); | ||
| }, 240000); | ||
| }); | ||
|
|
||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
| // COMPREHENSIVE TEST SUITE | ||
| // ═══════════════════════════════════════════════════════════════════════════════ | ||
|
|
||
| describe("Comprehensive Test Suite", () => { | ||
| it("should run full test suite across all needs", async () => { | ||
| const database = createStatefulMockDatabase(); | ||
| const chatAgent = createChatAgentAdapter(database); | ||
| const generator = new ScenarioGenerator(); | ||
|
|
||
| // Only test 2 needs for speed | ||
| const testNeeds: UserNeedId[] = ["EXPRESS_WANT", "UNDERSTAND_SYSTEM"]; | ||
|
|
||
| const allScenarios: GeneratedScenario[] = []; | ||
| for (const needId of testNeeds) { | ||
| const scenarios = await generator.generateScenariosForNeed(needId, 1); | ||
| allScenarios.push(...scenarios); | ||
| } | ||
|
|
||
| console.log(`\nRunning ${allScenarios.length} scenarios...`); | ||
|
|
||
| const { results, summary } = await runTestSuite(allScenarios, chatAgent, { verbose: false }); | ||
|
|
||
| console.log("\n=== TEST SUITE SUMMARY ==="); | ||
| console.log(`Total: ${summary.total}`); | ||
| console.log(`Success: ${summary.success}`); | ||
| console.log(`Partial: ${summary.partial}`); | ||
| console.log(`Failure: ${summary.failure}`); | ||
| console.log(`Blocked: ${summary.blocked}`); | ||
|
|
||
| // At least some conversations should complete | ||
| expect(results.some((r) => r.conversation.length > 0)).toBe(true); | ||
| }, 300000); | ||
| }); |
🧩 Analysis chain
🏁 Scripts executed (repository: indexnetwork/index; command output collapsed):

```shell
# Find ScenarioGenerator and NeedFulfillmentEvaluator implementations
fd -t f "\.ts$" | xargs rg "class ScenarioGenerator|class NeedFulfillmentEvaluator" -l

# Look for the spec file and related test utilities
fd -t f "chat\.evaluator\.spec\.ts"

# Check imports in the test file to understand dependencies
head -50 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Search for createStatefulMockDatabase and related mock utilities
rg "createStatefulMockDatabase|createChatAgentAdapter" -l

# Check CI configuration
fd -t f "\.(yml|yaml|json)$" | xargs rg "OPENROUTER_API_KEY" -l 2>/dev/null || echo "No CI files found with OPENROUTER_API_KEY"

# Get line count of chat.evaluator.ts to understand its size
wc -l protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check ScenarioGenerator class definition and key methods
rg "class ScenarioGenerator|async generateScenariosForNeed|async generateMessage" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 10

# Look for LLM-related imports (OpenRouter, ChatOpenAI, etc.) in chat.evaluator.ts
rg "import.*openrouter|import.*ChatOpenAI|import.*LLM|from.*langchain" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check for environment variable usage in chat.evaluator.ts
rg "OPENROUTER_API_KEY|process\.env\." protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts

# Check if there's any describe.skip or test gating in the spec file
rg "describe\.skip|it\.skip|process\.env\.|hasOpenRouterKey" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Look for any mock implementations or jest mock setup
rg "jest\.mock|vi\.mock|mock|Mock" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check for .env.development or other env handling at start of spec
head -20 protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Verify NeedFulfillmentEvaluator also uses LLM
rg "class NeedFulfillmentEvaluator" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 15

# Check git history or comments about test intent (integration vs unit)
rg "integration|unit test|TODO|FIXME" protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts -B 2

# Check if there are any .env files in the repo
find . -name ".env*" -type f 2>/dev/null | head -20

# Look for CI configuration files (GitHub Actions, etc.)
fd -t f "(\.github|\.gitlab-ci|\.circleci|jenkins|travis)" | head -20

# Check if GitHub Actions workflow references these test files
find .github -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null | rg "chat\.evaluator|OPENROUTER" || echo "No GitHub Actions workflows found"

# Verify that env loading happens at top level and will fail if key missing
sed -n '1,20p' protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts

# Check how ChatOpenAI is instantiated - will it throw if apiKey is undefined?
rg "new ChatOpenAI" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -B 2 -A 8

# Verify if tests will fail when OPENROUTER_API_KEY is missing
rg "generateMessage|generateScenariosForNeed" protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts -A 5 | head -40
```
Gate or mock live LLM calls in tests to avoid CI failures.
These tests make direct external LLM calls via ScenarioGenerator, NeedFulfillmentEvaluator, and NeedFulfillmentTest, which instantiate ChatOpenAI with process.env.OPENROUTER_API_KEY. Without this environment variable, the tests will fail immediately. Since there's no CI configuration providing this key and no gating mechanism, these tests will block CI runs.
Gate these tests to run only when the API key is present, or inject a mock model for CI environments:
🔧 Example gating pattern (apply to all describe blocks)

```diff
+const hasOpenRouterKey = Boolean(process.env.OPENROUTER_API_KEY);
+const describeLLM = hasOpenRouterKey ? describe : describe.skip;

-describe("Scenario Generation", () => {
+describeLLM("Scenario Generation", () => {
   it("should generate diverse messages for EXPRESS_WANT need", async () => {
     ...
   });
 });
```

🤖 Prompt for AI Agents
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.spec.ts` around lines
191 - 447, The tests call real LLMs (via ScenarioGenerator,
NeedFulfillmentEvaluator, NeedFulfillmentTest and ChatOpenAI) and will fail in
CI without OPENROUTER_API_KEY; update the spec to gate or mock these calls by
checking process.env.OPENROUTER_API_KEY in each top-level describe (or in a
shared beforeAll) and skip the suite when absent, or inject a mock model/agent
from createChatAgentAdapter/runNeedFulfillmentTest/runTestSuite when the env var
is missing; ensure the gating uses the unique symbols ScenarioGenerator,
runNeedFulfillmentTest, runTestSuite, createChatAgentAdapter, and ChatOpenAI so
tests either use a deterministic mock implementation in CI or are skipped when
the API key is not available.
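If skipping in CI is not acceptable, the mock-injection route mentioned in the prompt could look roughly like this. The `ChatAgentInterface` shape is inferred from the spec and the sequence diagram rather than confirmed, and a stub agent alone does not remove the key requirement for scenario generation and judging, so treat this as a partial sketch.

```typescript
// Deterministic stand-in for the chat agent when no API key is available.
// The chat()/reset() signatures are inferred, not confirmed against ChatAgentInterface.
const stubAgent: ChatAgentInterface = {
  async reset() {
    // nothing to reset in the stub
  },
  async chat(_message: string) {
    return {
      response: "Got it - I've recorded that as an intent.",
      toolsUsed: ["create_intent"],
    };
  },
};

const hasOpenRouterKey = Boolean(process.env.OPENROUTER_API_KEY);
const chatAgent = hasOpenRouterKey
  ? createChatAgentAdapter(createStatefulMockDatabase())
  : stubAgent;
```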
```typescript
  if (options?.parallel) {
    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
    results.push(...(await Promise.all(promises)));
  } else {
    for (const scenario of scenarios) {
      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
    }
```
Parallel runTestSuite shares a mutable chatAgent and can corrupt results.
runNeedFulfillmentTest resets and mutates the shared agent; running it in Promise.all races state across scenarios. Consider requiring a factory that returns a fresh agent per scenario, or disable parallel mode.
🧯 Minimal safety guard

```diff
-  if (options?.parallel) {
-    const promises = scenarios.map((s) => runNeedFulfillmentTest(s, chatAgent, testOptions));
-    results.push(...(await Promise.all(promises)));
-  } else {
-    for (const scenario of scenarios) {
-      results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
-    }
-  }
+  if (options?.parallel) {
+    throw new Error("Parallel execution requires a dedicated ChatAgent per scenario");
+  }
+  for (const scenario of scenarios) {
+    results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
+  }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
  if (options?.parallel) {
    throw new Error("Parallel execution requires a dedicated ChatAgent per scenario");
  }
  for (const scenario of scenarios) {
    results.push(await runNeedFulfillmentTest(scenario, chatAgent, testOptions));
  }
```
🤖 Prompt for AI Agents
In `@protocol/src/lib/protocol/graphs/chat/chat.evaluator.ts` around lines 939 -
945, The parallel branch uses a shared mutable chatAgent with
runNeedFulfillmentTest which resets/mutates the agent and causes race
conditions; change the API or call site to supply a fresh agent per scenario
(e.g., accept an agentFactory() instead of chatAgent, or clone/initialize a new
agent for each scenario inside the parallel branch) and update the parallel path
to call agentFactory() for each Promise, or alternatively disable/throw when
options?.parallel is true and no factory is provided; refer to
runNeedFulfillmentTest and the local chatAgent variable to locate where to
inject the factory/clone logic or the validation that prevents unsafe parallel
execution.
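A sketch of the factory-based variant: the signature change is hypothetical and would have to be threaded through runTestSuite's callers, but it removes the shared-agent race without giving up parallelism.

```typescript
// Hypothetical API: accept a factory so every parallel scenario gets an isolated agent.
type ChatAgentFactory = () => ChatAgentInterface;

async function runScenariosInParallel(
  scenarios: GeneratedScenario[],
  agentFactory: ChatAgentFactory,
  testOptions?: { maxTurns?: number; timeoutMs?: number },
) {
  const promises = scenarios.map((scenario) =>
    // Each call gets its own agent (and therefore its own mock database/state).
    runNeedFulfillmentTest(scenario, agentFactory(), testOptions),
  );
  return Promise.all(promises);
}

// Usage sketch:
// const results = await runScenariosInParallel(scenarios, () =>
//   createChatAgentAdapter(createStatefulMockDatabase()));
```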
Summary by CodeRabbit
Release Notes
Documentation
Chores