
Conversation

@AkhileshNegi (Collaborator) commented Jan 20, 2026

Summary

Target issue is #520

Checklist

Before submitting a pull request, please ensure that you complete these tasks:

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

  • Enhanced evaluation score handling to preserve and merge existing scores with newly fetched data, improving data consistency.
  • Standardized evaluation score storage format for better compatibility.

Summary by CodeRabbit

  • New Features

    • Evaluation scores now cache and merge previous results with fresh fetches for more efficient score retrieval.
  • Improvements

    • Reorganized evaluation API endpoints for clearer, more structured routing and documentation.
  • Bug Fixes

    • Fixed score handling to preserve and merge summary data consistently across fetches.
  • Tests

    • Updated tests to assert the new summary_scores format for evaluation results.


@coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Splits the evaluations router into two directly registered routers (dataset and evaluation) and standardizes score storage on a summary_scores array. Adds cache-then-merge logic that combines existing summary_scores with freshly fetched Langfuse summary_scores (Langfuse taking precedence) and persists the merged result, including traces.

Changes

Cohort: API Routing Reorganization
Files: backend/app/api/main.py, backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/evaluations/dataset.py, backend/app/api/routes/evaluations/evaluation.py
Summary: Removes the package-level evaluations APIRouter export and registers the dataset and evaluation routers directly. dataset uses prefix="/evaluations/datasets", evaluation uses prefix="/evaluations"; both are tagged "Evaluation".
Cohort: Score & Trace Merge Logic
Files: backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/processing.py, backend/app/services/evaluations/evaluation.py
Summary: Changes the stored score shape to a summary_scores array; adds cache-then-fetch behavior that extracts existing summary_scores, fetches Langfuse data, merges the two (Langfuse precedence), includes traces in the final score, and saves the merged result.
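The cache-then-merge behavior can be sketched in plain Python. This is a minimal illustration of the merge semantics described above, not the actual implementation: the entry keys ("name", "avg") and the keying-by-name strategy are assumptions based on the test changes in this PR.

```python
from typing import Any

def merge_summary_scores(
    existing: list[dict[str, Any]],
    fetched: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Merge cached summary_scores with freshly fetched ones.

    Entries are keyed by score name; fetched (Langfuse) entries take
    precedence over cached entries with the same name. Hypothetical
    sketch of the merge logic, not the PR's actual code.
    """
    merged = {entry["name"]: entry for entry in existing}
    merged.update({entry["name"]: entry for entry in fetched})
    return list(merged.values())

# Example: the cached cosine_similarity avg is superseded by the
# freshly fetched value, and new score names are appended.
cached = [{"name": "cosine_similarity", "avg": 0.80}]
fresh = [
    {"name": "cosine_similarity", "avg": 0.85},
    {"name": "accuracy", "avg": 0.90},
]
final_score = {
    "summary_scores": merge_summary_scores(cached, fresh),
    "traces": [],  # traces fetched from Langfuse would be attached here
}
```

Keying by score name makes the merge idempotent: refetching and remerging the same Langfuse data leaves the stored score unchanged.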
Cohort: Tests Updated
Files: backend/app/tests/crud/evaluations/test_processing.py
Summary: Updates assertions to read the summary_scores list and check the cosine_similarity entry's avg value.
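The updated assertion style might look like the following. This is a hypothetical example of reading the new list-based format; the exact field names and values are assumptions, not the PR's actual test code.

```python
# Scores are now stored as a summary_scores list of named entries,
# so tests look up an entry by name instead of indexing a flat dict.
score = {
    "summary_scores": [
        {"name": "cosine_similarity", "avg": 0.85},
    ],
    "traces": [],
}

cosine = next(
    entry for entry in score["summary_scores"]
    if entry["name"] == "cosine_similarity"
)
assert cosine["avg"] == 0.85
```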

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/API Request
    participant API as API Router
    participant Svc as Evaluation Service
    participant CRUD as CRUD Layer
    participant DB as Database
    participant Langfuse as External API (Langfuse)

    Client->>API: GET /evaluations/:id[?get_trace_info]
    API->>Svc: get_evaluation_with_scores(params)
    Svc->>CRUD: get_evaluation_run(id)
    CRUD->>DB: Query eval_run
    DB-->>CRUD: eval_run (may include cached summary_scores/traces)
    CRUD-->>Svc: eval_run

    alt Need fetch or resync
        Svc->>Langfuse: Fetch traces & summary_scores
        Langfuse-->>Svc: langfuse_score (summary_scores + traces)
        Svc->>Svc: Extract existing_summary_scores from eval_run.score
        Svc->>Svc: Merge existing_summary_scores + langfuse_summary_scores (langfuse precedence)
        Svc->>Svc: Build final score {merged_summary_scores, traces}
        Svc->>CRUD: save_score(eval_run.id, final_score)
        CRUD->>DB: Update eval_run.score
        DB-->>CRUD: Confirm
        CRUD-->>Svc: persisted
    else Cached data sufficient
        Svc->>Svc: Return cached eval_run.score
    end

    Svc-->>Client: Evaluation with merged scores (+ traces if requested)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • Prajna1999

"I hopped through routers, quick and spry,
Merged old scores with Langfuse in the sky.
Traces stitched, summaries all aligned,
I nudged the paths and left no bind. 🥕"

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Evaluation: Fix score format' directly summarizes the main change (standardizing the evaluation score storage format and fixing how scores are handled) and accurately reflects the core modifications across multiple evaluation-related files.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.



@AkhileshNegi AkhileshNegi marked this pull request as ready for review January 21, 2026 04:49
@AkhileshNegi AkhileshNegi linked an issue Jan 21, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi added the bug Something isn't working label Jan 21, 2026
@AkhileshNegi AkhileshNegi self-assigned this Jan 21, 2026
@codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 47.36842% with 20 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/evaluations/core.py: 0.00% patch coverage, 11 lines missing ⚠️
  • backend/app/services/evaluations/evaluation.py: 43.75% patch coverage, 9 lines missing ⚠️


Review comment on this function signature:

    eval_run: EvaluationRun,
    langfuse: Langfuse,
    force_refetch: bool = False,
) -> dict[str, Any]:

Should we add strict type safety here?
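One way to tighten the dict[str, Any] return type, along the lines the reviewer suggests, is a TypedDict for the score payload. This is a sketch under assumptions: the field names mirror the summary_scores/traces shape from this PR, but the type and function names are hypothetical, not part of the codebase.

```python
# Hypothetical TypedDict giving the score payload a checked shape,
# so mypy/pyright can verify keys and value types that a plain
# dict[str, Any] would not. Names here are illustrative assumptions.
from typing import Any, TypedDict

class SummaryScore(TypedDict):
    name: str
    avg: float

class EvaluationScore(TypedDict):
    summary_scores: list[SummaryScore]
    traces: list[dict[str, Any]]

def build_score(
    summary_scores: list[SummaryScore],
    traces: list[dict[str, Any]],
) -> EvaluationScore:
    # At runtime this is still a plain dict; the benefit is entirely
    # in static checking of the returned structure.
    return {"summary_scores": summary_scores, "traces": traces}
```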


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Fix score format

3 participants