
Conversation

@AkhileshNegi (Collaborator) commented Jan 20, 2026

Summary

Target issue is #520

Checklist

Before submitting a pull request, please ensure that you complete these tasks:

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

  • Enhanced evaluation score handling to preserve and merge existing scores with newly fetched data, improving data consistency.
  • Standardized evaluation score storage format for better compatibility.

Summary by CodeRabbit

  • New Features

    • Evaluation scores now cache and merge previous results with fresh fetches for more efficient score retrieval.
  • Improvements

    • Reorganized evaluation API endpoints for clearer, more structured routing and documentation.
  • Bug Fixes

    • Fixed score handling to preserve and merge summary data consistently across fetches.
  • Tests

    • Updated tests to assert the new summary_scores format for evaluation results.


@coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Splits the evaluations router into two directly registered routers (dataset and evaluation) and standardizes score storage on a summary_scores array. Adds cache-then-merge logic that combines existing summary_scores with freshly fetched Langfuse summary_scores (Langfuse taking precedence) and persists the merged result, including traces.

Changes

Cohort: API Routing Reorganization
Files: backend/app/api/main.py, backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/evaluations/dataset.py, backend/app/api/routes/evaluations/evaluation.py
Summary: Removes the package-level evaluations APIRouter export and registers the dataset and evaluation routers directly. dataset uses prefix="/evaluations/datasets", evaluation uses prefix="/evaluations"; both are tagged "Evaluation".
Cohort: Score & Trace Merge Logic
Files: backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/processing.py, backend/app/services/evaluations/evaluation.py
Summary: Changes the stored score shape to a summary_scores array; adds cache-then-fetch behavior that extracts existing summary_scores, fetches Langfuse data, merges the two (Langfuse precedence), includes traces in the final score, and saves the merged result.
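The cache-then-merge behavior can be sketched in plain Python. This is a minimal illustration of the merge semantics described above, not the actual implementation: the entry keys ("name", "avg") and the keying-by-name strategy are assumptions based on the test changes in this PR.

```python
from typing import Any

def merge_summary_scores(
    existing: list[dict[str, Any]],
    fetched: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Merge cached summary_scores with freshly fetched ones.

    Entries are keyed by score name; fetched (Langfuse) entries take
    precedence over cached entries with the same name. Hypothetical
    sketch of the merge logic, not the PR's actual code.
    """
    merged = {entry["name"]: entry for entry in existing}
    merged.update({entry["name"]: entry for entry in fetched})
    return list(merged.values())

# Example: the cached cosine_similarity avg is superseded by the
# freshly fetched value, and new score names are appended.
cached = [{"name": "cosine_similarity", "avg": 0.80}]
fresh = [
    {"name": "cosine_similarity", "avg": 0.85},
    {"name": "accuracy", "avg": 0.90},
]
final_score = {
    "summary_scores": merge_summary_scores(cached, fresh),
    "traces": [],  # traces fetched from Langfuse would be attached here
}
```

Keying by score name makes the merge idempotent: refetching and remerging the same Langfuse data leaves the stored score unchanged.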
Cohort: Tests Updated
Files: backend/app/tests/crud/evaluations/test_processing.py
Summary: Updates assertions to read the summary_scores list and check the cosine_similarity entry's avg value.
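The updated assertion style might look like the following. This is a hypothetical example of reading the new list-based format; the exact field names and values are assumptions, not the PR's actual test code.

```python
# Scores are now stored as a summary_scores list of named entries,
# so tests look up an entry by name instead of indexing a flat dict.
score = {
    "summary_scores": [
        {"name": "cosine_similarity", "avg": 0.85},
    ],
    "traces": [],
}

cosine = next(
    entry for entry in score["summary_scores"]
    if entry["name"] == "cosine_similarity"
)
assert cosine["avg"] == 0.85
```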

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/API Request
    participant API as API Router
    participant Svc as Evaluation Service
    participant CRUD as CRUD Layer
    participant DB as Database
    participant Langfuse as External API (Langfuse)

    Client->>API: GET /evaluations/:id[?get_trace_info]
    API->>Svc: get_evaluation_with_scores(params)
    Svc->>CRUD: get_evaluation_run(id)
    CRUD->>DB: Query eval_run
    DB-->>CRUD: eval_run (may include cached summary_scores/traces)
    CRUD-->>Svc: eval_run

    alt Need fetch or resync
        Svc->>Langfuse: Fetch traces & summary_scores
        Langfuse-->>Svc: langfuse_score (summary_scores + traces)
        Svc->>Svc: Extract existing_summary_scores from eval_run.score
        Svc->>Svc: Merge existing_summary_scores + langfuse_summary_scores (langfuse precedence)
        Svc->>Svc: Build final score {merged_summary_scores, traces}
        Svc->>CRUD: save_score(eval_run.id, final_score)
        CRUD->>DB: Update eval_run.score
        DB-->>CRUD: Confirm
        CRUD-->>Svc: persisted
    else Cached data sufficient
        Svc->>Svc: Return cached eval_run.score
    end

    Svc-->>Client: Evaluation with merged scores (+ traces if requested)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • Prajna1999

"I hopped through routers, quick and spry,
Merged old scores with Langfuse in the sky.
Traces stitched, summaries all aligned,
I nudged the paths and left no bind. 🥕"

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Evaluation: Fix score format' directly summarizes the main change (standardizing the evaluation score storage format and fixing how scores are handled) and accurately reflects the core modifications across multiple evaluation-related files.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.



@AkhileshNegi AkhileshNegi marked this pull request as ready for review January 21, 2026 04:49
@AkhileshNegi AkhileshNegi linked an issue Jan 21, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi added the bug Something isn't working label Jan 21, 2026
@AkhileshNegi AkhileshNegi self-assigned this Jan 21, 2026
@codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 47.36842% with 20 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/evaluations/core.py: 0.00% patch coverage, 11 lines missing ⚠️
  • backend/app/services/evaluations/evaluation.py: 43.75% patch coverage, 9 lines missing ⚠️


Review comment on this function signature:

    eval_run: EvaluationRun,
    langfuse: Langfuse,
    force_refetch: bool = False,
) -> dict[str, Any]:

Should we add strict type safety here?
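One way to tighten the dict[str, Any] return type, along the lines the reviewer suggests, is a TypedDict for the score payload. This is a sketch under assumptions: the field names mirror the summary_scores/traces shape from this PR, but the type and function names are hypothetical, not part of the codebase.

```python
# Hypothetical TypedDict giving the score payload a checked shape,
# so mypy/pyright can verify keys and value types that a plain
# dict[str, Any] would not. Names here are illustrative assumptions.
from typing import Any, TypedDict

class SummaryScore(TypedDict):
    name: str
    avg: float

class EvaluationScore(TypedDict):
    summary_scores: list[SummaryScore]
    traces: list[dict[str, Any]]

def build_score(
    summary_scores: list[SummaryScore],
    traces: list[dict[str, Any]],
) -> EvaluationScore:
    # At runtime this is still a plain dict; the benefit is entirely
    # in static checking of the returned structure.
    return {"summary_scores": summary_scores, "traces": traces}
```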


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Fix score format

3 participants