
@nishika26 nishika26 commented Jan 20, 2026

Summary

Target issue is #489

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

  • New Features

    • Collections now include provider, name, and description; names are enforced unique per project. Uniqueness is checked at the router level, and if the name is not unique we return an error telling the user that the given name already exists.
    • Provider registry and an OpenAI provider added; provider-based create/delete flows available.
    • The create and delete collection services are completely provider agnostic; creation and deletion logic is dispatched according to the provider set by the user in the input.
    • Since Dalgo still creates assistants through these endpoints, and adding a required input like provider would be a breaking change for them, we internally check whether the model and instructions parameters are given; only then is the provider value set internally and written to the collection table, so they won't have to make any change for now.
  • Tests

    • Tests and helpers updated to exercise provider flows and new collection shape.
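The router-level uniqueness check described above can be sketched roughly as follows. `ensure_unique_name` and `exists_by_name` are the helper names this PR introduces, but the signatures and the in-memory CRUD stand-in here are assumptions for illustration only:

```python
class DuplicateNameError(Exception):
    """Raised when a collection name already exists in the project."""


def ensure_unique_name(crud, project_id: int, requested_name: str) -> str:
    # Hypothetical sketch: `crud.exists_by_name` mirrors the
    # CollectionCrud.exists_by_name method added in this PR.
    if crud.exists_by_name(project_id, requested_name):
        raise DuplicateNameError(
            f"A collection named '{requested_name}' already exists in this project."
        )
    return requested_name


class FakeCrud:
    """In-memory stand-in for CollectionCrud, for illustration only."""

    def __init__(self, names: set) -> None:
        self._names = set(names)

    def exists_by_name(self, project_id: int, name: str) -> bool:
        return name in self._names
```

In the real route, the raised error would be translated into an HTTP error response before any background job is enqueued.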

Summary by CodeRabbit

  • New Features

    • Collections now support custom names and descriptions
    • Unique collection names enforced per project
    • Provider-agnostic architecture enables future LLM service expansion
  • Refactor

    • Collections restructured to use project-level organization instead of organization-level
    • LLM provider abstraction layer introduced for centralized service management

✏️ Tip: You can customize this high-level summary in your review settings.


coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Walkthrough

Adds a provider abstraction and registry, migrates Collection schema to include provider/name/description and remove organization_id, updates create/delete flows to use providers, enforces per-project unique collection names, and updates tests/utilities to the provider-driven flow.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **DB Migration**<br>`backend/app/alembic/versions/041_extend_collection_table_for_provider_.py` | Adds `providertype` enum and `provider` column; adds `name` and `description`; applies unique constraint on (`project_id`, `name`); drops `organization_id` FK/column; includes downgrade steps. |
| **Models & Exports**<br>`backend/app/models/collection.py`, `backend/app/models/__init__.py`, `backend/app/models/organization.py` | Adds `ProviderType`; adds `provider`, `name`, `description` to `Collection`; removes `organization_id` and `Organization.collections` relationship; introduces `CollectionOptions` and updates `CreationRequest`; re-exports new types. |
| **Provider Layer**<br>`backend/app/services/collections/providers/` (`__init__.py`, `base.py`, `openai.py`, `registry.py`) | New provider API: `BaseProvider` interface; `OpenAIProvider` implements create/delete; registry/`LLMProvider` and `get_llm_provider` factory to instantiate providers from credentials. |
| **Collection Services & Helpers**<br>`backend/app/services/collections/create_collection.py`, `delete_collection.py`, `helpers.py` | Replace OpenAI-specific orchestration with provider-driven create/delete; add `get_service_name` and `ensure_unique_name`; remove legacy backout; `execute_job` includes provider handling and delegates remote ops to providers. |
| **API & CRUD**<br>`backend/app/api/routes/collections.py`, `backend/app/crud/collection/collection.py` | `create_collection` calls `ensure_unique_name` when name provided; `CollectionCrud` adds `exists_by_name`. |
| **Tests — Utilities**<br>`backend/app/tests/utils/collection.py`, `backend/app/tests/utils/llm_provider.py` | Rename helpers (`get_collection` → `get_assistant_collection`); add `get_vector_store_collection`; add `llm_provider` test utilities (mocks, id generators, mock provider); tests set `provider=ProviderType.openai`. |
| **Tests — Routes & Services**<br>multiple `backend/app/tests/...` | Tests updated to use `get_llm_provider` and mocked providers; assertions switched to `get_service_name("openai")`; creation/delete flows mock `provider.create`/`provider.delete`; import paths adjusted. |
| **Provider Tests**<br>`backend/app/tests/services/collections/providers/test_openai_provider.py` | New unit tests for `OpenAIProvider` covering vector-store-only creation, assistant creation, deletion, and error propagation. |
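The provider layer summarized above can be sketched as follows. This is a simplified illustration, not the PR's actual code: the names `BaseProvider`, `OpenAIProvider`, and `get_llm_provider` come from the walkthrough, but the signatures, the registry dict, and the return values are assumptions, and the per-project credential lookup is omitted:

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    """Minimal sketch of the provider interface described above."""

    def __init__(self, client: Any) -> None:
        self.client = client

    @abstractmethod
    def create(self, collection_request: Any) -> Any: ...

    @abstractmethod
    def delete(self, collection: Any) -> None: ...


class OpenAIProvider(BaseProvider):
    def create(self, collection_request: Any) -> Any:
        # A real implementation would call the OpenAI vector-store/assistant APIs.
        return {"llm_service_id": "vs_123", "llm_service_name": "openai vector store"}

    def delete(self, collection: Any) -> None:
        pass


_REGISTRY: dict[str, type[BaseProvider]] = {"openai": OpenAIProvider}


def get_llm_provider(provider: str, client: Any) -> BaseProvider:
    # The real factory also fetches per-project credentials before
    # instantiating the provider; that step is omitted here.
    try:
        provider_class = _REGISTRY[provider.lower()]
    except KeyError:
        raise ValueError(f"Provider '{provider}' is not supported.") from None
    return provider_class(client=client)
```

Adding a new backend then amounts to implementing `BaseProvider` and registering the class, which is the extensibility the walkthrough highlights.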

Sequence Diagram(s)

sequenceDiagram
    participant Client as API Client
    participant Route as Collections Route
    participant Registry as Provider Registry
    participant Provider as Provider Impl
    participant Storage as Cloud Storage
    participant DB as Database

    Client->>Route: POST /collections (name, documents, provider)
    Route->>Route: ensure_unique_name(project_id, name)
    Route->>DB: enqueue create job / persist job record
    Route->>Registry: get_llm_provider(provider, project_id, org_id)
    Registry->>DB: fetch provider credentials
    Registry->>Provider: instantiate provider
    Route->>Provider: provider.create(request, storage, document_crud)
    Provider->>Provider: batch_documents()
    Provider->>Storage: create/upload vector store
    alt with assistant
        Provider->>Storage: create assistant
    end
    Provider-->>Route: return llm_service_id & llm_service_name
    Route->>DB: persist Collection (provider, name, description, ids)
    Route-->>Client: send success callback / job completion
sequenceDiagram
    participant Client as API Client
    participant Route as Collections Route
    participant Registry as Provider Registry
    participant Provider as Provider Impl
    participant Storage as Cloud Storage
    participant DB as Database

    Client->>Route: DELETE /collections/{id}
    Route->>DB: fetch Collection (provider, llm_service_name, ids)
    Route->>DB: enqueue delete job
    Route->>Registry: get_llm_provider(collection.provider, project_id, org_id)
    Registry->>DB: fetch provider credentials
    Registry->>Provider: instantiate provider
    Route->>Provider: provider.delete(collection)
    alt llm_service_name indicates assistant
        Provider->>Storage: delete assistant
    else
        Provider->>Storage: delete vector store
    end
    Provider-->>Route: deletion success
    Route->>DB: remove Collection record
    Route-->>Client: send success callback
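The delete sequence above reduces to a fixed ordering: remote cleanup first, local record removal only after the provider reports success. A minimal sketch with stand-in objects (all names here are hypothetical; the real flow resolves the provider via `get_llm_provider` and runs inside a background job):

```python
class FakeProvider:
    """Stand-in for a provider implementation (e.g. the OpenAI one)."""

    def __init__(self) -> None:
        self.deleted: list = []

    def delete(self, collection: dict) -> None:
        # Real providers delete the remote assistant or vector store here.
        self.deleted.append(collection["id"])


class FakeSession:
    """Stand-in for the database session holding collection rows."""

    def __init__(self, rows: dict) -> None:
        self.rows = rows

    def remove(self, collection_id: str) -> None:
        del self.rows[collection_id]


def execute_delete_job(
    session: FakeSession, provider: FakeProvider, collection: dict
) -> dict:
    # Remote cleanup first; the local record is removed only after the
    # provider succeeds, mirroring the ordering in the diagram.
    provider.delete(collection)
    session.remove(collection["id"])
    return {"status": "success", "collection_id": collection["id"]}
```

If `provider.delete` raises, the local row survives and the job can be marked failed, which is the behavior the delete tests below exercise.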

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Prajna1999
  • AkhileshNegi

Poem

🐇 I hopped through migrations, schema in tow,
Providers lined up, ready to grow.
Names checked at the gate, no duplicates stay,
Mocked providers dance, tests join the play.
A rabbit's cheer — modular code on display.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.22%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title contains a typo ('Colelction' instead of 'Collection') and is only partially related to the main change: it mentions provider agnosticism, a key aspect, but is somewhat vague. | Fix the typo ('Colelction' → 'Collection') and consider a more specific title such as 'Collection: Refactor to support provider-agnostic creation and deletion' to better convey the scope of changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@nishika26 nishika26 marked this pull request as ready for review January 20, 2026 06:17
@nishika26 nishika26 changed the title from "Enhancement/collection provider agnostic" to "Colelction: making the module provider agnostic" Jan 20, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/tests/api/routes/test_assistants.py (1)

1-10: Remove the duplicate patch import.
Line 9 redefines patch already imported on Line 1.

🧹 Proposed fix
-from unittest.mock import patch
 from typing import Any
 from uuid import uuid4
@@
-from unittest.mock import patch
 from app.tests.utils.llm_provider import mock_openai_assistant
backend/app/tests/crud/collections/collection/test_crud_collection_read_all.py (1)

18-24: Double creation of collection will cause errors.

get_assistant_collection already persists the collection to the database (per the relevant snippet, it calls CollectionCrud(db, project.id).create(collection)). Calling crud.create(collection, documents) on line 24 attempts to create the same collection again, which will likely cause a primary key constraint violation or duplicate entries.

🐛 Proposed fix: use the returned collection without re-creating
 def create_collections(db: Session, n: int) -> Collection:
     crud = None
     project = get_project(db)
     openai_mock = OpenAIMock()
     with openai_mock.router:
         client = OpenAI(api_key="sk-test-key")
         for _ in range(n):
             collection = get_assistant_collection(db, project=project)
             store = DocumentStore(db, project_id=collection.project_id)
             documents = store.fill(1)
             if crud is None:
                 crud = CollectionCrud(db, collection.project_id)
-            crud.create(collection, documents)
+            # Collection is already created by get_assistant_collection
+            # If documents need to be associated, use a different method
+            # or modify get_assistant_collection to accept documents

         return crud.project_id
backend/app/tests/services/collections/test_delete_collection.py (1)

1-12: Critical: Missing Session import causing pipeline failure.

The pipeline failure indicates Session is not defined. The test functions use db: Session type hints but Session is not imported.

Proposed fix
 from typing import Any
 from unittest.mock import patch, MagicMock
 from uuid import uuid4, UUID
 
+from sqlmodel import Session
 
 from app.models.collection import DeletionRequest
-
 from app.tests.utils.utils import get_project
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py`:
- Around line 31-40: The migration adds a NOT NULL column "provider" via
op.add_column using provider_type which will fail on existing rows; change the
migration to add the column as nullable (set nullable=True) or with a
server_default, then run the UPDATE (op.execute("UPDATE collection SET provider
= 'OPENAI' WHERE provider IS NULL")), and finally call
op.alter_column("collection", "provider", existing_type=provider_type,
nullable=False) to enforce NOT NULL; reference the op.add_column call that uses
provider_type and the op.execute update and add the op.alter_column step (or
alternatively add server_default in op.add_column) to fix the ordering and avoid
migration failures.

In `@backend/app/models/collection.py`:
- Around line 31-37: The provider field on the Collection model is incorrectly
defined as a one-element tuple and uses a set for sa_column_kwargs; update the
ProviderType Field declaration (the provider attribute) to return just
Field(...) (remove the trailing comma and surrounding parentheses so it's not a
tuple) and change sa_column_kwargs from {"LLM provider used for this
collection"} to {"comment": "LLM provider used for this collection"} so
SQLAlchemy gets a proper dict comment.

In `@backend/app/services/collections/create_collection.py`:
- Around line 158-161: Replace the explicit equality check against True with a
direct truthiness test: locate the conditional using with_assistant (the if
statement that currently reads with_assistant == True) in create_collection
(affecting creation_request.provider assignment) and change it to use if
with_assistant: (removing the unnecessary parentheses/equality comparison) so
the block sets creation_request.provider = "openai" when with_assistant is
truthy.

In `@backend/app/services/collections/providers/base.py`:
- Around line 22-28: The __init__ method in the provider base class is missing a
return type annotation; update the constructor signature (def __init__(self,
client: Any)) to include an explicit return type of None (def __init__(self,
client: Any) -> None:) so it follows the project's type-hinting guidelines while
leaving the client parameter and assignment (self.client = client) unchanged.

In `@backend/app/services/collections/providers/openai.py`:
- Around line 44-47: The log message prefixes in OpenAIProvider's create method
are incorrect—replace instances of "[OpenAIProvider.execute]" with
"[OpenAIProvider.create]" so the log prefix matches the method name; update all
logger calls inside the create method (the logger.info/error lines that include
vector_store_id, batches, and any other similar messages) to use the correct
"[OpenAIProvider.create]" tag.
- Around line 120-124: The cleanup method is calling _backout with the wrong
arguments; instead of passing result.llm_service_id and result.llm_service_name
it must instantiate the same CRUD class used in delete (based on
result.llm_service_name) and pass that CRUD instance plus the llm_service_id to
_backout; update cleanup to mirror delete’s pattern: determine which CRUD to
construct (e.g., OpenAIAssistantCRUD, PineconeVectorStoreCRUD, etc.) using
result.llm_service_name, create the CRUD instance, then call _backout(crud,
result.llm_service_id).

In
`@backend/app/tests/crud/collections/collection/test_crud_collection_read_one.py`:
- Around line 15-20: The test creates an unused OpenAI instance assigned to the
variable client (OpenAI(api_key="sk-test-key")) inside the openai_mock.router
context; remove that unused assignment or, if the OpenAI constructor must run
for its side effects, change the assignment to _ (i.e., assign to a throwaway
variable) so client is not defined but the constructor still runs; update the
code around the OpenAIMock context where client is currently created.

In `@backend/app/tests/utils/collection.py`:
- Around line 69-75: The Collection creation hardcodes llm_service_name via
get_service_name("openai") which mismatches the dynamic provider argument;
update the llm_service_name assignment in the Collection constructor to derive
the service name from the provider parameter (e.g., call get_service_name with
provider or a normalized form of provider) so llm_service_name aligns with the
provider value used for provider and llm_service_id.

In `@backend/app/tests/utils/llm_provider.py`:
- Around line 22-26: The function mock_openai_assistant currently uses a mutable
default for vector_store_ids which can lead to surprising statefulness; change
the signature to use vector_store_ids: Optional[list[str]] = None and inside
mock_openai_assistant initialize vector_store_ids = ["vs_1", "vs_2"] if
vector_store_ids is None (preserving the Optional[list[str]] type hint), so
callers get a fresh list each invocation and you avoid mutating a module-level
default.
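The migration fix described for `041_extend_collection_table_for_provider_.py` follows the standard pattern for adding a NOT NULL column to a populated table: add it nullable, backfill, then tighten. A runnable sketch using a recording stand-in for Alembic's `op` object (the real migration imports `op` from `alembic`, uses a `sa.Enum` column type, and has the full `add_column`/`alter_column` signatures; the constraint name below is hypothetical):

```python
class FakeOp:
    """Records migration operations; a stand-in for alembic.op."""

    def __init__(self) -> None:
        self.calls = []

    def add_column(self, table: str, column: str, nullable: bool) -> None:
        self.calls.append(("add_column", table, column, nullable))

    def execute(self, sql: str) -> None:
        self.calls.append(("execute", sql))

    def alter_column(self, table: str, column: str, nullable: bool) -> None:
        self.calls.append(("alter_column", table, column, nullable))

    def create_unique_constraint(self, name: str, table: str, columns: list) -> None:
        self.calls.append(("create_unique_constraint", name, table, tuple(columns)))


def upgrade(op: FakeOp) -> None:
    # 1. Add the column as nullable so existing rows don't violate NOT NULL.
    op.add_column("collection", "provider", nullable=True)
    # 2. Backfill existing rows with a default provider.
    op.execute("UPDATE collection SET provider = 'OPENAI' WHERE provider IS NULL")
    # 3. Only now enforce NOT NULL.
    op.alter_column("collection", "provider", nullable=False)
    # 4. Per-project uniqueness: composite constraint, not a global one on name.
    op.create_unique_constraint(
        "uq_collection_project_name", "collection", ["project_id", "name"]
    )
```

An alternative to steps 1-3 is passing `server_default` to `op.add_column`, which lets the column be created NOT NULL in one step.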
🧹 Nitpick comments (13)
backend/app/services/collections/providers/base.py (1)

37-51: Align create/delete/cleanup docstrings with the actual signature/return.
The current docs mention args (batch_size, with_assistant, assistant_options) and returns (llm_service_id, llm_service_name) that don’t exist in the signature and can mislead implementers.

📝 Suggested docstring cleanup
-        Args:
-            collection_request: Collection parameters (name, description, document list, etc.)
-            storage: Cloud storage instance for file access
-            document_crud: DocumentCrud instance for fetching documents
-            batch_size: Number of documents to process per batch
-            with_assistant: Whether to create an assistant/agent
-            assistant_options: Options for assistant creation (provider-specific)
-
-        Returns:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+        Args:
+            collection_request: Collection parameters (name, description, document list, etc.)
+            storage: Cloud storage instance for file access
+            document_crud: DocumentCrud instance for fetching documents
+
+        Returns:
+            Collection created by the provider
backend/app/tests/crud/collections/collection/test_crud_collection_create.py (1)

18-24: Prefer ProviderType.OPENAI over a raw string.

Using the enum keeps values consistent with the model definition and avoids casing drift.

♻️ Suggested tweak
-from app.models import DocumentCollection, Collection
+from app.models import DocumentCollection, Collection, ProviderType
@@
-            provider="OPENAI",
+            provider=ProviderType.OPENAI,
backend/app/services/collections/providers/registry.py (1)

44-71: Unreachable else branch: LLMProvider.get() already raises for unsupported providers.

The else branch (lines 65-69) can never execute because LLMProvider.get(provider) on line 47 already raises a ValueError for any unsupported provider. This is dead code.

Consider removing the else branch or restructuring the logic to avoid the redundant check:

♻️ Proposed refactor
 def get_llm_provider(
     session: Session, provider: str, project_id: int, organization_id: int
 ) -> BaseProvider:
     provider_class = LLMProvider.get(provider)
 
     credentials = get_provider_credential(
         session=session,
         provider=provider,
         project_id=project_id,
         org_id=organization_id,
     )
 
     if not credentials:
         raise ValueError(
             f"Credentials for provider '{provider}' not configured for this project."
         )
 
     if provider == LLMProvider.OPENAI:
         if "api_key" not in credentials:
             raise ValueError("OpenAI credentials not configured for this project.")
         client = OpenAI(api_key=credentials["api_key"])
-    else:
-        logger.error(
-            f"[get_llm_provider] Unsupported provider type requested: {provider}"
-        )
-        raise ValueError(f"Provider '{provider}' is not supported.")
+    else:
+        # Future providers will be handled here
+        raise NotImplementedError(f"Client creation for '{provider}' not implemented.")
 
     return provider_class(client=client)

Alternatively, when adding new providers, this pattern ensures extensibility while keeping the else branch meaningful.

backend/app/tests/crud/collections/collection/test_crud_collection_delete.py (1)

12-14: Add proper type hints for function parameters.

As per coding guidelines, all function parameters should have type hints. The client parameter lacks a type hint, and project_id should be Optional[int] since it can be None.

♻️ Proposed fix
+from typing import Optional
+
-def get_assistant_collection_for_delete(
-    db: Session, client=None, project_id: int = None
-) -> Collection:
+def get_assistant_collection_for_delete(
+    db: Session, client: Optional[OpenAI] = None, project_id: Optional[int] = None
+) -> Collection:
backend/app/services/collections/helpers.py (2)

20-27: Consider raising an error or logging for unknown providers.

get_service_name silently returns an empty string for unknown providers. This could lead to subtle bugs where an invalid provider name goes unnoticed. Consider either raising an error or logging a warning.

♻️ Suggested improvement
 def get_service_name(provider: str) -> str:
     """Get the collection service name for a provider."""
     names = {
         "openai": "openai vector store",
         #   "bedrock": "bedrock knowledge base",
         #  "gemini": "gemini file search store",
     }
-    return names.get(provider.lower(), "")
+    name = names.get(provider.lower())
+    if name is None:
+        logger.warning(f"[get_service_name] Unknown provider: {provider}")
+        return ""
+    return name

116-132: Consider using Session type instead of SessionDep for helper functions.

SessionDep is typically a FastAPI dependency type alias (e.g., Annotated[Session, Depends(get_db)]) meant for route handlers. For a standalone helper function called from various places, using the plain Session type is more appropriate and avoids coupling to FastAPI's dependency injection.

♻️ Proposed fix
+from sqlmodel import Session
+
 def ensure_unique_name(
-    session: SessionDep,
+    session: Session,
     project_id: int,
     requested_name: str,
 ) -> str:
backend/app/alembic/versions/041_extend_collection_table_for_provider_.py (1)

67-67: Use explicit constraint name for unique constraint.

Using None as the constraint name causes Alembic to auto-generate a name, which may differ across environments and complicate future migrations. The downgrade at line 98 already expects collection_name_key - ensure consistency.

♻️ Proposed fix
-    op.create_unique_constraint(None, "collection", ["name"])
+    op.create_unique_constraint("collection_name_key", "collection", ["name"])
backend/app/tests/services/collections/test_create_collection.py (2)

117-119: Add return type annotation to test function.

Per coding guidelines, all functions should have type hints. The db parameter also needs a type hint.

Proposed fix
 def test_execute_job_success_flow_updates_job_and_creates_collection(
-    mock_get_llm_provider, db: Session
-):
+    mock_get_llm_provider: MagicMock, db: Session
+) -> None:

193-195: Add type hints to test function parameters.

Per coding guidelines, add type hints to all function parameters and return values.

Proposed fix
 def test_execute_job_assistant_create_failure_marks_failed_and_deletes_collection(
-    mock_get_llm_provider, db
-):
+    mock_get_llm_provider: MagicMock, db: Session
+) -> None:
backend/app/tests/utils/llm_provider.py (1)

176-179: Add return type annotation.

Per coding guidelines, add return type hints to all functions.

Proposed fix
 def get_mock_provider(
     llm_service_id: str = "mock_service_id",
     llm_service_name: str = "mock_service_name",
-):
+) -> MagicMock:
backend/app/services/collections/create_collection.py (1)

142-142: Unused task_instance parameter.

The task_instance parameter is flagged as unused by static analysis. If this is required for the Celery task interface signature, consider adding a comment to clarify this, or prefixing with underscore (_task_instance) to indicate intentional non-use.

backend/app/tests/services/collections/test_delete_collection.py (1)

74-77: Add type hints to test function parameters.

Per coding guidelines, add type hints to all function parameters.

Proposed fix
 @patch("app.services.collections.delete_collection.get_llm_provider")
 def test_execute_job_delete_success_updates_job_and_calls_delete(
-    mock_get_llm_provider, db
-):
+    mock_get_llm_provider: MagicMock, db: Session
+) -> None:
backend/app/models/collection.py (1)

171-177: Align ProviderOptions with ProviderType enum for consistency.

The provider field in ProviderOptions uses Literal["openai"] while the Collection model uses the ProviderType enum. Since ProviderType is already defined for this purpose and the Collection model's description anticipates multiple providers ("openai", "bedrock", "gemini", etc.), use ProviderType instead of Literal to maintain consistency and prepare for future provider support.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/services/collections/test_delete_collection.py`:
- Around line 73-76: Update the test function signatures to include typing:
annotate the patched mock parameter as MagicMock, the db parameter as Session,
and add an explicit return type -> None for the listed tests
(test_execute_job_delete_success_updates_job_and_calls_delete,
test_execute_job_delete_failure_marks_job_failed,
test_execute_job_delete_success_with_callback_sends_success_payload,
test_execute_job_delete_remote_failure_with_callback_sends_failure_payload);
import typing symbols if missing (from unittest.mock import MagicMock and from
sqlalchemy.orm import Session) and apply the annotations to the mock and db
parameters and the function return type for each test.


codecov bot commented Jan 20, 2026


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/services/collections/helpers.py (1)

83-102: Add missing type hints and fix function name typo

The function pick_service_for_documennt has a typo in the name and is missing type hint annotations for session, a_crud, and v_crud parameters, as well as a return type annotation. Per coding guidelines, all function parameters and return values require type hints.

Required changes
-def pick_service_for_documennt(session, doc_id: UUID, a_crud, v_crud):
+def pick_service_for_document(session: SessionDep, doc_id: UUID, a_crud: OpenAIAssistantCrud, v_crud: OpenAIVectorStoreCrud) -> Union[OpenAIAssistantCrud, OpenAIVectorStoreCrud]:

Update imports in backend/app/api/routes/documents.py:

  • Change import from pick_service_for_documennt to pick_service_for_document (line 31)
  • Change both function calls from pick_service_for_documennt to pick_service_for_document (lines 190, 223)
backend/app/models/collection.py (2)

87-105: Add explicit return type to model_post_init.

Return type is missing, which violates the repo’s type-hint requirement.

🔧 Suggested fix
-    def model_post_init(self, __context: Any):
+    def model_post_init(self, __context: Any) -> None:
         self.documents = list(set(self.documents))

As per coding guidelines, please add explicit return types.


179-188: Add return type (and a concrete class type) for extract_super_type.

This method yields tuples but has no return type hint.

🔧 Suggested fix
-from typing import Any, Literal
+from typing import Any, Literal, Iterator
@@
-    def extract_super_type(self, cls: "CreationRequest"):
+    def extract_super_type(self, cls: type[SQLModel]) -> Iterator[tuple[str, Any]]:
         for field_name in cls.model_fields.keys():
             field_value = getattr(self, field_name)
             yield (field_name, field_value)

As per coding guidelines, please add explicit return types.

🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py`:
- Line 68: Replace the global unique constraint on the collection name with a
composite unique constraint for per-project uniqueness: update the migration
call to op.create_unique_constraint (the one currently creating a unique
constraint on "collection" with ["name"]) so it instead creates the constraint
on ["project_id","name"] to match the application logic enforced by
ensure_unique_name and CollectionCrud.exists_by_name.

In `@backend/app/services/collections/create_collection.py`:
- Around line 135-143: Annotate the unused job runner arg by importing
typing.Any and changing the execute_job signature to declare task_instance: Any,
then explicitly mark it as intentionally unused inside the function (e.g. assign
it to _ or add a comment like "_ = task_instance  # intentionally unused") so
Ruff stops flagging it; keep other parameter annotations and the return type
as-is and reference the execute_job function and the task_instance parameter
when making the change.

In `@backend/app/services/collections/providers/openai.py`:
- Around line 16-21: The OpenAIProvider.__init__ duplicates assigning
self.client after BaseProvider.__init__ already sets it and the constructor
lacks the return type hint; remove the redundant "self.client = client"
assignment in the OpenAIProvider.__init__ and update the method signature to
include the None return type (i.e., def __init__(self, client: OpenAI) -> None:)
while leaving the call to super().__init__(client) intact.

In `@backend/app/tests/services/collections/providers/test_openai.py`:
- Around line 1-13: The test module name collides with another test (causing
pytest collection errors); rename the test file from test_openai.py to a unique
name like test_openai_provider.py, update any references/imports if present, and
rerun test discovery to confirm no duplicate modules; focus on the module that
tests OpenAIProvider and helpers (OpenAIProvider, generate_openai_id,
get_mock_openai_client_with_vector_store) so their imports still resolve after
renaming.
♻️ Duplicate comments (3)
backend/app/tests/services/collections/test_delete_collection.py (2)

73-76: Missing type hint for db parameter.

The db parameter is missing the Session type annotation. This was flagged in a previous review and should be addressed for consistency with coding guidelines.

 @patch("app.services.collections.delete_collection.get_llm_provider")
 def test_execute_job_delete_success_updates_job_and_calls_delete(
-    mock_get_llm_provider: MagicMock, db
+    mock_get_llm_provider: MagicMock, db: Session
 ) -> None:

139-142: Missing Session type hint on db parameter in remaining test functions.

Apply the same fix to the other test functions:

  • test_execute_job_delete_failure_marks_job_failed (line 141)
  • test_execute_job_delete_success_with_callback_sends_success_payload (line 216)
  • test_execute_job_delete_remote_failure_with_callback_sends_failure_payload (line 298)

Also applies to: 213-217, 293-299

backend/app/models/collection.py (1)

31-37: Fix provider Field tuple wrapping.

Line 31 wraps Field(...) in a one-element tuple, which breaks SQLModel field definition and schema mapping.

🐛 Proposed fix
-    provider: ProviderType = (
-        Field(
-            nullable=False,
-            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
-            sa_column_kwargs={"comment": "LLM provider used for this collection"},
-        ),
-    )
+    provider: ProviderType = Field(
+        nullable=False,
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        sa_column_kwargs={"comment": "LLM provider used for this collection"},
+    )


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/services/collections/providers/test_openai_provider.py`:
- Around line 98-113: The test test_delete_openai_assistant creates a Collection
without required fields; update the Collection instantiation used in
test_delete_openai_assistant (or replace with the test factory from
backend/app/tests/utils/collection.py) to include the non-nullable fields
provider and project_id (e.g., provider=ProviderType.OPENAI or "OPENAI" and a
valid project_id int), so the Collection(...) passed to OpenAIProvider.delete is
fully valid; keep the rest of the test (MagicMock client, OpenAIProvider,
patching OpenAIAssistantCrud, and the final assert) unchanged.
♻️ Duplicate comments (1)
backend/app/tests/utils/llm_provider.py (1)

22-26: Mutable default argument detected.

Using a mutable list as a default argument can cause unexpected behavior. Replace with None and initialize within the function.

Proposed fix
 def mock_openai_assistant(
     assistant_id: str = "assistant_mock",
-    vector_store_ids: Optional[list[str]] = ["vs_1", "vs_2"],
+    vector_store_ids: Optional[list[str]] = None,
     max_num_results: int = 30,
 ) -> OpenAIAssistant:
+    if vector_store_ids is None:
+        vector_store_ids = ["vs_1", "vs_2"]
     return OpenAIAssistant(
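The hazard behind this is easy to reproduce with a toy function (`collect` is illustrative, not from the codebase): the default list is created once at definition time and then shared by every call that omits the argument.

```python
def collect(item: str, bucket: list[str] = []) -> list[str]:
    # Bug on purpose: the default list is evaluated once, when the
    # function is defined, and shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket

first = collect("vs_1")
second = collect("vs_2")
print(second)           # ['vs_1', 'vs_2'] — state leaked between calls
print(first is second)  # True — both names point at the same list
```

The `None`-default pattern in the proposed fix avoids this by building a fresh list on each call.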
🧹 Nitpick comments (2)
backend/app/tests/utils/llm_provider.py (2)

176-179: Add return type hint to get_mock_provider.

Per coding guidelines, all function return values should have type hints.

Proposed fix
 def get_mock_provider(
     llm_service_id: str = "mock_service_id",
     llm_service_name: str = "mock_service_name",
-):
+) -> MagicMock:

111-134: Return type hint already present.

get_mock_openai_client_with_vector_store already declares -> MagicMock, so it satisfies the typing guideline. No change needed. LGTM.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/api/routes/collections/test_collection_info.py (1)

194-230: Undefined name get_collection will cause test failure.

Line 204 calls get_collection(db, project), but this function is not imported. The imports at lines 9-12 only include get_assistant_collection and get_vector_store_collection. This will raise a NameError at runtime.

Additionally, line 197 is missing the type hint for user_api_key_header (should be dict[str, str] per coding guidelines).

🐛 Proposed fix
 def test_collection_info_include_docs_and_url(
     client: TestClient,
     db: Session,
-    user_api_key_header,
+    user_api_key_header: dict[str, str],
 ) -> None:
     """
     Test that when include_docs=true and include_url=true,
     the endpoint returns documents with their URLs.
     """
     project = get_project(db, "Dalgo")
-    collection = get_collection(db, project)
+    collection = get_assistant_collection(db, project)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/models/collection.py`:
- Around line 15-19: ProviderType enum values are uppercase and clash with
ProviderOptions.provider/CreationRequest which expect lowercase literals; change
the enum members in ProviderType (e.g., ProviderType.OPENAI) to use lowercase
strings (e.g., "openai") so Pydantic validation matches the Literal["openai"]
used by ProviderOptions/CreationRequest, and ensure any other members follow the
same lowercase convention.
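The convention the comment asks for can be sketched in isolation (names mirror the PR for illustration only): what has to line up with `Literal["openai"]` is the enum member's *value*, and a `str`-based enum member also compares equal to that value directly.

```python
from enum import Enum
from typing import Literal, get_args


class ProviderType(str, Enum):
    # Lowercase value so the enum lines up with the Literal["openai"]
    # annotations used by the request models.
    openai = "openai"


ALLOWED = get_args(Literal["openai"])

# The member's value is what must match the Literal's strings, and a
# str-based enum member compares equal to its own value.
value_matches = ProviderType.openai.value in ALLOWED
member_equals_str = ProviderType.openai == "openai"
```

With uppercase values such as `"OPENAI"`, both checks fail, which is exactly the validation mismatch flagged above.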
♻️ Duplicate comments (1)
backend/app/models/collection.py (1)

40-46: Remove tuple wrapping around the provider Field.

This still creates a one‑element tuple instead of a Field, which breaks SQLModel field definition. This was flagged previously and appears unchanged.

🐛 Proposed fix
-    provider: ProviderType = (
-        Field(
-            nullable=False,
-            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
-            sa_column_kwargs={"comment": "LLM provider used for this collection"},
-        ),
-    )
+    provider: ProviderType = Field(
+        nullable=False,
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        sa_column_kwargs={"comment": "LLM provider used for this collection"},
+    )
🧹 Nitpick comments (1)
backend/app/models/collection.py (1)

112-113: Add missing type hints to comply with project typing rules.

Both model_post_init and extract_super_type are missing return/parameter typing.

♻️ Proposed fix
-from typing import Any, Literal
+from typing import Any, Literal, Iterator

-    def model_post_init(self, __context: Any):
+    def model_post_init(self, __context: Any) -> None:
         self.documents = list(set(self.documents))

-    def extract_super_type(self, cls: "CreationRequest"):
+    def extract_super_type(self, cls: type[SQLModel]) -> Iterator[tuple[str, Any]]:
         for field_name in cls.model_fields.keys():
             field_value = getattr(self, field_name)
             yield (field_name, field_value)

Please verify this still satisfies the project’s typing checks (e.g., mypy/pyright). As per coding guidelines.

Also applies to: 193-196


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/api/routes/collections/test_collection_info.py (1)

194-204: Missing type hint for user_api_key_header parameter.

Line 197 is missing the type annotation dict[str, str] that is consistently applied to the same parameter in other test functions in this file (lines 43, 83, 113, 148).

Proposed fix
 def test_collection_info_include_docs_and_url(
     client: TestClient,
     db: Session,
-    user_api_key_header,
+    user_api_key_header: dict[str, str],
 ) -> None:

As per coding guidelines, always add type hints to all function parameters.

@nishika26 nishika26 self-assigned this Jan 20, 2026
@nishika26 nishika26 added enhancement New feature or request ready-for-review labels Jan 20, 2026
@nishika26 nishika26 linked an issue Jan 20, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/app/services/collections/delete_collection.py (1)

141-149: Add a type hint for task_instance (and mark it unused).

This param is untyped and unused; add a type hint and explicitly ignore it to satisfy typing rules and linting. As per coding guidelines, please add explicit type hints and mark unused params.

🔧 Suggested fix
+from typing import Any
@@
 def execute_job(
     request: dict,
     project_id: int,
     organization_id: int,
     task_id: str,
     job_id: str,
     collection_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
     """Celery worker entrypoint for deleting a collection (both remote and local)."""
+    _ = task_instance  # reserved for task runner hooks; unused for now
backend/app/models/collection.py (1)

112-113: Add explicit -> None return type to model_post_init.

The model_post_init method is missing its return type annotation. Per coding guidelines and Pydantic conventions, all function return types must be explicitly annotated.

Suggested fix
-    def model_post_init(self, __context: Any):
+    def model_post_init(self, __context: Any) -> None:
         self.documents = list(set(self.documents))
backend/app/services/collections/create_collection.py (1)

165-190: Move provider.create() inside the active session context.

document_crud is created inside the with Session(engine) block but passed to provider.create(), which is called outside the block. The provider.create() method calls batch_documents(), which in turn calls document_crud.read_each(). Since DocumentCrud stores the session reference and executes queries via self.session.exec(), any database access after the session closes will fail.

🔧 Suggested fix
        with Session(engine) as session:
            collection_job_crud = CollectionJobCrud(session, project_id)
            collection_job = collection_job_crud.read_one(job_uuid)
            collection_job = collection_job_crud.update(
                job_uuid,
                CollectionJobUpdate(
                    task_id=task_id,
                    status=CollectionJobStatus.PROCESSING,
                ),
            )

            storage = get_cloud_storage(session=session, project_id=project_id)
            document_crud = DocumentCrud(session, project_id)

            provider = get_llm_provider(
                session=session,
                provider=creation_request.provider,
                project_id=project_id,
                organization_id=organization_id,
            )
-
-        result = provider.create(
-            collection_request=creation_request,
-            storage=storage,
-            document_crud=document_crud,
-        )
+            result = provider.create(
+                collection_request=creation_request,
+                storage=storage,
+                document_crud=document_crud,
+            )
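The failure mode generalizes to any object that keeps a handle past the owning context manager. A minimal stand-in using sqlite3 (`DocumentCrudStandIn` is hypothetical, standing in for `DocumentCrud` holding its `Session`):

```python
import sqlite3
from contextlib import closing


class DocumentCrudStandIn:
    # Hypothetical stand-in: stores the connection the way the real
    # DocumentCrud stores its Session, and queries through it lazily.
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def read_each(self) -> list:
        return self.conn.execute("SELECT 1").fetchall()


with closing(sqlite3.connect(":memory:")) as conn:
    crud = DocumentCrudStandIn(conn)
    inside = crud.read_each()  # fine: the connection is still open

try:
    crud.read_each()  # the with-block has already closed the connection
    closed_error = False
except sqlite3.ProgrammingError:
    closed_error = True
```

Moving the call inside the `with` block, as the suggested fix does, is the straightforward way to keep every dependent operation within the resource's lifetime.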
backend/app/tests/api/routes/collections/test_collection_info.py (1)

194-198: Add missing type hint for user_api_key_header parameter.

This parameter is missing the type annotation dict[str, str], which is used consistently in other test functions in this file (lines 43, 83, 113, 148, 179). As per coding guidelines, all function parameters should have type hints.

Proposed fix
 def test_collection_info_include_docs_and_url(
     client: TestClient,
     db: Session,
-    user_api_key_header,
+    user_api_key_header: dict[str, str],
 ) -> None:
backend/app/tests/crud/collections/collection/test_crud_collection_delete.py (1)

12-31: Add type hints and fix potential None value for project_id.

The function parameters client and project_id are missing type hints, violating the coding guidelines. Additionally, project_id can be None by default but is passed directly to Collection(project_id=project_id, ...) where the model requires a non-null value. The function retrieves a project via get_project(db) but never uses its id as a fallback.

Proposed fix
+from typing import Optional
+from openai import OpenAI as OpenAIClient
+
 def get_assistant_collection_for_delete(
-    db: Session, client=None, project_id: int = None
+    db: Session,
+    client: Optional[OpenAIClient] = None,
+    project_id: Optional[int] = None,
 ) -> Collection:
     project = get_project(db)
     if client is None:
         client = OpenAI(api_key="test_api_key")
+    if project_id is None:
+        project_id = project.id

     vector_store = client.vector_stores.create()
     assistant = client.beta.assistants.create(
         model="gpt-4o",
         tools=[{"type": "file_search"}],
         tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
     )

     return Collection(
         project_id=project_id,
         llm_service_id=assistant.id,
         llm_service_name="gpt-4o",
         provider=ProviderType.openai,
     )
backend/app/tests/utils/collection.py (1)

43-51: Remove organization_id field — it no longer exists on the Collection model.

The Collection model was scoped to projects only and no longer has an organization_id field. Setting this field during instantiation will cause a model validation error.

Proposed fix
     collection = Collection(
         id=collection_id or uuid4(),
         project_id=project.id,
-        organization_id=project.organization_id,
         llm_service_name=model,
         llm_service_id=assistant_id,
         provider=ProviderType.openai,
     )
🤖 Fix all issues with AI agents
In `@backend/app/api/docs/collections/create.md`:
- Around line 13-18: Update the wording and capitalization for all occurrences
of "Openai" and "Openai vector Store" in the document: replace with "OpenAI" and
"OpenAI vector store" respectively, and tighten the sentence about cleanup to
read more concisely (e.g., "If any LLM service interaction fails, all service
resources are cleaned up; for example, if an OpenAI vector store cannot be
created, any files uploaded to OpenAI are removed. Failures can result from
service outages, invalid parameters, or unsupported document types such as
unparseable PDFs."). Apply the same edits to the similar paragraph referenced
(the other occurrence around lines 20-22).

In `@backend/app/services/collections/create_collection.py`:
- Around line 149-155: Initialize creation_request to None before the try block
so the except path can reference it without causing UnboundLocalError;
specifically, add creation_request = None above the try, keep the
CreationRequest(**request) call inside the try to assign creation_request, and
ensure any failure callback logic (which references creation_request) will
handle a None value safely if construction failed.
♻️ Duplicate comments (3)
backend/app/services/collections/create_collection.py (1)

135-143: Type-annotate and mark task_instance as intentionally unused.

This param is untyped and unused; add a type hint and explicitly ignore it to satisfy typing rules and linting. As per coding guidelines, please add explicit type hints and mark unused params.

🔧 Suggested fix
+from typing import Any
@@
 def execute_job(
     request: dict,
     with_assistant: bool,
     project_id: int,
     organization_id: int,
     task_id: str,
     job_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
@@
-    start_time = time.time()
+    _ = task_instance  # reserved for task runner hooks; unused for now
+    start_time = time.time()
backend/app/models/collection.py (1)

40-46: Fix provider Field tuple wrapping (SQLModel won’t map the column).

The trailing comma makes this a tuple rather than a Field, so the column mapping will be broken.

🔧 Suggested fix
-    provider: ProviderType = (
-        Field(
-            nullable=False,
-            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
-            sa_column_kwargs={"comment": "LLM provider used for this collection"},
-        ),
-    )
+    provider: ProviderType = Field(
+        nullable=False,
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        sa_column_kwargs={"comment": "LLM provider used for this collection"},
+    )
backend/app/tests/services/collections/test_delete_collection.py (1)

73-76: Type-annotate db in patched tests.

Several test functions still accept db without a type hint. As per coding guidelines, add db: Session to each.

🔧 Suggested fix
 @patch("app.services.collections.delete_collection.get_llm_provider")
 def test_execute_job_delete_success_updates_job_and_calls_delete(
-    mock_get_llm_provider: MagicMock, db
+    mock_get_llm_provider: MagicMock, db: Session
 ) -> None:
@@
 @patch("app.services.collections.delete_collection.get_llm_provider")
 def test_execute_job_delete_failure_marks_job_failed(
-    mock_get_llm_provider: MagicMock, db
+    mock_get_llm_provider: MagicMock, db: Session
 ) -> None:
@@
 def test_execute_job_delete_success_with_callback_sends_success_payload(
     mock_send_callback: MagicMock,
     mock_get_llm_provider: MagicMock,
-    db,
+    db: Session,
 ) -> None:
@@
 def test_execute_job_delete_remote_failure_with_callback_sends_failure_payload(
     mock_send_callback: MagicMock,
     mock_get_llm_provider: MagicMock,
-    db,
+    db: Session,
 ) -> None:

Also applies to: 138-141, 210-214, 291-295


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

166-191: Keep the DB session alive for provider.create inputs.

document_crud (and potentially storage) are built inside the with Session(...) block but used after the block exits. If provider.create touches the DB via document_crud, this will hit a closed session at runtime. Move the provider.create(...) call into the same session scope (or recreate DocumentCrud inside a new session that stays open for the call).

🔧 Suggested fix
-        with Session(engine) as session:
+        with Session(engine) as session:
             collection_job_crud = CollectionJobCrud(session, project_id)
             collection_job = collection_job_crud.read_one(job_uuid)
             collection_job = collection_job_crud.update(
                 job_uuid,
                 CollectionJobUpdate(
                     task_id=task_id,
                     status=CollectionJobStatus.PROCESSING,
                 ),
             )
 
             storage = get_cloud_storage(session=session, project_id=project_id)
             document_crud = DocumentCrud(session, project_id)
 
             provider = get_llm_provider(
                 session=session,
                 provider=creation_request.provider,
                 project_id=project_id,
                 organization_id=organization_id,
             )
-
-        result = provider.create(
-            collection_request=creation_request,
-            storage=storage,
-            document_crud=document_crud,
-        )
+            result = provider.create(
+                collection_request=creation_request,
+                storage=storage,
+                document_crud=document_crud,
+            )
♻️ Duplicate comments (1)
backend/app/services/collections/create_collection.py (1)

135-143: Type-annotate and mark task_instance as intentionally unused.

Ruff flags task_instance as unused, and it lacks a type hint. If it’s required by the task runner, explicitly annotate it and mark it unused to satisfy linting.

🔧 Suggested fix
+from typing import Any
@@
 def execute_job(
     request: dict,
     with_assistant: bool,
     project_id: int,
     organization_id: int,
     task_id: str,
     job_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
@@
+    _ = task_instance  # reserved for task runner hooks; unused for now

As per coding guidelines, please add explicit type hints for all parameters and returns.

@vprashrex

Collection CRUD soft-deletes the collection entity, so the row stays in the DB, and the unique constraint (project_id, name) still treats deleted rows as existing.
However, exists_by_name ignores soft-deleted collections, which can lead to a mismatch where the router reports a name as available but the database later rejects it.

Suggestion:
add a partial unique index covering only active rows, so soft-deleted rows are not taken into account,
or else remove the where(Collection.deleted_at.is_(None)) line from the exists_by_name function, which would be unfriendly for users since they could never re-use a deleted collection's name.

I am not sure about the whole flow; if you think my analysis is wrong, ignore it.

def exists_by_name(self, collection_name: str) -> bool:
        statement = (
            select(Collection.id)
            .where(Collection.project_id == self.project_id)
            .where(Collection.name == collection_name)
            .where(Collection.deleted_at.is_(None))  # excludes soft-deleted rows, unlike the DB constraint
        )
        result = self.session.exec(statement).first()
        return result is not None
        

Only soft deletion happens here in the collection CRUD:

def delete_by_id(self, collection_id: UUID) -> Collection:
    coll = self.read_one(collection_id)
    coll.deleted_at = now()

    return self._update(coll)

The unique constraint is defined at the DB level:

__table_args__ = (
    UniqueConstraint(
        "project_id",
        "name",
        name="uq_collection_project_id_name",
    ),
)
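The partial-index route sketches directly (SQLite syntax shown for brevity; Postgres accepts the same WHERE clause on CREATE UNIQUE INDEX, and SQLAlchemy can emit it via `Index(..., unique=True, postgresql_where=...)`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE collection (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        name TEXT NOT NULL,
        deleted_at TEXT
    )
    """
)
# Partial unique index: uniqueness is enforced only for active rows.
conn.execute(
    """
    CREATE UNIQUE INDEX uq_collection_project_id_name
    ON collection (project_id, name)
    WHERE deleted_at IS NULL
    """
)

conn.execute("INSERT INTO collection (project_id, name) VALUES (1, 'docs')")

# A duplicate *active* name is rejected...
try:
    conn.execute("INSERT INTO collection (project_id, name) VALUES (1, 'docs')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# ...but once the original is soft-deleted, the name is available again,
# matching what exists_by_name reports.
conn.execute("UPDATE collection SET deleted_at = '2026-01-20' WHERE name = 'docs'")
conn.execute("INSERT INTO collection (project_id, name) VALUES (1, 'docs')")
active = conn.execute(
    "SELECT COUNT(*) FROM collection WHERE deleted_at IS NULL"
).fetchone()[0]
```

This keeps the router-level check and the DB constraint agreeing on which names count as taken.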


Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collections: making this module llm provider agnostic

3 participants