From a715f6715076c628cf913ce6cfab0075004f450e Mon Sep 17 00:00:00 2001 From: "Charles P. Cross" <8572939+cpdata@users.noreply.github.com> Date: Tue, 14 Oct 2025 06:13:09 -0400 Subject: [PATCH] Add comprehensive project review documentation --- DISCREPANCIES.md | 20 +++++++ FINDINGS.md | 31 ++++++++++ ISSUES.md | 16 +++++ NEW_README.md | 126 ++++++++++++++++++++++++++++++++++++++++ PLAN.md | 27 +++++++++ PROJECT.md | 40 +++++++++++++ RECOMMENDATIONS.md | 12 ++++ SOT.md | 141 +++++++++++++++++++++++++++++++++++++++++++++ 8 files changed, 413 insertions(+) create mode 100644 DISCREPANCIES.md create mode 100644 FINDINGS.md create mode 100644 ISSUES.md create mode 100644 NEW_README.md create mode 100644 PLAN.md create mode 100644 PROJECT.md create mode 100644 RECOMMENDATIONS.md create mode 100644 SOT.md diff --git a/DISCREPANCIES.md b/DISCREPANCIES.md new file mode 100644 index 0000000..25d8932 --- /dev/null +++ b/DISCREPANCIES.md @@ -0,0 +1,20 @@ +# README vs. Codebase Discrepancies + +## API Surface +- **Missing registration APIs** – The README walkthrough relies on `MeshMind.register_entity`, `register_allowed_predicates`, `add_predicate`, `store_memory`, `add_memory`, and `add_triplet`. None of these methods exist on `meshmind.client.MeshMind`, nor do they exist elsewhere in the package. +- **No triplet storage path** – README describes storing graph triplets (node–edge–node). The shipped pipeline never touches `Triplet` and only calls `GraphDriver.upsert_entity`, so edges/predicates are never persisted. +- **Entity modelling mismatch** – README expects custom Pydantic models (e.g., `Person`) to be registered and enforced. Extraction currently hardcodes the `Memory` schema and only validates `entity_label` names against the provided classes’ `__name__`, without instantiating those models. + +## Feature Claims +- **CRUD breadth** – README lists add/search/update/delete as supported capabilities. Only `meshmind.api.memory_manager.MemoryManager` exposes update/delete helpers, and they are not surfaced via the documented high-level API. +- **Retrieval methods** – README promises embedding vector search, BM25, LLM reranking, fuzzy search, exact comparison, regex search, filters, and hybrid search. The code offers BM25, fuzzy, filters, and a simple hybrid scorer; it lacks standalone vector search, regex search, exact match utilities, and any LLM-based reranking. +- **Memory preprocessing** – README references importance ranking, deduplication, consolidation, compression, and expiry. Implementations exist, but importance ranking is a fixed default of `1.0`, consolidation only keeps the highest importance duplicate, and expiry/compression are isolated Celery tasks that require additional wiring. + +## Operational Expectations +- **Dependency assumptions** – README does not mention that a Memgraph instance, `mgclient`, `tiktoken`, and manual encoder registration are required. In practice, `MeshMind()` raises immediately when `mgclient` is absent, and `extract_memories` fails if the default embedding encoder is not manually registered. +- **Configuration** – README’s quickstart omits mandatory environment variables (OpenAI API key, Memgraph credentials) that `meshmind.core.config.settings` expects. +- **Testing/setup instructions** – README suggests features (e.g., graph relationships, rich retrieval) that the tests do not cover, and the declared Python requirement (`>=3.13` in `pyproject.toml`) conflicts with README/Contributing guidance (Python 3.10+). + +## Example Code +- The README’s example code would fail: `MeshMind` lacks the invoked methods, extraction would reject the `Person` label unless registered (yet registration does nothing), and storing custom metadata or edges is unsupported. +- The low-level `add_triplet` example assumes the driver can create relationship edges given subject/object names; there is no such helper in the codebase, and `MemgraphDriver.upsert_edge` expects UUIDs, not arbitrary entity names. diff --git a/FINDINGS.md b/FINDINGS.md new file mode 100644 index 0000000..c7858bf --- /dev/null +++ b/FINDINGS.md @@ -0,0 +1,31 @@ +# Additional Findings + +## Dependency Handling +- `meshmind.client.MeshMind` instantiates `MemgraphDriver` on every construction. When `mgclient` is missing (default in many development environments), the import raises immediately, preventing any other functionality (including retrieval helpers) from being used. +- `meshmind.pipeline.compress` sets `tiktoken = None` when the import fails, but still calls `tiktoken.get_encoding`, raising `AttributeError`. The helper in `meshmind.pipeline.preprocess.compress` catches `ImportError`, not `AttributeError`, so the failure propagates. +- `meshmind.core.utils` imports `tiktoken` unconditionally at import time, so simply importing `meshmind.core` without the package installed triggers a crash. +- The OpenAI SDK usage is inconsistent: `meshmind.core.embeddings.OpenAIEmbeddingEncoder.encode` expects dictionary-style responses, while the modern SDK returns typed objects. + +## Encoder Registry +- No encoders are registered by default. `meshmind.pipeline.extract.extract_memories` calls `EncoderRegistry.get(settings.EMBEDDING_MODEL)`; unless the caller has registered a matching encoder beforehand, the call raises `KeyError`. +- `MeshMind` does not register an encoder automatically when instantiated with the default OpenAI client, so the quickstart path fails without additional setup. + +## Graph Persistence +- `meshmind.pipeline.store.store_memories` only calls `graph_driver.upsert_entity`. There is no mechanism to create edges, update predicates, or link memories. `Triplet` from `meshmind.core.types` is unused throughout the codebase. +- `meshmind.db.memgraph_driver.MemgraphDriver.upsert_edge` expects UUIDs for subject/object, but no public API surfaces those identifiers or manages the relationship lifecycle. + +## Retrieval & Search +- `meshmind.retrieval.hybrid.hybrid_search` assumes that every memory already has an embedding and that an encoder is registered under `config.encoder`. It silently assigns a zero score to any memory missing an embedding, which may bury relevant results. +- There is no helper to pull memories out of Memgraph for retrieval; all search functions expect in-memory lists supplied by the caller. + +## CLI & Tasks +- `meshmind.cli.ingest.ingest_command` registers `entity_types=[Memory]`. Supplying a custom Pydantic model has no effect because extraction only validates `entity_label` names; it never instantiates the provided models. +- Scheduled tasks import `MemgraphDriver` at module load and swallow any exception, leaving `manager = None`. Task invocations then return early without logging, making failures hard to diagnose. + +## Testing & Tooling +- Test modules import the production code in ways that do not align with the current library versions (`openai.responses.create`, `Memory.pre_init`). Running `pytest` without heavy monkeypatching will fail. +- `pyproject.toml` pins `requires-python = ">=3.13"`, yet the documentation still references Python 3.10 and Poetry. Tooling commands in `Makefile` (ruff, isort, black) are not listed as dependencies. + +## Documentation Gaps +- Runtime prerequisites (Memgraph, Redis, mgclient, encoder registration) are not described in the existing README. +- There is no architecture guide explaining how pipeline modules, the CLI, Celery tasks, and retrieval helpers relate, making it hard for new contributors to navigate the codebase. diff --git a/ISSUES.md b/ISSUES.md new file mode 100644 index 0000000..0818449 --- /dev/null +++ b/ISSUES.md @@ -0,0 +1,16 @@ +# Issue Backlog + +- [ ] Restore the high-level client surface promised in the README (entity/predicate registration, `add_memory`, `store_memory`, `add_triplet`). +- [ ] Persist graph relationships: extend the storage pipeline to write edges/triplets instead of only upserting nodes. +- [ ] Provide a reliable vector search entrypoint and implement regex/exact-match/LLM rerank retrieval options as documented. +- [ ] Register a default embedding encoder (e.g., OpenAI) on startup so extraction works out of the box. +- [ ] Make `MeshMind` initialization resilient when `mgclient` or Memgraph are unavailable (lazy driver creation or optional in-memory driver). +- [ ] Fix `meshmind.pipeline.compress` to handle missing `tiktoken` gracefully (skip compression or supply a fallback encoder). +- [ ] Guard `meshmind.core.utils` against importing `tiktoken` at module import time to avoid `ModuleNotFoundError`. +- [ ] Update `OpenAIEmbeddingEncoder.encode` to use the modern OpenAI SDK response objects (access `.data`, not dictionary keys) and add error handling/tests. +- [ ] Rework Celery task initialization so `MemgraphDriver` is created lazily within tasks instead of module import time side effects. +- [ ] Ensure tests are executable: remove assumptions about non-existent hooks (`Memory.pre_init`), patch the OpenAI client correctly, and supply dependency fakes. +- [ ] Align Python version and tooling guidance across `pyproject.toml`, README, and CONTRIBUTING (e.g., Python >=3.13 vs. 3.10+, missing `ruff/isort/black` dependencies). +- [ ] Document runtime dependencies explicitly (Memgraph, Redis, OpenAI API key, mgclient) and provide setup scripts or docker-compose services. +- [ ] Build relationship/query abstractions on top of `MemgraphDriver` (e.g., `get_memories`, `search` APIs) instead of expecting consumers to craft Cypher manually. +- [ ] Add integration/e2e tests covering the ingestion pipeline end-to-end with a test Memgraph instance or an in-memory driver substitute. diff --git a/NEW_README.md b/NEW_README.md new file mode 100644 index 0000000..2f9707e --- /dev/null +++ b/NEW_README.md @@ -0,0 +1,126 @@ +# MeshMind (Current State) + +MeshMind is an experimental memory toolkit that combines LLM-assisted extraction with a property-graph backend. The present implementation focuses on turning raw text into `Memory` records, applying lightweight preprocessing, and storing them through a Memgraph-compatible driver. Retrieval helpers operate on in-memory collections of `Memory` objects and support lexical, fuzzy, and hybrid scoring. + +> **Note** +> The original README described a much richer API (entity/predicate registration, triplet storage, advanced retrieval). Those features are **not** implemented yet. This document reflects the functionality that exists today. + +## Features + +- `Memory` data model (`meshmind.core.types.Memory`) captures namespace, entity label, metadata, optional embeddings, timestamps, and TTL/importance metadata. +- `MeshMind` client (`meshmind.client.MeshMind`) wires together an OpenAI client, an embedding model name, and a Memgraph driver, exposing helpers for: + - `extract_memories` – LLM-based extraction that validates entity labels and populates embeddings through the encoder registry. + - `deduplicate`, `score_importance`, `compress` – preprocessing helpers. + - `store_memories` – persists each memory by calling `GraphDriver.upsert_entity`. +- Pipeline modules under `meshmind.pipeline` implement the extraction, preprocessing, compression, expiry, consolidation, and storage steps. +- Retrieval utilities under `meshmind.retrieval` provide TF-IDF (BM25-style) search, RapidFuzz fuzzy matching, metadata/namespace/entity-label filters, and a hybrid scorer that blends cosine similarity with lexical scores. +- `meshmind.api.memory_manager.MemoryManager` offers CRUD-style helpers over an injected graph driver (add/update/delete/list). +- `meshmind.tasks.scheduled` defines Celery tasks (optional) that invoke expiry, consolidation, and compression routines on a schedule. +- A CLI entry point (`python -m meshmind` or `meshmind` after installation) exposes an `ingest` subcommand for end-to-end extraction from local files/directories. + +## Requirements + +- Python 3.10+ (the package metadata currently targets 3.13; align your environment accordingly). +- A running [Memgraph](https://memgraph.com/) instance accessible via Bolt, plus the `mgclient` Python driver. +- An OpenAI API key for both extraction and embedding (set `OPENAI_API_KEY`). +- Optional services: Redis (if Celery tasks are used). +- Python dependencies listed in `pyproject.toml` (`openai`, `pydantic`, `rapidfuzz`, `scikit-learn`, `numpy`, `celery[redis]`, `sentence-transformers`, `tiktoken`, `pymgclient`, etc.). + +## Installation + +```bash +python -m venv .venv +source .venv/bin/activate +pip install -e . +``` + +Ensure Memgraph is running and environment variables are set: + +```bash +export OPENAI_API_KEY=... # required by openai.OpenAI() +export MEMGRAPH_URI=bolt://localhost:7687 +export MEMGRAPH_USERNAME=... +export MEMGRAPH_PASSWORD=... +``` + +Before calling extraction, register an embedding encoder that matches `settings.EMBEDDING_MODEL` (default `text-embedding-3-small`). For example: + +```python +from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder +EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder()) +``` + +## Quickstart + +```python +from meshmind.client import MeshMind +from meshmind.core.types import Memory +from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder + +# Register the embedding encoder expected by the pipeline +EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder()) + +mm = MeshMind() +texts = [ + "Jane Doe is a senior software engineer based in Berlin.", + "John Doe manages the infrastructure team in Berlin.", +] +memories = mm.extract_memories( + instructions="Extract each distinct person as a Memory.", + namespace="Company Directory", + entity_types=[Memory], # entity labels are validated against class names + content=texts, +) +memories = mm.deduplicate(memories) +memories = mm.score_importance(memories) +memories = mm.compress(memories) +mm.store_memories(memories) +``` + +## Retrieval Helpers + +The retrieval utilities operate on lists of `Memory` objects that are already loaded into Python (for example via `MemoryManager.list_memories`). + +```python +from meshmind.retrieval.search import search, search_bm25, search_fuzzy +from meshmind.core.types import Memory, SearchConfig +from meshmind.core.embeddings import EncoderRegistry + +# Register an encoder for hybrid/vector scoring +class DummyEncoder: + def encode(self, texts): + return [[len(t)] for t in texts] + +EncoderRegistry.register("dummy", DummyEncoder()) + +memories = [ + Memory(namespace="Company", name="Jane Doe", entity_label="Memory", embedding=[7.0]), + Memory(namespace="Company", name="John Doe", entity_label="Memory", embedding=[7.0]), +] +config = SearchConfig(encoder="dummy", top_k=5, hybrid_weights=(0.5, 0.5)) +results = search("Jane", memories, namespace="Company", config=config) +``` + +## CLI Ingest + +```bash +meshmind ingest \ + --namespace "Company" \ + --instructions "Extract personnel facts as Memory objects." \ + path/to/files +``` + +The command reads all text files, extracts memories, runs the preprocessing pipeline, and stores the results using the configured Memgraph connection. + +## Maintenance Tasks (Optional) + +If Celery and Redis are available, import `meshmind.tasks.scheduled` to register three periodic tasks: `expire_task`, `consolidate_task`, and `compress_task`. Each task fetches memories through `MemoryManager` and applies the corresponding pipeline helper. + +## Limitations & Next Steps + +- Relationship/edge storage is not implemented; only nodes are persisted. +- Custom entity models are not instantiated during extraction—only `entity_label` names are validated. +- Advanced retrieval techniques (regex search, LLM reranking, external vector databases) are not implemented yet. +- Several modules assume optional dependencies (`mgclient`, `tiktoken`, OpenAI SDK) are installed and configured. + +Refer to `PROJECT.md`, `ISSUES.md`, and `PLAN.md` for the detailed roadmap. diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 0000000..395063d --- /dev/null +++ b/PLAN.md @@ -0,0 +1,27 @@ +# Next Steps Plan + +## Phase 1 – Stabilize the Existing Surface +1. **Adopt `NEW_README.md`** as the public README and archive the legacy version so expectations match reality. +2. **Fix critical dependency traps**: + - Wrap `tiktoken` imports in `core/utils.py` and `pipeline/compress.py` with guards. + - Update `OpenAIEmbeddingEncoder` to use `.data` from the modern SDK and add tests. + - Lazily instantiate `MemgraphDriver` (e.g., in `store_memories` or via a factory) to avoid hard crashes when mgclient is missing. +3. **Auto-register embeddings**: when MeshMind starts, register an `OpenAIEmbeddingEncoder` using `settings.EMBEDDING_MODEL` so extraction works immediately. +4. **Repair the test suite**: modernize mocks for the OpenAI client, remove references to `Memory.pre_init`, and add coverage for the dependency fallbacks above. + +## Phase 2 – Deliver Promised Graph Features +5. **Implement entity/predicate registration APIs** on `MeshMind`, backed by `meshmind.models.registry`. Persist this metadata in Memgraph or an internal registry. +6. **Add triplet storage**: extend the pipeline to transform memory metadata into node/edge upserts (subject–predicate–object) and expose `add_triplet` / `add_memory` helpers. +7. **Provide graph retrieval helpers**: implement functions to fetch memories by namespace, list predicates, and query neighbors directly via the driver. + +## Phase 3 – Enhance Retrieval & Automation +8. **Expand retrieval modes**: add vector-only search, regex/exact match filters, and an optional LLM reranker stage configurable through `SearchConfig`. +9. **Integrate retrieval with storage**: allow `MemoryManager` or the driver to return ranked results instead of requiring callers to preload all memories. +10. **Strengthen maintenance workflows**: initialize Celery tasks lazily with logging, ensure expiry/consolidation/compression can operate when Memgraph is down (e.g., retry/backoff). + +## Phase 4 – Developer Experience & Observability +11. **Align tooling**: reconcile Python version requirements, add dev dependencies (`ruff`, `isort`, `black`) to a `dev` extra, and wire CI to run lint/tests. +12. **Document architecture**: keep `SOT.md` updated, add diagrams or sequence charts, and document configuration/operations in the README. +13. **Add logging and metrics**: instrument CLI ingestion, pipeline stages, and graph operations for debugging and future monitoring. + +Revisit this plan after Phase 1 to adjust scope based on effort estimates and stakeholder priorities. diff --git a/PROJECT.md b/PROJECT.md new file mode 100644 index 0000000..5d7e725 --- /dev/null +++ b/PROJECT.md @@ -0,0 +1,40 @@ +# Project Overview + +MeshMind is positioned as a knowledge management toolkit that combines large language models (LLMs) with a property graph store. The current codebase delivers a minimal pipeline for turning unstructured text into `Memory` records, performing lightweight preprocessing, and persisting them through a Memgraph-compatible driver. Retrieval utilities provide lexical, fuzzy, and simple hybrid (vector + lexical) ranking for in-memory collections of `Memory` objects. Supporting modules include a CLI ingest command, Celery task stubs for maintenance, and helper utilities for embeddings and similarity calculations. + +## Current Capabilities + +- **Memory data model** – `meshmind.core.types.Memory` defines the canonical shape of a stored memory, including namespace, entity label, metadata, embeddings, timestamps, importance score, and optional TTL. `Triplet` and `SearchConfig` models support relationship payloads and retrieval configuration. +- **Client façade** – `meshmind.client.MeshMind` wraps together an OpenAI client, an embedding model name, and a `MemgraphDriver`. It exposes helpers for extraction, deduplication, importance scoring, compression, and persistence by delegating into the pipeline modules. +- **Extraction pipeline** – `meshmind.pipeline.extract.extract_memories` orchestrates OpenAI function-calling against the `Memory` schema, enforces allowed entity labels, populates namespaces, and computes embeddings via the encoder registry. +- **Preprocessing utilities** – `meshmind.pipeline.preprocess` contains functions for deduplication (by name or embedding similarity), default importance scoring, and token-aware compression delegation. `meshmind.pipeline.compress` provides a truncation-based compressor (requires `tiktoken`). +- **Storage adapter** – `meshmind.pipeline.store.store_memories` iterates `Memory` objects and calls `GraphDriver.upsert_entity`; `meshmind.api.memory_manager.MemoryManager` provides CRUD helpers over an injected graph driver. +- **Graph driver** – `meshmind.db.memgraph_driver.MemgraphDriver` implements `GraphDriver` using `mgclient`, exposing `upsert_entity`, `upsert_edge`, `find`, `delete`, and a naïve in-memory cosine similarity search fallback. +- **Retrieval helpers** – `meshmind.retrieval` includes TF-IDF “BM25” search, RapidFuzz-based fuzzy matching, metadata/namespace/entity-label filters, and a hybrid scorer that fuses embedding cosine similarity with lexical scores (requires encoders to be registered). +- **Maintenance tasks** – `meshmind.tasks.scheduled` declares Celery beat schedules (if Celery/Redis are available) for expiry, consolidation, and compression. The tasks call the corresponding pipeline helpers through a `MemoryManager` when the Memgraph driver can initialize. +- **Command line ingest** – `meshmind.cli.__main__` exposes an `ingest` subcommand that crawls files/folders, extracts memories, preprocesses them, and stores the results. +- **Examples and tests** – `examples/extract_preprocess_store_example.py` demonstrates the manual pipeline, while `meshmind/tests` supplies pytest suites covering extraction mocking, preprocessing, retrieval, similarity, and driver behavior (many rely on monkeypatching dummy dependencies). + +## Broken or Incomplete Capabilities + +- The high-level client lacks the API surface described in the original README (no `register_entity`, `register_allowed_predicates`, `add_memory`, `store_memory`, or `add_triplet`). +- Relationship management is effectively missing: `store_memories` only upserts nodes, and there is no path that persists edges or triplets. +- Retrieval coverage is partial. There is no dedicated vector-search entrypoint, no regex or exact-match search, and no LLM reranking despite README claims. +- Encoder registration is entirely manual; without pre-registering an encoder name that matches `settings.EMBEDDING_MODEL`, extraction fails with `KeyError`. +- `MeshMind` initialization unconditionally requires `mgclient`; without the package or a running Memgraph instance, the client raises `ImportError` at construction time. +- `meshmind.pipeline.compress` assumes `tiktoken` is installed even after setting `tiktoken = None` on ImportError; calling `compress_memories` without the library causes an `AttributeError`. +- `meshmind.core.utils` imports `tiktoken` at module import time without guarding for missing dependencies. +- `OpenAIEmbeddingEncoder.encode` assumes OpenAI responses behave like dictionary payloads; the current OpenAI SDK returns objects with attribute access, leading to runtime errors. +- Celery tasks import and initialize `MemgraphDriver` at module scope; with `mgclient` missing, they fall back to `None`, but any later access to `manager` silently no-ops. +- Tests in `meshmind/tests` assume pytest fixtures and dummy hooks that do not exist in the production code (`Memory.pre_init`, `openai.responses.create` signatures, etc.), leaving the suite non-executable without additional scaffolding. + +## Future Opportunities / Roadmap Sketch + +- Restore the full high-level API promised in the README: entity registration, predicate management, triplet storage, and multi-level accessors (extract/store, add_memory, add_triplet). +- Expand graph persistence to write relationships alongside nodes, and surface query helpers for graph traversals. +- Provide first-class retrieval APIs (vector search via driver, regex/exact match filters, optional LLM rerank stage, pagination, scoring metadata). +- Harden dependency management: lazy-load optional packages, supply fallbacks or mocks, and document required services (Memgraph, Redis). +- Improve encoder ergonomics by auto-registering defaults (OpenAI embeddings) and offering local encoder options without manual setup. +- Build comprehensive integration tests that exercise real pipelines against temporary Memgraph/Redis containers (or provide in-memory substitutes). +- Update documentation to align with the implemented surface area and clearly communicate configuration, limitations, and extension points. +- Provide developer tooling parity (lint/format commands that match declared dependencies, realistic Python version constraints) and CI pipelines. diff --git a/RECOMMENDATIONS.md b/RECOMMENDATIONS.md new file mode 100644 index 0000000..16df38d --- /dev/null +++ b/RECOMMENDATIONS.md @@ -0,0 +1,12 @@ +# Recommendations + +1. **Reconcile documentation and API** – Either implement the registration/triplet APIs promised in the legacy README or rewrite the public documentation (including the package `readme`) to match the current surface area. The new `NEW_README.md` can serve as the basis. +2. **Refactor driver initialization** – Delay `MemgraphDriver` creation until it is actually needed and provide a pluggable in-memory driver for development/testing. This keeps the package usable even when Memgraph/mgclient are unavailable. +3. **Harden optional dependencies** – Wrap `tiktoken`, Celery, and OpenAI imports in lazy loaders with graceful fallbacks. Update `meshmind.pipeline.compress` and `meshmind.core.utils` accordingly, and add automated tests for missing dependency scenarios. +4. **Auto-register embeddings** – Ship a helper that registers an `OpenAIEmbeddingEncoder` (and perhaps a SentenceTransformer encoder) based on configuration, so extraction and hybrid search work without manual setup. +5. **Implement relationship storage** – Introduce pipeline steps that translate `Memory` metadata into graph relationships, extend the driver to upsert nodes/edges atomically, and expose helper APIs for managing predicates. +6. **Expand retrieval capabilities** – Provide first-class APIs for vector-only search, regex/exact-match filters, LLM-based reranking, and integrate retrieval with the graph driver (so consumers are not required to preload all memories). +7. **Stabilize testing** – Modernize the test suite to match the current OpenAI SDK and code structure, add fixtures for mgclient/tiktoken absence, and introduce integration tests (possibly via docker-compose) to validate the ingest-to-store flow. +8. **Align tooling guidance** – Update `pyproject.toml`, README, and CONTRIBUTING with consistent Python version requirements, and list development dependencies (ruff, isort, black) or provide a `dev` extra. +9. **Document architecture** – Maintain a Source of Truth (see `SOT.md`) and keep it updated as the code evolves. Include diagrams or flow descriptions that explain how extraction, preprocessing, storage, and retrieval modules interact. +10. **Improve observability** – Add structured logging around CLI ingestion, Celery tasks, and driver operations, making it easier to diagnose failures (e.g., `manager` falling back to `None`). diff --git a/SOT.md b/SOT.md new file mode 100644 index 0000000..cdc3d94 --- /dev/null +++ b/SOT.md @@ -0,0 +1,141 @@ +# MeshMind Source of Truth + +This document describes the current structure, components, and data flow of the MeshMind codebase. It is intended to help new contributors understand how the pieces fit together and where to extend the system. + +## Repository Layout + +``` +meshmind/ +├── api/ # MemoryManager CRUD adapter +├── cli/ # Command line entry points (meshmind ingest) +├── client.py # High-level MeshMind façade +├── core/ # Config, embeddings, similarity, types, utilities +├── db/ # Graph driver abstractions (Memgraph) +├── models/ # Registry helpers for entities/predicates (unused) +├── pipeline/ # Extraction, preprocessing, storage, maintenance steps +├── retrieval/ # In-memory search strategies and filters +├── tasks/ # Celery app bootstrap and scheduled jobs +├── tests/ # Pytest suites (require monkeypatching) +└── examples/ # Extraction/preprocess/store example script & notebook +``` + +Supporting files include `README.md` (legacy), `NEW_README.md` (current state), `pyproject.toml` (packaging), `Makefile` (dev commands), and `docker-compose.yml` (placeholder; not wired to services). + +## Configuration (`meshmind/core/config.py`) + +- Reads environment variables for Memgraph (`MEMGRAPH_URI`, `MEMGRAPH_USERNAME`, `MEMGRAPH_PASSWORD`), Redis (`REDIS_URL`), OpenAI API key, and default embedding model name (`EMBEDDING_MODEL`). +- Uses `python-dotenv` if available to load a `.env` file. +- Exposes a singleton `settings` object consumed throughout the codebase. + +## Data Models (`meshmind/core/types.py`) + +- `Memory`: Pydantic model with fields `uuid`, `namespace`, `name`, `entity_label`, optional `embedding`, arbitrary `metadata`, timestamps (`created_at`, `updated_at`), `importance`, `ttl_seconds`, and optional `reference_time`. +- `Triplet`: Represents a subject–predicate–object relation (currently unused by the pipeline). +- `SearchConfig`: Configures retrieval (`encoder`, `top_k`, `rerank_k`, `filters`, `hybrid_weights`). + +## Client (`meshmind/client.py`) + +- `MeshMind` constructor wires up: + - `self.llm_client`: `openai.OpenAI()` instance unless supplied. + - `self.embedding_model`: from settings or override. + - `self.driver`: `MemgraphDriver(settings.MEMGRAPH_URI, ...)` unless a custom driver is injected. +- Exposes thin wrappers that delegate to pipeline modules: + - `extract_memories` → `meshmind.pipeline.extract.extract_memories` + - `deduplicate`, `score_importance`, `compress` → `meshmind.pipeline.preprocess` + - `store_memories` → `meshmind.pipeline.store.store_memories` +- **Important**: No higher-level APIs (registering entities/predicates, storing triplets) exist yet. + +## Pipeline Modules (`meshmind/pipeline/`) + +1. **Extraction (`extract.py`)** + - Builds an OpenAI Responses API call with function-calling against the `Memory` JSON schema. + - Requires a list of allowed entity types (only the class names are checked) and a registered embedding encoder. + - Post-processes the function-call output, injects the namespace, validates via `Memory(**entry)`, and populates embeddings. + +2. **Preprocessing (`preprocess.py`)** + - `deduplicate`: Removes duplicates by exact name match and, when `threshold >= 0.5`, by cosine similarity over embeddings. + - `score_importance`: Sets missing importance scores to `1.0`. + - `compress`: Delegates to `compress_memories`; catches `ImportError` and generic exceptions to fall back to original memories. + +3. **Compression (`compress.py`)** + - Uses `tiktoken` to truncate memory `metadata['content']` to `max_tokens` (default 500). Requires `tiktoken.get_encoding('o200k_base')`. + +4. **Consolidation (`consolidate.py`)** + - Groups memories by name and retains the one with the highest `importance` (ties fall back to first seen). + +5. **Expiry (`expire.py`)** + - Iterates `MemoryManager.list_memories()`, compares `created_at + ttl_seconds` with `datetime.utcnow()`, and calls `delete_memory` for expired entries. + +6. **Store (`store.py`)** + - Iterates an iterable of memories, extracts their dict representation, and invokes `GraphDriver.upsert_entity`. Relationships are not handled. + +## Graph Layer (`meshmind/db/`) + +- `base_driver.py`: Defines the `GraphDriver` abstract base class (`upsert_entity`, `upsert_edge`, `find`, `delete`). +- `memgraph_driver.py`: + - Imports `mgclient` and connects to Memgraph via Bolt using the configured URI. + - Implements `upsert_entity` (MERGE on `uuid`), `upsert_edge` (MERGE between nodes identified by UUID), `find` (executes arbitrary Cypher and returns list of dicts), `delete` (DETACH DELETE by uuid), and `vector_search` (loads all embeddings and ranks via cosine similarity in Python). + - Raises `ImportError` at instantiation time when `mgclient` is missing. + +## Embeddings & Similarity (`meshmind/core/embeddings.py`, `similarity.py`) + +- `EncoderRegistry`: simple class-level registry mapping string names to encoder instances (`register`, `get`). Nothing is pre-registered. +- `OpenAIEmbeddingEncoder`: wraps the OpenAI Embeddings API with retries; assumes dictionary-style responses (`response['data']`). +- `SentenceTransformerEncoder`: wraps `sentence-transformers` for local embeddings. +- `similarity.py`: provides `cosine_similarity` and `euclidean_distance` utilities using NumPy. + +## Retrieval (`meshmind/retrieval/`) + +- `filters.py`: filter helpers by namespace, entity labels, and metadata exact matches. +- `bm25.py`: TF-IDF vectorizer + cosine similarity to approximate BM25 scoring. +- `fuzzy.py`: RapidFuzz-based WRatio scoring. +- `hybrid.py`: Computes query embeddings via a registered encoder, fuses with BM25 scores using weights in `SearchConfig`, and returns ranked tuples. +- `search.py`: Dispatcher functions `search` (hybrid with filters), `search_bm25`, and `search_fuzzy`. +- **Limitation**: All retrieval operates on in-memory lists supplied by the caller; there is no integration with the graph driver. + +## CLI (`meshmind/cli/`) + +- `__main__.py`: CLI bootstrap with `ingest` subcommand. +- `ingest.py`: Walks provided paths, reads text files, runs extraction/preprocessing/store via `MeshMind`. Uses `entity_types=[Memory]` by default. + +## Tasks (`meshmind/tasks/`) + +- `celery_app.py`: Creates a Celery app configured with Redis if `celery` is installed; otherwise provides a no-op dummy app. +- `scheduled.py`: + - Attempts to instantiate `MemgraphDriver` and `MemoryManager` at import time; on failure sets them to `None`. + - Configures Celery beat schedules for expiry (daily), consolidation (every 6 hours), and compression (every 12 hours). + - Defines tasks `expire_task`, `consolidate_task`, and `compress_task` that operate on `manager` if available, else return immediately. + +## API Layer (`meshmind/api/memory_manager.py`) + +- Provides CRUD-style methods `add_memory`, `update_memory`, `delete_memory`, `get_memory`, and `list_memories` by delegating to the injected `GraphDriver`. +- Converts Pydantic models to dicts via `memory.dict(exclude_none=True)` when possible. + +## Models (`meshmind/models/registry.py`) + +- Defines `EntityRegistry` and `PredicateRegistry` for registering Pydantic models and allowed predicate labels. +- Currently unused by the rest of the system. + +## Examples & Tests + +- `examples/extract_preprocess_store_example.py`: Demonstrates extraction, preprocessing, and storage using the CLI components. +- `meshmind/tests`: Contains pytest modules covering extraction, preprocessing, driver behavior, retrieval, similarity, and maintenance tasks. Many tests rely on dummy dependencies or outdated OpenAI interfaces; the suite requires extensive monkeypatching to run. + +## External Dependencies + +- **Memgraph + mgclient**: Required for `MemgraphDriver`. Without mgclient, most functionality that instantiates `MeshMind` will fail. +- **OpenAI SDK**: Used for extraction (Responses API) and embeddings. +- **tiktoken**: Needed for compression utilities and token counting (also imported by `meshmind.core.utils`). +- **RapidFuzz, scikit-learn, numpy**: Power the retrieval modules. +- **Celery + Redis**: Optional, used only if scheduled maintenance tasks are activated. +- **sentence-transformers**: Optional embedding encoder. + +## Known Gaps & Caveats + +- No default encoder registration; consumers must call `EncoderRegistry.register` before invoking extraction or hybrid search. +- Relationship storage (`Triplet`, predicate registry) is unimplemented; only nodes are persisted. +- Driver initialization is eager and fails without Memgraph/mgclient. +- Optional dependencies are not guarded consistently, leading to import-time crashes when missing. +- Tests require updates to match current library behavior and to run without external services. + +Maintain this document as the system evolves to keep onboarding friction low.