knowledge-base/CLAUDE.md
Pratik Narola 0f0addb36b chore: migrate to mem0ai v2.0.2 (V3 memory pipeline)
Pin mem0ai[nlp]==2.0.2 and fastembed for the new hybrid-search pipeline.
Drop OSS graph memory (removed upstream in 2.0.0, PR #4805): remove Neo4j
service, env vars, volumes, and driver deps; mark /graph/relationships
deprecated. Rewrite Memory.search/get_all/chat/health call sites to use
the v2 filters={} + top_k API (entity IDs at top level now raise
ValueError). Tighten MCP remove_memory ownership check to O(1)
verify_memory_ownership so it doesn't silently truncate at the new
top_k=20 default. Downgrade base image to python:3.12-slim for spaCy.

Adds scripts/migrate_qdrant_to_v3.py (scroll+upsert with per-user count
parity check) and docs/MIGRATION_RUNBOOK.md covering snapshot, dump,
collection rebuild, cutover, and rollback procedures.
2026-05-23 14:49:45 +05:30

9 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project overview

FastAPI backend that wraps the mem0ai SDK (pinned to mem0ai[nlp]==2.0.2 — the "V3 memory pipeline") to expose memory operations (add/search/update/delete, plus memory-aware chat) over a REST API, an OpenAI-compatible /v1/chat/completions endpoint, and an MCP server. Memory is stored in Qdrant (vectors + BM25 sparse), with a sister {collection}_entities Qdrant collection auto-created by mem0 for entity linking. Embeddings come from a local Ollama instance; the LLM is a custom OpenAI-compatible endpoint. The frontend is two standalone HTML files (index.html, graph.html) that call the API directly — no build step.

Common commands

# First-time setup (creates volumes, builds, brings up stack). Prompts to reset volumes if they already exist.
./setup.sh

# Day-to-day
docker compose up -d --build      # rebuild + start
docker compose down               # stop (keeps volumes)
docker compose down -v            # stop + delete data
docker compose logs -f backend    # tail backend logs (structlog JSON)
docker compose restart backend    # pick up code changes (no --reload; see "Volumes" gotcha)

# Sanity check — assumes a host route to backend:8000 exists (see "Networking" gotcha).
curl http://localhost:8000/health

# Integration tests — hit the running stack, no mocks. See test_integration.py for the test list.
MEM0_API_KEY=<key-from-API_KEYS> python test_integration.py
MEM0_API_KEY=<key-from-API_KEYS> python test_integration.py -v

There are no unit tests and no separate lint/format/type-check setup — test_integration.py is the only test entry point and it requires a fully-running Docker stack. The script generates a fresh TEST_USER = f"test_user_{int(datetime.now().timestamp())}" per run, so for tests to pass auth checks the supplied MEM0_API_KEY must map to that exact user in API_KEYS (either set TEST_USER to a statically mapped user, or add a mapping for the run).

Architecture

Request flow

  1. Client hits FastAPI (backend/main.py) with X-API-Key (or Authorization: Bearer for /v1/chat/completions).
  2. auth.py resolves the key → user_id via settings.api_key_mapping (parsed from the API_KEYS env JSON). Every protected endpoint then verifies the caller's user_id matches the path/body user_id — there is no admin or cross-user access.
  3. The endpoint calls mem0_manager.mem0_manager (singleton in backend/mem0_manager.py), which delegates to the mem0ai SDK. The SDK in turn calls Qdrant, Ollama, Cohere (reranker), and the custom OpenAI endpoint.
  4. The @timed("operation_name") decorator from backend/monitoring.py wraps memory operations to log structured timings and feed the in-memory stats singleton that powers /stats and /stats/{user_id}.

Three parallel API surfaces

All three live in the same FastAPI process and share auth + rate limiting:

  • Native REST (/chat, /memories*, /graph/relationships/{user_id} (deprecated — returns empty payload), /stats*, /models, /users) — authenticates via X-API-Key.
  • OpenAI-compatible (/v1/chat/completions, also /chat/completions) — authenticates via Authorization: Bearer <key> or X-API-Key; supports streaming SSE. Implemented in main.py:openai_chat_completions and stream_openai_response.
  • MCP mounted at /mcp (see backend/mcp_server.py) — uses a Starlette MCPAuthMiddleware that stuffs the resolved user_id into a ContextVar, which the FastMCP tools (add_memory, search_memory, remove_memory, chat) read. The MCP session manager is started inside the main FastAPI lifespan in main.py — mounted-app lifespans don't run automatically, so don't move that startup logic.

Storage layout

  • Qdrant — collection name from QDRANT_COLLECTION_NAME (default mem0). Embedding dim must match the embedder; see "Embedding dimensions" gotcha below. Collections created by mem0 v2 carry a bm25 sparse-vector slot for hybrid search (semantic + keyword + entity-boost). The slot is added automatically at collection creation; existing pre-v2 collections silently degrade to semantic-only with a logged warning — they must be recreated to gain BM25.
  • {collection}_entities — sister Qdrant collection lazy-created by mem0 v2 on first add(), same dimension as the main collection. Stores entity vectors used for ranking boost. No code touches it directly.
  • No graph store — Neo4j and the OSS graph memory feature were removed in mem0ai 2.0.0 (PR #4805). The /graph/relationships/{user_id} endpoint is kept for client compatibility but returns deprecated: true with empty arrays.
  • No SQL store — older docs mention PostgreSQL/pgvector; that's no longer used. Qdrant only.

Important conventions / gotchas

Networking: backend is not published to the host

docker-compose.yml defines the backend on the external npm_network (Nginx Proxy Manager) and only exposes port 8000 inside Docker. There is no ports: mapping. To hit it from the host you need either: (a) the NPM proxy in front of it, (b) docker compose exec backend curl ..., or (c) add a temporary ports: mapping. The npm_network must exist before bringing the stack up (docker network create npm_network if you don't run NPM).

Claude/OpenAI-compatible monkey-patch

mem0_manager.py patches mem0.llms.openai.OpenAILLM.generate_response at import time to clear store and top_p from the config. In mem0ai>=2.0.0, the store half is redundant (upstream made store opt-in) but kept as a harmless safety net. The top_p clearing is still load-bearing: Claude (reached via the custom OpenAI-compatible endpoint) rejects top_p whenever temperature is set, and OpenAILLM sends both unconditionally. If you upgrade mem0ai and chat starts 400-ing on the custom endpoint, this patch is the first place to look.

Embedding model and dimensions are coupled

EMBEDDING_MODEL and EMBEDDING_DIMS in .env / docker-compose.yml must agree, and they must match the dim the Qdrant collection was created with. Defaults are qwen3-embedding:4b-q8_0 / 2560. Switching the model requires either matching dims or recreating the Qdrant collection (./setup.sh → option 2 wipes volumes).

Single-model architecture

Despite what README.md, TESTING.md, and MEM0.md say about intelligent routing across o4-mini / gemini-2.5-pro / claude-sonnet-4 / o3, the code uses one model — settings.default_model (claude-sonnet-4 by default). /models returns only that. Don't reintroduce routing without first checking with the user.

mem0 v2 API: filters dict + top_k

mem0 v2 rejects user_id/agent_id/run_id as top-level kwargs on Memory.search and Memory.get_all (raises ValueError) — they must live inside a filters={...} dict. The limit kwarg is renamed top_k (default reduced 100 → 20 — pass it explicitly when you need more). Memory.add and Memory.delete_all still accept these IDs as top-level kwargs. Use the _build_filters() helper at the top of mem0_manager.py to construct the dict. Search score is now a fused multi-signal value (semantic + BM25 + entity boost), not raw cosine — don't compare against thresholds calibrated for the old scoring.

ADD-only memory algorithm

Memory.add in mem0 v2 only emits ADD events; the engine no longer issues UPDATE/DELETE events based on LLM judgment. Per-user memory count grows monotonically. Explicit Memory.update / Memory.delete still work and are how the project mutates memories.

Auth & rate limiting

  • All endpoints except /health require a valid X-API-Key (or Bearer for the OpenAI-compatible routes). API_KEYS is a JSON object mapping keys → user IDs. Note this contradicts AUTH_SETUP.md, which lists /stats and /models as public — the code is authoritative.
  • Rate limits via slowapi are set per endpoint in main.py decorators: chat 30/min, writes 60/min, reads 120/min, bulk user-delete 10/min. Keyed by API key (fallback to remote IP).
  • Memory ownership is checked via mem0_manager.verify_memory_ownership (O(1) Memory.get(memory_id)) — use this rather than fetching all user memories.

Config field aliases

backend/config.py uses Pydantic AliasChoices so both OPENAI_API_KEY and OPENAI_COMPAT_API_KEY (and OPENAI_BASE_URL / OPENAI_COMPAT_BASE_URL) populate the same field. docker-compose.yml passes OPENAI_API_KEY; .env.example documents OPENAI_COMPAT_API_KEY. Both work.

Volumes mount the source in, but no hot reload

docker-compose.yml bind-mounts ./backend:/app and ./frontend:/app/frontend, and uvicorn runs with --workers 4 and no --reload. Code edits become live only on container restart (docker compose restart backend).

Logging

Use structlog.get_logger(__name__) with keyword arguments (e.g. logger.info("msg", user_id=x)). The last few commits explicitly fixed places that mixed stdlib logging (which silently drops kwargs) with structlog. Don't reintroduce logging.getLogger in this codebase.