Part 1: strip leading newlines from chat_with_memory's LLM response
(minimax-m2 reasoning output leaks blank lines; .lstrip() at the source
covers /chat, /v1/chat/completions, and the MCP chat tool).
Part 2: replace the /v1/chat/completions handler with an httpx-based
pass-through proxy that preserves every upstream field (tool_calls,
reasoning_tokens, system_fingerprint, finish_reason, etc.) and supports
end-to-end MCP-style tool calling.
What changed:
- models.py: OpenAIChatCompletionRequest is now permissive — typed for
the common fields (tools, tool_choice, parallel_tool_calls,
response_format, max_completion_tokens, seed, stream_options,
reasoning_effort, modalities, etc.) and extra='allow' for forward-
compat. The typed response models (OpenAIChatCompletionResponse and
friends) are deleted — the handler returns upstream's JSON dict
directly so unknown fields aren't silently dropped.
- mem0_manager.py: adds httpx.AsyncClient + an openai_proxy_completion()
method that injects a "Relevant memories" system message only when
the last role is 'user' AND no tool flow is in progress, then forwards
to the upstream LLM. Non-stream returns upstream JSON; stream returns
an async iterator that yields raw upstream SSE bytes verbatim while
side-channel-parsing for the post-stream mem0.add. Codifies the
Memori #434 lessons: never mutates existing messages (only prepends
system), never touches tool_call_id, runs post-add even on mid-stream
error via try/finally.
- main.py: handler is now ~50 lines — model_dump(exclude_unset) the
request, hand off to openai_proxy_completion, return dict OR wrap in
StreamingResponse. response_model=None so FastAPI doesn't validate.
Deleted stream_openai_response (post-hoc word-chunking is gone).
Lifespan shutdown closes mem0_manager.async_http.
Research confirmed mem0 itself does not ship an HTTP /v1/chat/completions
(only the in-process mem0.proxy.main.Mem0 SDK pattern), so we replicate
the pattern without adding a litellm dependency. SSE/tool_calls patterns
are modeled after microsoft/agent-lightning's llm_proxy.
Verified locally: ast.parse OK on all three files. End-to-end smoke tests
will run on beast.
- Add auth to /models and /users endpoints
- Add rate limiting to all endpoints (10-120/min based on operation type)
- Fix 11 info disclosure issues (detail=str(e) -> generic message)
- Fix 2 silent except blocks with proper logging
- Fix 7 raise e -> raise for proper exception chaining
- Fix health check to not expose exception details
- Update tests with X-API-Key headers and security tests