MemFuse — Memory Infrastructure for LLMs
Open-source memory infrastructure for LLMs, combining a cognitive architecture, intelligent buffering, and unified search to address the statelessness problem inherent in LLM-based applications

Problem
Large Language Models are fundamentally stateless. Every API call operates in isolation, and when context windows fill up (4K-128K tokens), all conversation history, user preferences, and critical context vanish. This creates four problems:
- LLMs forget everything beyond their context window—no persistent learning across sessions
- Developers waste tokens and money resending entire chat histories with every API call
- Every team rebuilds memory systems from scratch instead of reusing battle-tested infrastructure
- No standard solution exists—each application implements ad-hoc memory management
Without persistent memory, AI applications become forgetful assistants that start from zero with every conversation.
Approach
MemFuse implements a cognitive architecture inspired by human memory systems, combined with aggressive performance optimizations for production workloads.
Five-Tier Cognitive Memory System
Built the first open-source implementation of hierarchical memory for LLMs (a conceptual sketch follows this list):
- M0 (Episodic Layer) — Raw conversation history, like short-term working memory
- M1 (Semantic Layer) — Extracted facts and entities from conversations
- M2 (Knowledge Graph) — Complex relational knowledge with graph structures
- M3 (Procedural Layer) — Learned behavioral patterns and procedures
- MSMG (Multi-Graph) — Advanced graph-based semantic relationships
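Conceptually, each tier is a separate store, with higher tiers populated by extraction pipelines running over the tier below. The sketch below is purely illustrative; MemoryTier, MemoryRecord, and HierarchicalMemory are hypothetical names, not the actual MemFuse API.

from dataclasses import dataclass, field
from enum import Enum

class MemoryTier(Enum):
    M0_EPISODIC = "m0"         # raw conversation history
    M1_SEMANTIC = "m1"         # extracted facts and entities
    M2_KNOWLEDGE_GRAPH = "m2"  # relational knowledge
    M3_PROCEDURAL = "m3"       # learned behavioral patterns
    MSMG = "msmg"              # multi-graph semantic relationships

@dataclass
class MemoryRecord:
    tier: MemoryTier
    content: str
    metadata: dict = field(default_factory=dict)

class HierarchicalMemory:
    """Toy container: every raw message lands in M0; higher tiers are
    populated later by extraction jobs (fact extraction, graph building, ...)."""

    def __init__(self) -> None:
        self.stores: dict[MemoryTier, list[MemoryRecord]] = {t: [] for t in MemoryTier}

    def add_episode(self, message: str) -> None:
        self.stores[MemoryTier.M0_EPISODIC].append(
            MemoryRecord(MemoryTier.M0_EPISODIC, message)
        )

    def promote(self, record: MemoryRecord, target: MemoryTier) -> None:
        # e.g. a fact extractor promotes an episodic record into the semantic layer
        self.stores[target].append(MemoryRecord(target, record.content, dict(record.metadata)))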
Advanced Buffer System Architecture
Designed a three-component buffering system that dramatically reduces latency (see the sketch after this list):
- WriteBuffer — Batches small writes into larger operations through write aggregation
- SpeculativeBuffer — Analyzes access patterns and proactively prefetches data before requests
- QueryBuffer — Caches frequently accessed data with intelligent reranking using Reciprocal Rank Fusion (RRF)
A central BufferManager orchestrates data flow across all memory tiers, optimizing for minimal latency at each level.
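A minimal sketch of the two simplest mechanics, write aggregation and query caching, follows. The class names, batch threshold, and cache policy here are hypothetical; the real WriteBuffer, QueryBuffer, and BufferManager coordination are more involved.

from typing import Any, Callable

class WriteBuffer:
    """Illustrative write aggregation: small writes are queued and flushed
    to storage as one batch once a size threshold is reached."""

    def __init__(self, flush_fn: Callable[[list[Any]], None], max_batch: int = 32) -> None:
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.pending: list[Any] = []

    def write(self, item: Any) -> None:
        self.pending.append(item)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.flush_fn(self.pending)  # one bulk insert instead of many small ones
            self.pending = []

class QueryBuffer:
    """Illustrative query cache keyed by the query string."""

    def __init__(self) -> None:
        self.cache: dict[str, list[Any]] = {}

    def get_or_compute(self, query: str, compute: Callable[[], list[Any]]) -> list[Any]:
        if query not in self.cache:
            self.cache[query] = compute()
        return self.cache[query]

Batching trades a small amount of write latency for far fewer round trips to the database, which is why flushing happens on a size threshold rather than per message.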
Unified Cognitive Search
Implemented search fusion combining three modalities with RRF reranking (see the sketch after this list):
- Vector search for semantic similarity (pgvector)
- Graph search for relational reasoning
- Keyword search for exact matching
- RRF reranking to surface the most relevant results across all sources
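RRF scores each result by summing 1/(k + rank) across the ranked lists it appears in, so items that place well in several modalities float to the top even when no single modality ranks them first. A minimal, self-contained sketch (document IDs and the conventional k = 60 are illustrative):

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank),
    where rank is the 1-based position of doc in each list."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse hits from the three modalities (IDs are illustrative):
vector_hits  = ["doc3", "doc1", "doc7"]
graph_hits   = ["doc1", "doc4"]
keyword_hits = ["doc7", "doc1", "doc9"]
print(rrf_fuse([vector_hits, graph_hits, keyword_hits]))  # doc1 ranks first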
Custom TimescaleDB Implementation
Built a custom pgai-like implementation for embedding and vector operations instead of relying on TimescaleDB's official extension, allowing greater control and customization for this specific use case.
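As a rough illustration of the kind of vector operation this involves, the sketch below runs a nearest-neighbour query against a pgvector column using psycopg2. The memory_items table, its columns, and the connection handling are assumptions for the example, not the actual MemFuse schema.

import psycopg2  # assumes the pgvector extension is installed in the target database

# Hypothetical table layout; the actual MemFuse schema may differ.
QUERY = """
    SELECT id, content
    FROM memory_items
    ORDER BY embedding <=> %s::vector  -- cosine distance operator from pgvector
    LIMIT %s;
"""

def nearest_memories(dsn: str, query_embedding: list[float], top_k: int = 5):
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (vector_literal, top_k))
        return cur.fetchall()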
Multi-Tenant Architecture
Designed secure isolation with flexible scoping (example after this list):
- User (required) — Person identity
- Agent (optional) — Specific AI assistant (defaults to "agent_default")
- Session (optional) — Conversation thread identifier
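Assuming agent and session are accepted as keyword arguments alongside user (only the user-scoped form appears in the integration example later in this document), creating a narrow and a broad scope might look like this:

from memfuse import MemFuse

memfuse_client = MemFuse()

# Narrowest scope: one user, one assistant, one conversation thread.
# (The agent and session keyword arguments are assumed here; only user is required.)
support_memory = memfuse_client.init(
    user="alice",
    agent="support_bot",   # defaults to "agent_default" when omitted
    session="ticket-1234",
)

# Broadest scope: everything known about this user, across agents and sessions.
user_memory = memfuse_client.init(user="alice")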
Impact
Developer Experience Revolution
3-line integration with zero-code transparency; memory works without changing existing LLM API call patterns:
import os

from memfuse import MemFuse
from memfuse.llm import OpenAI

# Initialize memory scope
memfuse_client = MemFuse()
memory = memfuse_client.init(user="alice")

# Standard OpenAI client with memory automatically enabled
llm_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), memory=memory)

# All subsequent calls leverage persistent memory
response = llm_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's Mars gravity?"}],
)

Follow-up questions automatically retrieve relevant context without manual history management.
Technical Achievements
- Local-first architecture — No mandatory cloud dependencies or licensing fees
- Apache 2.0 licensed — Fully open source and fork-friendly
- Production-ready testing — Comprehensive test suite with LongMemEval and MSC benchmarks
- 20 GitHub stars — Active development with 143 commits, 2 core contributors
- Framework integrations — LangChain, AutoGen, Vercel AI SDK, direct API support
Performance Optimizations
- 90% token savings through intelligent history management (avoids resending full context)
- Sub-100ms retrieval via speculative prefetching and query caching
- Write aggregation for database efficiency
- Semantic validation using the MSC dataset with a 95% accuracy threshold
Real-World Use Cases
Deployed across multiple domains:
- Conversational AI with persistent user preferences
- Educational tutors tracking student progress across sessions
- CLI coding assistants learning developer patterns
- Multi-agent systems with shared memory
- Research assistants with deep context retention
Technical Innovations
1. Custom pgai Implementation
Built custom embedding and vector operation capabilities instead of relying on TimescaleDB's official extension, enabling fine-grained control over performance characteristics.
2. Speculative Prefetching
Implemented predictive data loading based on access-pattern analysis, which is uncommon in LLM memory systems but critical for sub-100ms retrieval latency (see the sketch after this list).
3. Hierarchical Cognitive Architecture
First open-source memory system to implement a complete five-tier cognitive model (M0-M3 + MSMG) inspired by human memory research.
4. Unified Search Fusion
Seamlessly combines vector, graph, and keyword search with intelligent RRF reranking—most systems only offer vector search.
5. Zero-Code Transparency
Memory layer works transparently without requiring changes to existing LLM API call patterns—just initialize and all subsequent calls automatically leverage memory.
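To make item 2 above concrete, here is a toy sketch of speculative prefetching driven by a first-order access-pattern model; the class and method names are hypothetical and not the actual SpeculativeBuffer implementation.

from collections import Counter, deque
from typing import Callable

class SpeculativePrefetcher:
    """Toy access-pattern model: remember which key tends to follow the one
    just accessed, and warm the cache for it before it is requested."""

    def __init__(self, fetch: Callable[[str], object], history: int = 100) -> None:
        self.fetch = fetch                       # slow path, e.g. a database read
        self.cache: dict[str, object] = {}
        self.followers: dict[str, Counter] = {}  # key -> counts of keys accessed next
        self.recent: deque = deque(maxlen=history)

    def get(self, key: str) -> object:
        if self.recent:                          # learn the "previous -> current" transition
            prev = self.recent[-1]
            self.followers.setdefault(prev, Counter())[key] += 1
        self.recent.append(key)

        value = self.cache.pop(key) if key in self.cache else self.fetch(key)
        self._prefetch_likely_successor(key)
        return value

    def _prefetch_likely_successor(self, key: str) -> None:
        counts = self.followers.get(key)
        if counts:
            likely, _ = counts.most_common(1)[0]
            if likely not in self.cache:
                self.cache[likely] = self.fetch(likely)  # warm before it is asked for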
Lessons Learned
- Cognitive architecture beats naive storage — Hierarchical memory tiers (M0-M3) dramatically outperform single-layer approaches
- Buffering is non-negotiable — Without aggressive caching and prefetching, retrieval latency kills conversational flow
- Local-first wins — Developers prefer self-hosted solutions with full data ownership over SaaS memory services
- Integration friction kills adoption — 3-line setup was critical—any more complexity and developers build their own
- Search fusion > single modality — Combining vector + graph + keyword search with RRF reranking surfaces better results than any single approach
Stack and Tools
- Backend: Python (96.4%), PLpgSQL (2.1%), Shell (1.3%)
- Database: TimescaleDB with custom pgai implementation, pgvector for embeddings
- Infrastructure: Docker containerization (timescale/timescaledb-ha:pg17), Poetry for dependency management
- Testing: Nox for automation, LongMemEval and MSC benchmarking frameworks (see the noxfile sketch after this list)
- Code Quality: Pre-commit hooks, Flake8, darglint
- APIs: OpenAI, Anthropic, Gemini, Ollama integrations
- Future Support: Neo4j, Qdrant, Redis backends
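For illustration, a Nox session that installs the package and drives the benchmark suites might look like the sketch below; the test paths and Python version are assumptions, not taken from the repository.

import nox

@nox.session(python="3.11")
def benchmarks(session: nox.Session) -> None:
    """Install the package and run the benchmark suites (paths are assumed)."""
    session.install("-e", ".")
    session.install("pytest")
    session.run("pytest", "tests/benchmarks/longmemeval", "tests/benchmarks/msc")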