MemFuse — Memory Infrastructure for LLMs
Open-source memory infrastructure for LLMs, combining a cognitive architecture, intelligent buffering, and unified search to address the statelessness problem inherent in LLM-based applications

Problem
Large Language Models are fundamentally stateless. Every API call operates in isolation, and when context windows fill up (4K-128K tokens), all conversation history, user preferences, and critical context vanish. This creates four problems:
- LLMs forget everything beyond their context window—no persistent learning across sessions
- Developers waste tokens and money resending entire chat histories with every API call
- Every team rebuilds memory systems from scratch instead of reusing battle-tested infrastructure
- No standard solution exists—each application implements ad-hoc memory management
Without persistent memory, AI applications become forgetful assistants that start from zero with every conversation.
Approach
MemFuse implements a cognitive architecture inspired by human memory systems, combined with aggressive performance optimizations for production workloads.
Five-Tier Cognitive Memory System
Built the first open-source implementation of hierarchical memory for LLMs (a conceptual sketch follows this list):
- M0 (Episodic Layer) — Raw conversation history, like short-term working memory
- M1 (Semantic Layer) — Extracted facts and entities from conversations
- M2 (Knowledge Graph) — Complex relational knowledge with graph structures
- M3 (Procedural Layer) — Learned behavioral patterns and procedures
- MSMG (Multi-Graph) — Advanced graph-based semantic relationships
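Conceptually, each tier is a separate store, with higher tiers populated by extraction pipelines running over the tier below. The sketch below is purely illustrative; MemoryTier, MemoryRecord, and HierarchicalMemory are hypothetical names, not the actual MemFuse API.

from dataclasses import dataclass, field
from enum import Enum

class MemoryTier(Enum):
    M0_EPISODIC = "m0"         # raw conversation history
    M1_SEMANTIC = "m1"         # extracted facts and entities
    M2_KNOWLEDGE_GRAPH = "m2"  # relational knowledge
    M3_PROCEDURAL = "m3"       # learned behavioral patterns
    MSMG = "msmg"              # multi-graph semantic relationships

@dataclass
class MemoryRecord:
    tier: MemoryTier
    content: str
    metadata: dict = field(default_factory=dict)

class HierarchicalMemory:
    """Toy container: every raw message lands in M0; higher tiers are
    populated later by extraction jobs (fact extraction, graph building, ...)."""

    def __init__(self) -> None:
        self.stores: dict[MemoryTier, list[MemoryRecord]] = {t: [] for t in MemoryTier}

    def add_episode(self, message: str) -> None:
        self.stores[MemoryTier.M0_EPISODIC].append(
            MemoryRecord(MemoryTier.M0_EPISODIC, message)
        )

    def promote(self, record: MemoryRecord, target: MemoryTier) -> None:
        # e.g. a fact extractor promotes an episodic record into the semantic layer
        self.stores[target].append(MemoryRecord(target, record.content, dict(record.metadata)))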
Advanced Buffer System Architecture
Designed a three-component buffering system that dramatically reduces latency (see the sketch after this list):
- WriteBuffer — Batches small writes into larger operations through write aggregation
- SpeculativeBuffer — Analyzes access patterns and proactively prefetches data before requests
- QueryBuffer — Caches frequently accessed data with intelligent reranking using Reciprocal Rank Fusion (RRF)
A central BufferManager orchestrates data flow across all memory tiers, optimizing for minimal latency at each level.
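A minimal sketch of the two simplest mechanics, write aggregation and query caching, follows. The class names, batch threshold, and cache policy here are hypothetical; the real WriteBuffer, QueryBuffer, and BufferManager coordination are more involved.

from typing import Any, Callable

class WriteBuffer:
    """Illustrative write aggregation: small writes are queued and flushed
    to storage as one batch once a size threshold is reached."""

    def __init__(self, flush_fn: Callable[[list[Any]], None], max_batch: int = 32) -> None:
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.pending: list[Any] = []

    def write(self, item: Any) -> None:
        self.pending.append(item)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.flush_fn(self.pending)  # one bulk insert instead of many small ones
            self.pending = []

class QueryBuffer:
    """Illustrative query cache keyed by the query string."""

    def __init__(self) -> None:
        self.cache: dict[str, list[Any]] = {}

    def get_or_compute(self, query: str, compute: Callable[[], list[Any]]) -> list[Any]:
        if query not in self.cache:
            self.cache[query] = compute()
        return self.cache[query]

Batching trades a small amount of write latency for far fewer round trips to the database, which is why flushing happens on a size threshold rather than per message.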
Unified Cognitive Search
Implemented search fusion combining three modalities with RRF reranking (see the sketch after this list):
- Vector search for semantic similarity (pgvector)
- Graph search for relational reasoning
- Keyword search for exact matching
- RRF reranking to surface the most relevant results across all sources
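RRF scores each result by summing 1/(k + rank) across the ranked lists it appears in, so items that place well in several modalities float to the top even when no single modality ranks them first. A minimal, self-contained sketch (document IDs and the conventional k = 60 are illustrative):

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank),
    where rank is the 1-based position of doc in each list."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse hits from the three modalities (IDs are illustrative):
vector_hits  = ["doc3", "doc1", "doc7"]
graph_hits   = ["doc1", "doc4"]
keyword_hits = ["doc7", "doc1", "doc9"]
print(rrf_fuse([vector_hits, graph_hits, keyword_hits]))  # doc1 ranks first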
Custom TimescaleDB Implementation
Built a custom pgai-like implementation for embedding and vector operations instead of relying on TimescaleDB's official extension, allowing greater control and customization for this specific use case.
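As a rough illustration of the kind of vector operation this involves, the sketch below runs a nearest-neighbour query against a pgvector column using psycopg2. The memory_items table, its columns, and the connection handling are assumptions for the example, not the actual MemFuse schema.

import psycopg2  # assumes the pgvector extension is installed in the target database

# Hypothetical table layout; the actual MemFuse schema may differ.
QUERY = """
    SELECT id, content
    FROM memory_items
    ORDER BY embedding <=> %s::vector  -- cosine distance operator from pgvector
    LIMIT %s;
"""

def nearest_memories(dsn: str, query_embedding: list[float], top_k: int = 5):
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (vector_literal, top_k))
        return cur.fetchall()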
Multi-Tenant Architecture
Designed secure isolation with flexible scoping (example after this list):
- User (required) — Person identity
- Agent (optional) — Specific AI assistant (defaults to "agent_default")
- Session (optional) — Conversation thread identifier
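Assuming agent and session are accepted as keyword arguments alongside user (only the user-scoped form appears in the integration example later in this document), creating a narrow and a broad scope might look like this:

from memfuse import MemFuse

memfuse_client = MemFuse()

# Narrowest scope: one user, one assistant, one conversation thread.
# (The agent and session keyword arguments are assumed here; only user is required.)
support_memory = memfuse_client.init(
    user="alice",
    agent="support_bot",   # defaults to "agent_default" when omitted
    session="ticket-1234",
)

# Broadest scope: everything known about this user, across agents and sessions.
user_memory = memfuse_client.init(user="alice")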
Impact
Developer Experience Revolution
3-line integration with zero-code transparency; memory works without changing existing LLM API call patterns:
import os

from memfuse import MemFuse
from memfuse.llm import OpenAI

# Initialize memory scope
memfuse_client = MemFuse()
memory = memfuse_client.init(user="alice")

# Standard OpenAI client with memory automatically enabled
llm_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), memory=memory)

# All subsequent calls leverage persistent memory
response = llm_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's Mars gravity?"}],
)

Follow-up questions automatically retrieve relevant context without manual history management.
Technical Achievements
- Local-first architecture — No mandatory cloud dependencies or licensing fees
- Apache 2.0 licensed — Fully open source and fork-friendly
- Production-ready testing — Comprehensive test suite with LongMemEval and MSC benchmarks
- 20 GitHub stars — Active development with 143 commits, 2 core contributors
- Framework integrations — LangChain, AutoGen, Vercel AI SDK, direct API support
Performance Optimizations
- 90% token savings through intelligent history management (avoids resending full context)
- Sub-100ms retrieval via speculative prefetching and query caching
- Write aggregation for database efficiency
- Semantic validation using the MSC dataset with a 95% accuracy threshold
Real-World Use Cases
Deployed across multiple domains:
- Conversational AI with persistent user preferences
- Educational tutors tracking student progress across sessions
- CLI coding assistants learning developer patterns
- Multi-agent systems with shared memory
- Research assistants with deep context retention
Technical Innovations
1. Custom pgai Implementation
Built custom embedding and vector operation capabilities instead of relying on TimescaleDB's official extension, enabling fine-grained control over performance characteristics.
2. Speculative Prefetching
Implemented predictive data loading based on access-pattern analysis, which is uncommon in LLM memory systems but critical for sub-100ms retrieval latency (see the sketch after this list).
3. Hierarchical Cognitive Architecture
First open-source memory system to implement a complete five-tier cognitive model (M0-M3 + MSMG) inspired by human memory research.
4. Unified Search Fusion
Seamlessly combines vector, graph, and keyword search with intelligent RRF reranking—most systems only offer vector search.
5. Zero-Code Transparency
Memory layer works transparently without requiring changes to existing LLM API call patterns—just initialize and all subsequent calls automatically leverage memory.
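To make item 2 above concrete, here is a toy sketch of speculative prefetching driven by a first-order access-pattern model; the class and method names are hypothetical and not the actual SpeculativeBuffer implementation.

from collections import Counter, deque
from typing import Callable

class SpeculativePrefetcher:
    """Toy access-pattern model: remember which key tends to follow the one
    just accessed, and warm the cache for it before it is requested."""

    def __init__(self, fetch: Callable[[str], object], history: int = 100) -> None:
        self.fetch = fetch                       # slow path, e.g. a database read
        self.cache: dict[str, object] = {}
        self.followers: dict[str, Counter] = {}  # key -> counts of keys accessed next
        self.recent: deque = deque(maxlen=history)

    def get(self, key: str) -> object:
        if self.recent:                          # learn the "previous -> current" transition
            prev = self.recent[-1]
            self.followers.setdefault(prev, Counter())[key] += 1
        self.recent.append(key)

        value = self.cache.pop(key) if key in self.cache else self.fetch(key)
        self._prefetch_likely_successor(key)
        return value

    def _prefetch_likely_successor(self, key: str) -> None:
        counts = self.followers.get(key)
        if counts:
            likely, _ = counts.most_common(1)[0]
            if likely not in self.cache:
                self.cache[likely] = self.fetch(likely)  # warm before it is asked for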
Lessons Learned
- Cognitive architecture beats naive storage — Hierarchical memory tiers (M0-M3) dramatically outperform single-layer approaches
- Buffering is non-negotiable — Without aggressive caching and prefetching, retrieval latency kills conversational flow
- Local-first wins — Developers prefer self-hosted solutions with full data ownership over SaaS memory services
- Integration friction kills adoption — 3-line setup was critical—any more complexity and developers build their own
- Search fusion > single modality — Combining vector + graph + keyword search with RRF reranking surfaces better results than any single approach
Stack and Tools
- Backend: Python (96.4%), PLpgSQL (2.1%), Shell (1.3%)
- Database: TimescaleDB with custom pgai implementation, pgvector for embeddings
- Infrastructure: Docker containerization (timescale/timescaledb-ha:pg17), Poetry for dependency management
- Testing: Nox for automation, LongMemEval and MSC benchmarking frameworks (see the noxfile sketch after this list)
- Code Quality: Pre-commit hooks, Flake8, darglint
- APIs: OpenAI, Anthropic, Gemini, Ollama integrations
- Future Support: Neo4j, Qdrant, Redis backends
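For illustration, a Nox session that installs the package and drives the benchmark suites might look like the sketch below; the test paths and Python version are assumptions, not taken from the repository.

import nox

@nox.session(python="3.11")
def benchmarks(session: nox.Session) -> None:
    """Install the package and run the benchmark suites (paths are assumed)."""
    session.install("-e", ".")
    session.install("pytest")
    session.run("pytest", "tests/benchmarks/longmemeval", "tests/benchmarks/msc")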