Reverie

Persistent AI memory achieving 94.6% on LongMemEval

#2 globally. A two-layer architecture that gives AI systems long-term memory with explicit knowledge-update tracking.

Key Findings

94.6% on LongMemEval (n=500). Within 0.27 points of #1 and statistically tied at this sample size (p > 0.05).

+1.2 architectural delta. Oracle experiments show LongMemEval is model-dominated; architectures add single-digit points over a strong baseline.

2 layers that matter. Four additional layers were built, tested, and removed. Simplicity won: the minimal system produced the best result.

LongMemEval Leaderboard

Top systems ranked by overall accuracy (n=500, GPT-4o judge). All scores use the standard evaluation methodology.

Rank  System           LLM                Score
1     Mastra OM        GPT-5-mini         94.87%
2     Reverie          Claude Sonnet 4.6  94.6%
3     Mastra OM        Gemini 3 Pro       93.27%
4     Hindsight        Gemini 3 Pro       91.4%
5     Emergence AI     GPT-4o             86.0%
--    Oracle baseline  GPT-4o             82.4%

How It Works

L2 (Facts): LLM-extracted knowledge with supersession detection
L1 (Experience): lossless raw episode storage

Store everything, abstract on top. L1 keeps every conversational turn verbatim. L2 extracts declarative facts and detects when new information updates old information (supersession detection). Both layers are searched with hybrid vector + keyword retrieval.
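One common way to combine vector and keyword results is rank fusion. The paper does not specify Reverie's fusion method, so this is a minimal sketch using reciprocal rank fusion (RRF) over two ranked lists, with a toy term-overlap score standing in for a real keyword ranker such as BM25; all names here are illustrative.

```python
from collections import Counter

def keyword_score(query, doc):
    # Toy term-frequency overlap, a stand-in for BM25.
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def rrf(rankings, k=60):
    # Reciprocal rank fusion: merge ranked ID lists from the
    # vector search and the keyword search into one ranking.
    # Items ranked highly by either retriever float to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a (hypothetical) vector ranking with a keyword ranking.
fused = rrf([["ep42", "ep07"], ["ep42", "ep19"]])
```

RRF is attractive here because it needs no score calibration between the two retrievers; only ranks matter.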

Supersession tracking is the key architectural contribution. When "I have 3 cats" becomes "I have 4 cats," the system detects the update, marks the old fact as outdated, and ensures the synthesis model uses the correct value. This drives the 97% score on knowledge-update questions.
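The update-detection logic can be sketched as follows. This is a simplified illustration, not Reverie's implementation: it assumes facts are keyed by a (subject, attribute) pair, whereas the real system uses LLM extraction to decide when two facts describe the same thing.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    superseded: bool = False

class FactStore:
    def __init__(self):
        self.facts = []

    def add(self, fact):
        # Supersession check: a new value for the same
        # (subject, attribute) key marks the old fact outdated.
        for old in self.facts:
            if (not old.superseded
                    and old.subject == fact.subject
                    and old.attribute == fact.attribute
                    and old.value != fact.value):
                old.superseded = True
        self.facts.append(fact)

    def current(self):
        # Only non-superseded facts reach the synthesis model.
        return [f for f in self.facts if not f.superseded]

store = FactStore()
store.add(Fact("user", "cat_count", "3"))  # "I have 3 cats"
store.add(Fact("user", "cat_count", "4"))  # "I have 4 cats"
```

After both additions, `store.current()` contains only the 4-cat fact; the 3-cat fact remains in the store but is flagged as superseded.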

Four additional layers were tried and cut (association, abstraction, identity, prediction) along with contextual embeddings, weight decay, and LLM-declared edges. Every component had to earn its place through measured improvement. The paper documents what failed and why.

The Saturation Finding

An Oracle experiment (same synthesis model with perfect retrieval) scores 93.4%, meaning architecture contributes just +1.2 points. This isn't unique to Reverie: Mastra OM adds just +1.8 over its Oracle baseline.

LongMemEval is approaching saturation as a differentiator of memory architectures. At ~115K tokens, its archives fit within modern context windows. The benchmark tests whether your LLM is good, not whether your architecture is good.

The field needs evaluation at scales where architecture genuinely matters.

What's Next

Current benchmarks test whether AI can remember a fact from last Tuesday. The harder problem is whether AI can connect your chess habit to your competitive streak to your career ambitions: the kind of associative reasoning humans do effortlessly across hundreds of interactions.

Scale-first benchmark

400+ sessions, repeated queries, archives exceeding context windows. Designed to test where architectural choices become decisive.

Vocabulary bridging

The primary unsolved retrieval challenge: connecting "model kits" to "1/16 scale German Tiger I" when there are no shared keywords. Graph-based association and entity co-reference at scale.
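A graph-based bridge might look like the following sketch: a toy entity graph whose edges link entities that co-occurred in past episodes, with breadth-first expansion of the query entity so retrieval can reach documents sharing no keywords with the query. The graph contents and hop count are illustrative assumptions, not a described component of Reverie.

```python
# Toy entity graph: edges link entities co-mentioned in past episodes.
graph = {
    "model kits": {"scale modeling"},
    "scale modeling": {"model kits", "1/16 scale German Tiger I"},
    "1/16 scale German Tiger I": {"scale modeling"},
}

def expand(query_entity, hops=2):
    # Breadth-first expansion: collect entities reachable within
    # `hops` edges, then search for all of them, bridging the
    # vocabulary gap between query and archive.
    frontier, seen = {query_entity}, {query_entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph.get(e, ())} - seen
        seen |= frontier
    return seen
```

With two hops, a query about "model kits" also retrieves episodes mentioning only "1/16 scale German Tiger I".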

Associative memory

An association layer using entity co-reference and experiential edges was built and tested. It was net-negative on LongMemEval's small archives, but the hypothesis remains: at scale, with repeated queries, graph-based retrieval becomes necessary.