# Reverie

Persistent AI memory achieving 94.6% on LongMemEval, #2 globally. A two-layer architecture that gives AI systems long-term memory with explicit knowledge-update tracking.
## Key Findings

- Within 0.27 points of #1, and statistically tied at this sample size (p > 0.05).
- Oracle experiments show LongMemEval is model-dominated: architectures add single-digit points over a strong baseline.
- Four additional layers were built, tested, and removed. Simplicity won: the minimal system produced the best result.
## LongMemEval Leaderboard
Top systems ranked by overall accuracy (n=500, GPT-4o judge). All scores use the standard evaluation methodology.
| Rank | System | LLM | Score |
|---|---|---|---|
| 1 | Mastra OM | GPT-5-mini | 94.87% |
| 2 | Reverie | Claude Sonnet 4.6 | 94.6% |
| 3 | Mastra OM | Gemini 3 Pro | 93.27% |
| 4 | Hindsight | Gemini 3 Pro | 91.4% |
| 5 | Emergence AI | GPT-4o | 86.0% |
| | Oracle baseline | GPT-4o | 82.4% |
## How It Works
Store everything, abstract on top. L1 keeps every conversational turn verbatim. L2 extracts declarative facts and detects when new information updates old information (supersession detection). Both layers are searched with hybrid vector + keyword retrieval.
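The hybrid retrieval step can be sketched as a weighted blend of vector and keyword scores over stored items. This is an illustrative minimum, not Reverie's implementation: `MemoryItem`, the toy keyword-overlap score, and the `alpha` weighting are all assumptions standing in for a real embedding model and BM25-style index.

```python
from dataclasses import dataclass
import math

@dataclass
class MemoryItem:
    text: str
    embedding: list[float]  # precomputed vector (hypothetical)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, text: str) -> float:
    # Toy stand-in for a real lexical index: fraction of query words present.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query_vec, query_text, items, alpha=0.5, k=3):
    """Blend vector and keyword scores; alpha weights the vector side."""
    scored = [
        (alpha * cosine(query_vec, it.embedding)
         + (1 - alpha) * keyword_score(query_text, it.text), it)
        for it in items
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [it for _, it in scored[:k]]
```

The same search runs over both L1 turns and L2 facts; only the item texts differ.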
Supersession tracking is the key architectural contribution. When "I have 3 cats" becomes "I have 4 cats," the system detects the update, marks the old fact as outdated, and ensures the synthesis model uses the correct value. This drives the 97% score on knowledge-update questions.
Four additional layers were tried and cut (association, abstraction, identity, prediction) along with contextual embeddings, weight decay, and LLM-declared edges. Every component had to earn its place through measured improvement. The paper documents what failed and why.
## The Saturation Finding
An Oracle experiment (the same synthesis model given perfect retrieval) scores 93.4%, meaning the full architecture contributes just +1.2 points. This isn't unique to Reverie: Mastra OM adds only +1.8 over its own Oracle baseline.
LongMemEval is approaching saturation as a differentiator of memory architectures. At ~115K tokens, its archives fit within modern context windows. The benchmark tests whether your LLM is good, not whether your architecture is good.
The field needs evaluation at scales where architecture genuinely matters.
## What's Next
Current benchmarks test whether AI can remember a fact from last Tuesday. The harder problem is whether AI can connect your chess habit to your competitive streak to your career ambitions: the kind of associative reasoning humans do effortlessly across hundreds of interactions.
### Scale-first benchmark
400+ sessions, repeated queries, archives exceeding context windows. Designed to test where architectural choices become decisive.
### Vocabulary bridging
The primary unsolved retrieval challenge: connecting "model kits" to "1/16 scale German Tiger I" when there are no shared keywords. Graph-based association and entity co-reference at scale.
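One way graph-based association could bridge vocabularies is multi-hop expansion over co-mention edges, so a query term reaches memories that share no keywords with it. This sketch is hypothetical: the `AssociationGraph` class and `link`/`expand` names are illustrative, and it leaves out the hard part, mining the edges at scale.

```python
from collections import defaultdict

class AssociationGraph:
    """Terms that co-occur in the archive get undirected edges;
    expand() walks a bounded number of hops from the query terms."""
    def __init__(self):
        self.edges: defaultdict[str, set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        self.edges[a].add(b)
        self.edges[b].add(a)

    def expand(self, terms: set[str], hops: int = 2) -> set[str]:
        frontier, seen = set(terms), set(terms)
        for _ in range(hops):
            # Neighbors of the current frontier, excluding anything visited.
            frontier = {n for t in frontier for n in self.edges[t]} - seen
            seen |= frontier
        return seen
```

With edges like "model kits" → "Tiger I" → "1/16 scale", a query for "model kits" can retrieve the Tiger I memory despite zero keyword overlap.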
### Associative memory
An association layer using entity co-reference and experiential edges was built and tested. It was net-negative on LongMemEval's small archives, but the hypothesis remains: at scale, with repeated queries, graph-based retrieval becomes necessary.