# Reverie

Persistent AI memory achieving 94.6% on LongMemEval, #2 globally. A two-layer architecture that gives AI systems long-term memory with explicit knowledge-update tracking.
## Key Findings

- Within 0.27 points of #1, and statistically tied at this sample size (p > 0.05).
- Oracle experiments show LongMemEval is model-dominated: architectures add single-digit points over a strong baseline.
- Four additional layers were built, tested, and removed. Simplicity won: the minimal system produced the best result.
## LongMemEval Leaderboard
Top systems ranked by overall accuracy (n=500, GPT-4o judge). All scores use the standard evaluation methodology.
| Rank | System | LLM | Score |
|---|---|---|---|
| 1 | Mastra OM | GPT-5-mini | 94.87% |
| 2 | Reverie | Claude Sonnet 4.6 | 94.6% |
| 3 | Mastra OM | Gemini 3 Pro | 93.27% |
| 4 | Hindsight | Gemini 3 Pro | 91.4% |
| 5 | Emergence AI | GPT-4o | 86.0% |
| | Oracle baseline | GPT-4o | 82.4% |
## How It Works
Store everything, abstract on top. L1 keeps every conversational turn verbatim. L2 extracts declarative facts and detects when new information updates old information (supersession detection). Both layers are searched with hybrid vector + keyword retrieval.
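The hybrid retrieval step can be sketched as a weighted blend of vector and keyword scores over stored items. This is an illustrative minimum, not Reverie's implementation: `MemoryItem`, the toy keyword-overlap score, and the `alpha` weighting are all assumptions standing in for a real embedding model and BM25-style index.

```python
from dataclasses import dataclass
import math

@dataclass
class MemoryItem:
    text: str
    embedding: list[float]  # precomputed vector (hypothetical)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, text: str) -> float:
    # Toy stand-in for a real lexical index: fraction of query words present.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query_vec, query_text, items, alpha=0.5, k=3):
    """Blend vector and keyword scores; alpha weights the vector side."""
    scored = [
        (alpha * cosine(query_vec, it.embedding)
         + (1 - alpha) * keyword_score(query_text, it.text), it)
        for it in items
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [it for _, it in scored[:k]]
```

The same search runs over both L1 turns and L2 facts; only the item texts differ.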
Supersession tracking is the key architectural contribution. When "I have 3 cats" becomes "I have 4 cats," the system detects the update, marks the old fact as outdated, and ensures the synthesis model uses the correct value. This drives the 97% score on knowledge-update questions.
Four additional layers were tried and cut (association, abstraction, identity, prediction) along with contextual embeddings, weight decay, and LLM-declared edges. Every component had to earn its place through measured improvement. The paper documents what failed and why.
## The Saturation Finding
An Oracle experiment (the same synthesis model given perfect retrieval) scores 93.4%, meaning the full architecture contributes just +1.2 points. This isn't unique to Reverie: Mastra OM adds only +1.8 over its own Oracle baseline.
LongMemEval is approaching saturation as a differentiator of memory architectures. At ~115K tokens, its archives fit within modern context windows. The benchmark tests whether your LLM is good, not whether your architecture is good.
The field needs evaluation at scales where architecture genuinely matters.
## What's Next
Current benchmarks test whether AI can remember a fact from last Tuesday. The harder problem is whether AI can connect your chess habit to your competitive streak to your career ambitions: the kind of associative reasoning humans do effortlessly across hundreds of interactions.
### Scale-first benchmark
400+ sessions, repeated queries, archives exceeding context windows. Designed to test where architectural choices become decisive.
### Vocabulary bridging
The primary unsolved retrieval challenge: connecting "model kits" to "1/16 scale German Tiger I" when there are no shared keywords. Graph-based association and entity co-reference at scale.
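One way graph-based association could bridge vocabularies is multi-hop expansion over co-mention edges, so a query term reaches memories that share no keywords with it. This sketch is hypothetical: the `AssociationGraph` class and `link`/`expand` names are illustrative, and it leaves out the hard part, mining the edges at scale.

```python
from collections import defaultdict

class AssociationGraph:
    """Terms that co-occur in the archive get undirected edges;
    expand() walks a bounded number of hops from the query terms."""
    def __init__(self):
        self.edges: defaultdict[str, set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        self.edges[a].add(b)
        self.edges[b].add(a)

    def expand(self, terms: set[str], hops: int = 2) -> set[str]:
        frontier, seen = set(terms), set(terms)
        for _ in range(hops):
            # Neighbors of the current frontier, excluding anything visited.
            frontier = {n for t in frontier for n in self.edges[t]} - seen
            seen |= frontier
        return seen
```

With edges like "model kits" → "Tiger I" → "1/16 scale", a query for "model kits" can retrieve the Tiger I memory despite zero keyword overlap.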
### Associative memory
An association layer using entity co-reference and experiential edges was built and tested. It was net-negative on LongMemEval's small archives, but the hypothesis remains: at scale, with repeated queries, graph-based retrieval becomes necessary.