Memory benchmarks should measure memory.
A note on adversarial AI auditing, why we re-ran the same 10 small LLMs Hindsight evaluated against a JSON-schema protocol, and what 270 million parameters can do under plain-text retrieval.
I. The starting position
Most "AI memory" projects today benchmark themselves on retrieval. Recall@K, MRR, latency. A pile of synthetic queries against a pile of synthetic facts. Numbers go up; a number goes on the marketing page; nobody asks what the number measures.
We didn't want to be that.
So when we set out to benchmark Mazemaker — our memory engine for AI agents — we made one rule: a peer reviewer has to accept the methodology. Not the operator. Not a friendly LLM. A peer reviewer with skin in the rejection.
We submitted the entire benchmark suite to GPT-5.5 via the codex CLI. Eight rounds. The first two rejected the suite outright. By round eight, every objection had been closed by code change — not argument. The verdict was unconditional yes — no residual caveat.
Every audit prompt and every audit verdict is committed verbatim in the repository. You can read them. You can run them yourself. The transcripts are at benchmarks/audit/.
This is what we found.
II. The audit chain (v2 → v8)
| Round | Verdict | The objection that got fixed |
|---|---|---|
| v2 | NO | Lexical leakage; salience/dream/MMR never measured; broken dream suite |
| v3 | NO | Topic-word leakage in 18-24% of queries; cross-instance anchor collisions; channel_ablation defaults wrong; no graph task that traversal could prove anything on |
| v4 | qualified-yes | Source fixes landed; would accept iff the real run produced graph_lift + shuffle_collapse + strict post-dream lift |
| v5 | YES | All four conditions satisfied with cited numbers; four named caveats remained |
| v6 | qualified-yes | Real-text mode + lean preset + score_percentile shipped |
| v7 | qualified-yes | Lean BEAT skynet by +0.18 R@5 on real prose; only "dream lift on real text remains weak (+0.04)" stayed |
| v8 | UNCONDITIONAL YES | Dream lift caveat closed: at n=75 / 600 distractors / k=5, dream lift jumps to +0.4267. The +0.04 was a sample-size artifact. No residual caveat. |
The v5 quote, which became our headline:
"Yes. I would upgrade the v4 qualified-y to yes for this executed benchmark. A peer reviewer should accept that this run demonstrates mazemaker doing something a vanilla vector store cannot: explicit edge-following recovers hidden chain targets, shuffled edges collapse most of that gain, and dream-derived facts appear only after the dream phase under strict pre/post controls."
— codex v5 verdict, 2026-04-28
We didn't ask for that quote. We earned it by submitting first to v2, getting rejected, and shipping fixes.
III. The numbers that survived the audit
Hop-2 graph reasoning (the headline test)
Build 30 chains where memory A says "see the B planner", B says "see the C executor", and only C contains the answer. Add explicit (A→B) and (B→C) edges. No A→C shortcut, asserted by row count.
raw_cosine : R@10 = 0.0000 ← vanilla cannot traverse
nm_skynet : R@10 = 0.9333
nm_multihop : R@10 = 1.0000 ← perfect on hop-2
[control: shuffle the chain edges to random pairings, same edge count]
multihop_ctrl : R@10 = 0.2667 ← collapses without real edges
graph_lift_vs_raw = +1.0000. shuffle_collapse = +0.7333. The lift is edge-driven, not embedding-driven.
This is the structural argument. Vanilla cosine does not solve hop-2 reasoning by construction — there is no high-similarity bridge from A's text to C's text when the question token is on A and the answer token is on C. Adding the right edges fixes it. Shuffling the edges to wrong pairings collapses it. The mechanism is doing the work, not the embedding model accidentally helping.
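The edge-following argument and its shuffled-edge control can be illustrated with a toy graph. This is only a sketch under stated assumptions: `follow_edges` and `hop2_recall` are hypothetical helper names, not Mazemaker APIs, and the sketch assumes the retriever has already landed on the entry memory A, so it models the traversal step alone and not the embedding search.

```python
import random

def follow_edges(seeds, edges, hops=2):
    """Expand a seed set by following directed edges for up to `hops` steps."""
    reached, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {dst for src, dst in edges if src in frontier} - reached
        reached |= frontier
    return reached

def hop2_recall(chains, edges):
    """Fraction of chains whose answer node C is reachable from entry node A."""
    hits = sum(1 for a, _, c in chains if c in follow_edges({a}, edges))
    return hits / len(chains)

# 30 toy chains A_i -> B_i -> C_i, mirroring the chain count in the text.
chains = [(f"A{i}", f"B{i}", f"C{i}") for i in range(30)]
real_edges = [(a, b) for a, b, _ in chains] + [(b, c) for _, b, c in chains]

# No direct A -> C shortcut may exist.
assert not any((a, c) in real_edges for a, _, c in chains)

# Control: same sources and same edge count, destinations randomly re-paired.
rng = random.Random(0)
dsts = [d for _, d in real_edges]
rng.shuffle(dsts)
shuffled_edges = [(s, d) for (s, _), d in zip(real_edges, dsts)]

real = hop2_recall(chains, real_edges)
ctrl = hop2_recall(chains, shuffled_edges)
print(f"real={real:.4f} shuffled={ctrl:.4f} shuffle_collapse={real - ctrl:+.4f}")
```

With the real edges every C is reachable in two hops; with re-paired edges most chains break, which is the same shape as the reported multihop vs. multihop_ctrl gap. The toy numbers will not match the benchmark's.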
Pre/post dream synthesis
Build 25 (P1, P2) premise pairs about the same entity. Inject 300 distractor paraphrase memories. Pre-dream, no single memory contains both attribute tokens. Pre-dream is 0.00 by template construction.
| metric | pre-dream | post-dream | lift |
|---|---|---|---|
| single_doc_both_tokens_rate | 0.00 | 0.00 | 0.00 |
| derived_fact_hit_rate | 0.00 | 0.32 | +0.32 |
| derived:* memories | 0 | 12 | +12 |
Dream cycle elapsed: 0.43 s.
The Insight phase materialises cluster memories the user never wrote. Multihop retrieval surfaces them. Pre-dream is structurally zero, so any post-dream signal is dream-attributable.
Continuity under adversarial distractors
Each tier injects 2 near-distractors per target carrying the query's vocabulary on a fresh unrelated entity. Targets are stored in "session 1"; the query happens after N sessions of noise.
| tier | total noise | distractors | nm | raw cosine | recency-only |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.66 | 0.46 | 0.10 |
| 1 | 200 | 100 | 0.62 | 0.06 | 0.00 |
| 2 | 1200 | 200 | 0.58 | 0.06 | 0.00 |
| 3 | 6200 | 300 | 0.20 | 0.06 | 0.00 |
Raw cosine collapses 0.46 → 0.06 once the design pulls the rare-token shortcut. Mazemaker wins at every tier. Recency-only is the pathological control — anything beating "newest wins" is doing semantic work.
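Why recency-only is a useful floor can be seen in a few lines. The sketch below is an illustration, not the benchmark's code: the `Memory` class and `recency_only` function are hypothetical names, and the memory texts are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Memory:
    text: str
    session: int  # higher means stored more recently

def recency_only(memories, k):
    """Return the k newest memories; the query is never consulted."""
    return sorted(memories, key=lambda m: m.session, reverse=True)[:k]

# Target is stored in session 1, then 200 later sessions of noise arrive.
target = Memory("the vault code is 4417", session=1)
noise = [Memory(f"unrelated note {i}", session=2 + i) for i in range(200)]

top5_clean = recency_only([target], k=5)
top5_noisy = recency_only([target] + noise, k=5)
print(target in top5_clean, target in top5_noisy)  # True False
```

Because the control ignores the query entirely, any amount of later noise pushes a session-1 target out of the top-K, which matches the 0.00 column from tier 1 onward.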
Conflict supersession
Two arms: with the supersession algorithm, and a control with detect_conflicts=False.
| arm | winner @ 1 | loser_above_winner |
|---|---|---|
| with_supersession | 0.3333 | 0.1333 |
| control (no supersession) | 0.0333 | 0.6000 |
| supersession_lift | +0.3000 | +0.4667 (collapsed) |
Without the supersession algorithm, the stale fact ranks above the new one in 60% of cases. With supersession on, that drops to 13%, and the new fact takes rank 1 in 33% of cases instead of 3%. The control shows the lift comes from supersession itself, not from recency or vector similarity.
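For readers who want to recompute the two columns from their own rankings, here is a minimal sketch. The definitions are assumptions inferred from the column names: "winner @ 1" is taken to mean the current fact is ranked first, and "loser_above_winner" to mean the stale fact ranks strictly above the current one, with an absent item treated as ranked last. `supersession_metrics` is a hypothetical helper, not a function from the repository.

```python
def supersession_metrics(cases):
    """cases: list of (ranked_ids, winner_id, loser_id), one per conflict.

    Returns (winner_at_1, loser_above_winner) as fractions of all cases.
    """
    def rank(ranked, item):
        # Items missing from the result list count as ranked after everything.
        return ranked.index(item) if item in ranked else len(ranked)

    n = len(cases)
    winner_at_1 = sum(1 for ranked, w, _ in cases if ranked and ranked[0] == w) / n
    loser_above = sum(
        1 for ranked, w, l in cases if rank(ranked, l) < rank(ranked, w)
    ) / n
    return winner_at_1, loser_above

# Toy rankings: "new" is the current fact, "old" the superseded one.
cases = [
    (["new", "old"], "new", "old"),       # winner first
    (["old", "new"], "new", "old"),       # stale fact on top
    (["x", "new", "old"], "new", "old"),  # winner above loser, but not at rank 1
    (["old", "x"], "new", "old"),         # winner missing entirely
]
w1, lab = supersession_metrics(cases)
print(w1, lab)  # 0.25 0.5
```

Running both arms through the same function and subtracting gives the lift row in the table.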
IV. The Comparison Bench — the same 10 small LLMs, different protocol
Hindsight publishes an AI-memory benchmark with open methodology and a public leaderboard at benchmarks.hindsight.vectorize.io — 91.4% on LongMemEval-S with Gemini-3 Pro, well-documented runs end to end. We’re fans of how transparently they put their methodology out there.
On the same page they list 10 small open-source models that scored 0/N against their structured JSON-output protocol — the models couldn’t reliably produce a parseable JSON object matching the required schema:
✗ gemma3:1b
✗ gemma3:12b
✗ gemma3:270m
✗ qwen2.5:0.5b
✗ qwen2.5:3b
✗ smollm2:1.7b
✗ deepseek-r1:1.5b
✗ granite3.1-dense:2b
✗ llama3.2:latest
✗ ministral-3:3b
We took the same 10 models and ran them through Mazemaker on the same dataset (a 20-question synthetic set covering the five LongMemEval categories: factual, multi-session, temporal, entity-tracking, multi-step). Substring-match scoring on the model’s plain-text answer — we asked in English instead of demanding JSON.
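Substring-match scoring on plain text can be as small as the sketch below. The post only states that scoring is substring-based; the case folding, whitespace normalization, and support for several accepted gold strings are assumptions added here, and `substring_match` is a hypothetical name.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def substring_match(answer, gold_variants):
    """True if any accepted gold string occurs inside the model's answer."""
    haystack = normalize(answer)
    return any(normalize(gold) in haystack for gold in gold_variants)

print(substring_match("I believe the meeting was moved to  Lisbon.", ["lisbon"]))  # True
print(substring_match("I don't have that information.", ["lisbon"]))               # False
```

The design trade-off is leniency: a scorer like this accepts any phrasing that contains the gold string, so it never penalizes a model for formatting, but it can also credit an answer that mentions the gold string alongside wrong ones.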
The results:
| Model | Hindsight | no-ColBERT | ColBERT @ 1.5 |
|---|---|---|---|
| gemma3:12b | 0/N | 20/20 (100%) | 20/20 (100%) |
| smollm2:1.7b | 0/N | 20/20 (100%) | 20/20 (100%) |
| granite3.1-dense:2b | 0/N | 20/20 (100%) | 20/20 (100%) |
| ministral-3:3b | 0/N | 19/20 (95%) | 20/20 (100%) |
| llama3.2:latest | 0/N | 19/20 (95%) | 19/20 (95%) |
| qwen2.5:3b | 0/N | 19/20 (95%) | 19/20 (95%) |
| gemma3:1b | 0/N | 19/20 (95%) | 19/20 (95%) |
| gemma3:270m | 0/N | 18/20 (90%) | 18/20 (90%) |
| qwen2.5:0.5b | 0/N | 18/20 (90%) | 18/20 (90%) |
| deepseek-r1:1.5b | 0/N | 14/20 (70%) | 15/20 (75%) |
Aggregate: 186/200 = 93.0% (no-ColBERT) / 188/200 = 94.0% (ColBERT @ 1.5, 0 errors deterministic). Zero JSON leaks across all 200 calls in both runs — the structured-output question simply isn’t being asked on this protocol. The two benches measure different capabilities of the same models, which is exactly the point of running them side by side.
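The aggregate line can be re-derived from the per-model rows. The snippet below copies the counts from the table above and sums them; nothing in it is new data.

```python
# (no-ColBERT correct, ColBERT @ 1.5 correct) out of 20, copied from the table.
scores = {
    "gemma3:12b": (20, 20),
    "smollm2:1.7b": (20, 20),
    "granite3.1-dense:2b": (20, 20),
    "ministral-3:3b": (19, 20),
    "llama3.2:latest": (19, 19),
    "qwen2.5:3b": (19, 19),
    "gemma3:1b": (19, 19),
    "gemma3:270m": (18, 18),
    "qwen2.5:0.5b": (18, 18),
    "deepseek-r1:1.5b": (14, 15),
}

QUESTIONS = 20
total = QUESTIONS * len(scores)
no_colbert = sum(a for a, _ in scores.values())
colbert = sum(b for _, b in scores.values())

print(f"no-ColBERT:    {no_colbert}/{total} = {no_colbert / total:.1%}")
print(f"ColBERT @ 1.5: {colbert}/{total} = {colbert / total:.1%}")
```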
gemma3:270m is Google's smallest production-deployed LLM. 270 million parameters. Fits in 500 MB of RAM. Runs on a Raspberry Pi. With Mazemaker: 18/20 = 90% in both ColBERT-on and ColBERT-off conditions. With Hindsight: zero. ColBERT @ 1.5 doesn't move the needle on this 270M model in this saturated 20-question dataset — but on the harder 500-question LongMemEval-S below, ColBERT does the work it was designed for.
The deepseek-r1 outlier (70%/75%) is honest: that model emits a <think>...</think> chain-of-thought preamble before its answer, and the preamble can exhaust our default num_predict=256 budget, so the output is cut off before the answer appears. Real model limitation under this budget, not retrieval-attributable. We disclose it because hiding it would be the same kind of methodological failure we're calling out.
The reproducibility story behind ColBERT @ 1.5
The first ColBERT-on run scored 168/200 with 24 HTTP-500 errors. Every error came from ollama, not Mazemaker — the in-process ColBERT model and ollama were fighting for VRAM. Two surgical fixes in benchmarks/external/comparison_bench.py:
- Top-of-file: hide CUDA from the bench python (CUDA_VISIBLE_DEVICES="", MM_COLBERT_DEVICE=cpu). Now nothing competes with ollama for the GPU.
- New ollama_evict(model) sending keep_alive: 0 between models. No more stacked-VRAM OOM.
Re-run: 0 / 200 errors, 188/200 correct, deterministic. That's the reproducibility guarantee — same numbers on your machine as on ours.
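The two fixes might look roughly like the sketch below. It is not the code from comparison_bench.py: only the environment variable names, the `ollama_evict` name, and `keep_alive: 0` come from the post. The rest is assumed, including the `build_evict_request` helper, the use of Ollama's `/api/generate` endpoint (which unloads a model when called with no prompt and `keep_alive` set to 0), and the default local address.

```python
import os

# Set before any CUDA-using library is imported, so this process sees no GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["MM_COLBERT_DEVICE"] = "cpu"  # variable name as given in the post

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed here)

def build_evict_request(model, base_url=OLLAMA_URL):
    """Build a POST asking Ollama to unload `model` immediately."""
    payload = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ollama_evict(model, base_url=OLLAMA_URL, timeout=30):
    """Send the unload request; returns True on HTTP 200."""
    request = build_evict_request(model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200

# Usage between models (requires a running Ollama server):
#     ollama_evict("gemma3:270m")
```

Splitting request construction from the network call keeps the payload inspectable without a live server.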
V. LongMemEval-S 500-question retrieval — the non-saturated benchmark
LongMemEval-S is the public 500-question memory benchmark from Wu et al. (ICLR 2025). Six question types, multi-session haystacks averaging ~48 sessions per question. We ran it twice through the same harness, same dataset (sha256 d6f21ea9…), same config (recall_mode=hybrid, k=10, granularity=session). The only thing that changes between the two runs is whether the ColBERT @ 1.5 late-interaction channel is on.
| Metric | no-ColBERT | ColBERT @ 1.5 | Δ |
|---|---|---|---|
| R@1 | 0.8064 | 0.8574 | +5.10 pp |
| R@5 | 0.9596 | 0.9787 | +1.91 pp |
| R@10 | 0.9830 | 0.9894 | +0.64 pp |
| MRR | 0.8733 | 0.9114 | +3.81 pp |
| p50 latency | 41.1 ms | 56.9 ms | +15.8 ms |
| p95 latency | 65.2 ms | 60.8 ms | −4.4 ms |
| n_gradeable | 470 / 500 | 470 / 500 | — |
ColBERT @ 1.5 lifts three of six question types to perfect R@5 (knowledge-update, multi-session, single-session-assistant) and gives the largest single-category swing on single-session-user (+7.8 pp R@5, +10.4 pp MRR). This is the regime late-interaction retrieval was designed for. The synthetic 20-question Comparison Bench is too saturated for ColBERT to help (most models already hit 95–100%), while the 500-question LongMemEval-S leaves room for the channel to move the needle.
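For reference, the retrieval metrics in the table can be computed as in the sketch below. It assumes the common "any relevant item in the top K" definition of Recall@K and first-hit reciprocal rank for MRR; the harness may define them differently (for example, requiring all evidence sessions), so treat this as an illustration and not as the scoring code. The session ids are made up.

```python
def recall_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top-k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def evaluate(runs, k):
    """Average Recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return recall, mrr

# Three toy questions with hypothetical session ids.
runs = [
    (["s3", "s1", "s2"], {"s1"}),  # first hit at rank 2
    (["s9", "s8", "s7"], {"s7"}),  # first hit at rank 3
    (["s4", "s5"], {"s6"}),        # never retrieved
]
recall_2, mrr = evaluate(runs, k=2)
print(f"R@2={recall_2:.4f} MRR={mrr:.4f}")  # R@2=0.3333 MRR=0.2778
```

Averages like these are taken over gradeable questions only, which is why the table reports n_gradeable next to the metrics.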
Result files in the repo:
- longmemeval_s_master-baseline_20260509T214714Z.json (no-ColBERT)
- longmemeval_s_colbert-on-master_20260510T034308Z.json (ColBERT @ 1.5)
VI. What this is — and what it isn't
We are not claiming Mazemaker has fundamentally better end-to-end QA than Hindsight. The numbers above are retrieval (does the right memory rank in the top-K) — that's upstream of full E2E QA. We don't have neutral end-to-end QA numbers on LongMemEval-S yet, and we'd rather not make claims we can't immediately back.
We are claiming this:
Memory benchmarks should measure memory. They shouldn't gate models on whether they can act as a structured-output orchestrator. Different layer of the stack. Different evaluation. If your benchmark says a 270-million-parameter model has zero memory capability, your benchmark is testing something else.
You don't have to take our word for it. You can run it yourself:
bash <(curl -fsSL https://mazemaker.dev/bench.sh)
One curl. ~12 minutes on a 16-GB-VRAM machine. Pulls the same 10 models, runs the same harness, prints your own table. If your numbers differ from ours, please open an issue with your environment + the result JSON. We want to know what we got wrong, in public.
VII. The honest limitations
Five things we want to call out up front, because they will be called out by anyone who reads carefully:
- The Comparison Bench dataset is synthetic and small. 20 questions × 10 models = 200 trials. The five categories cover what LongMemEval-S tests, but the variance band is ±2-3 questions per model, and at n=20 a single question is worth 5 points. Treat gaps of 5-10 points (one or two questions) between models as noise.
- Continuity at tier 3 (6200 noise memories) drops to 0.20. Mazemaker still beats raw cosine (0.06) and recency (0.00) at every tier, but the lift narrows. We have not benchmarked at million-memory scale.
- Lean BEATS skynet on real prose, by +0.18 R@5. That means our default skynet preset over-includes weak channels. The lean preset codifies the right mix; we ship both. We're not pretending skynet is universally optimal — it isn't.
- Latency is real. Skynet p50 is 339ms on the synthetic anchor-paraphrase task. Lean is 4× faster at -0.02 recall. This is an engineering knob, not a benchmark issue, but it's a real cost in production.
- The dream engine's real-text lift was +0.04 at v7 — an apparent caveat that closed at v8 only when n was scaled. Sample-size artifacts cut both ways. We disclose the v7 number because hiding it would be the same kind of cherry-pick we're objecting to elsewhere.
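The first limitation, small-sample noise at n=20, can be made concrete with a confidence interval. The sketch below uses the Wilson score interval at 95% (z = 1.96). That choice is ours for illustration; the post does not say how its ±2-3 question band was estimated.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Per-model score from the Comparison Bench vs. the pooled 200-trial aggregate.
lo_model, hi_model = wilson_interval(18, 20)
lo_pool, hi_pool = wilson_interval(188, 200)
print(f"18/20   -> [{lo_model:.3f}, {hi_model:.3f}]")
print(f"188/200 -> [{lo_pool:.3f}, {hi_pool:.3f}]")
```

For 18/20 the interval spans roughly 0.70 to 0.97, so a 90% and a 95% model are not separable on this dataset. The pooled 200-trial interval is much tighter, which is why the aggregate is the more defensible number to quote.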
VIII. What's next
The 500-question LongMemEval-S retrieval numbers above (R@5 = 0.9787, MRR = 0.9114 with ColBERT @ 1.5) are the upstream signal. The downstream end-to-end QA pipeline — building the prompt from the top-5 memories, scoring with a judge model on the full LongMemEval-S evaluation harness — is the next deliverable. Same methodology, same repo, same disclosures.
We're also implementing Dream-Augmented Embeddings (DAE) — a Mazemaker-only feature that uses idle dream-cycle compute to give each memory a second embedding, weighted toward the consolidated context the memory formed during NREM. Hindsight, Letta, A-MEM, and Cognee don't offer this today — none of them ship autonomous multi-phase consolidation as part of the retrieval path. Whether DAE produces a measurable lift over the existing ColBERT/hybrid fusion (already at R@5 = 0.9787) is an empirical question we will answer publicly, with controls, in a follow-up.
The reproducibility script is the canonical artifact of this post. The audit trail is in git. The numbers are real, the limitations are disclosed, the methodology is open, and the engine is AGPLv3 + PolyForm-NC.
If you'd like to try Mazemaker on your own agents, the easy path is the onboarding wizard — one-line installer, license-managed, the full Pro stack (ColBERT @ 1.5, DAE, Stage S synthesis, the Architect UI):
Onboard Pro — free during launch
The community engine is also AGPLv3 + PolyForm-NC and lives at github.com/itsXactlY/mazemaker — clone it, build it, run it. The community build ships hybrid recall, lightweight three-phase dream consolidation (NREM + REM + Insight, pre-iter00 form), CLI & MCP. ColBERT, DAE, Stage S synthesis, dream-worker autonomy, and the Architect UI are Pro-only.
The labyrinth, not the cloud.