The Receipts. Adversarial Audit.

Memory benchmarks should measure memory. Most don’t. This page is what survives a peer reviewer with skin in the rejection — eight rounds of adversarial audit by GPT-5.5, four negative controls that must fail when the relevant mechanism is removed, and the Comparison Bench: the same ten models a competitor published as “not viable”, re-run through Mazemaker and scored on plain-text answers.

The Comparison Bench

A competitor published a list of ten models that scored 0/N on their benchmark because the models could not follow the required JSON output schema. We took the same list. We ran it through Mazemaker. Substring-match scoring on the LLM’s plain-text answer. No JSON gating. Same dataset shape.

270M model · 90% on adversarial memory tasks · Raspberry Pi class hardware · Reproducible by curl
gemma3:270m, the 270-million-parameter Gemma — runs on a Raspberry Pi. 18 / 20 = 90% on the Comparison Bench. Reproducible: bash <(curl -fsSL https://mazemaker.dev/bench.sh)
Aggregate (CB @ 1.5)188 / 200= 94.0%
JSON leaks0 / 200across all calls
HTTP errors0 / 200deterministic repro
Competitor list0 / 10marked “not viable”
ModelParamsCompetitorno-ColBERTColBERT @ 1.5
gemma3:12b12B0/N20/20 · 100%20/20 · 100%
smollm2:1.7b1.7B0/N20/20 · 100%20/20 · 100%
granite3.1-dense:2b2B0/N20/20 · 100%20/20 · 100%
ministral-3:3b3B0/N19/20 · 95%20/20 · 100%
llama3.2:latest3B0/N19/20 · 95%19/20 · 95%
qwen2.5:3b3B0/N19/20 · 95%19/20 · 95%
gemma3:1b1B0/N19/20 · 95%19/20 · 95%
gemma3:270m270M0/N18/20 · 90%18/20 · 90%
qwen2.5:0.5b0.5B0/N18/20 · 90%18/20 · 90%
deepseek-r1:1.5b1.5B0/N14/20 · 70%15/20 · 75%
aggregate0/10186/200 · 93.0%188/200 · 94.0%
// what this is — and what it isn’t

We are not claiming Mazemaker has fundamentally better end-to-end QA than the competitor — this is not E2E QA. The Comparison Bench is a substring-match read on a 20-question synthetic set; latency, JSON robustness and the 0/N–to–94% swing across the same ten models is what it actually measures. For neutral 500-question retrieval results see the LongMemEval-S section directly below. We are claiming this: memory benchmarks should measure memory. They shouldn’t gate models on whether they can act as a structured-output orchestrator. If your benchmark says a 270-million-parameter model has zero memory capability, your benchmark is testing something else.

// reproducible — one curl

You don’t have to take our word for it. The harness is open, the dataset is in the repo, the scoring is substring-match on plain text. ~12 minutes on a 16-GB-VRAM machine, ~25 minutes on CPU-only. Pulls the same ten ollama models, runs the same harness, prints your own table. If your numbers differ from ours, please open an issue with your environment + the result JSON. We want to know what we got wrong, in public.

reproduce the Comparison Bench
bash <(curl -fsSL https://mazemaker.dev/bench.sh)

LongMemEval-S 500-Question Retrieval · R@5 = 97.87%

Five hundred questions, six question types, 470 gradeable. Same harness, same dataset (sha256 d6f21ea9…), same config (recall_mode=hybrid, k=10, granularity=session). The only thing that changes between the two runs is whether the ColBERT @ 1.5 late-interaction channel is on.

R@5 (CB @ 1.5)0.9787+1.91 pp vs hybrid
R@1 (CB @ 1.5)0.8574+5.10 pp vs hybrid
MRR (CB @ 1.5)0.9114+3.81 pp vs hybrid
p50 latency56.9 ms+15.8 ms cost
metricno-ColBERTColBERT @ 1.5Δ
R@10.80640.8574+5.10 pp
R@50.95960.9787+1.91 pp
R@100.98300.9894+0.64 pp
MRR0.87330.9114+3.81 pp
p50 latency41.1 ms56.9 ms+15.8 ms
p95 latency65.2 ms60.8 ms−4.4 ms
n_gradeable470 / 500470 / 500

Per-question-type — where ColBERT actually moves the needle

question typenR@5 (no-CB)R@5 (CB @ 1.5)MRR (no-CB)MRR (CB @ 1.5)
knowledge-update720.98611.00000.94330.9699
multi-session1210.98351.00000.91580.9500
single-session-assistant561.00001.00000.93750.9821
single-session-user640.89060.96880.73470.8383
single-session-preference300.86670.90000.71140.7087
temporal-reasoning1270.96060.96060.87300.8948
// what survived peer pressure

ColBERT @ 1.5 lifts three of six question types to perfect R@5 — knowledge-update (1.0000), multi-session (1.0000), single-session-assistant (1.0000) — and gives the largest single-category swing on single-session-user: +7.8 pp R@5, +10.4 pp MRR. Where the synthetic 20-question Comparison Bench is too saturated for ColBERT to help (most models already hit 95–100%), the 500-question LongMemEval-S leaves room for the late-interaction channel to move the needle. This is what ColBERT was designed for.

// honest about the cost

ColBERT @ 1.5 costs +15.8 ms p50 per recall (from 41.1 ms to 56.9 ms). The lift on R@5 is +1.91 pp; on MRR +3.81 pp; on R@1 +5.10 pp. Operator decision: the colbert_weight knob is exposed and defaults to off in lean mode. Pick the trade-off your latency budget allows. We ship both.

Result files in the repo: longmemeval_s_master-baseline_20260509T214714Z.json (no-CB) and longmemeval_s_colbert-on-master_20260510T034308Z.json (CB @ 1.5). Reproduce with the harness at benchmarks/external/longmemeval_s.py.

LongMemEval-oracle 500-Question Retrieval · R@5 = 84.26%

Same 500 questions, but the hard sibling: one ~25k-memory haystack per question (the LongMemEval-oracle variant), not the 50–200-session S variant that hits 0.98 above. One hundred iterations of bench-driven engineering, four architectural eras, one breakthrough finding: memory formation, not retrieval, broke the ceiling. Final champion: R@5 = 0.8426, R@10 = 0.9000, R@1 = 0.6255, MRR = 0.7124. ssu R@10 hit a perfect 1.0000. Total OpenAI spend across all formation passes: under $0.10. Fully deterministic.

R@5 final0.8426+12.98 pp vs anchor
R@10 final0.9000broke the 0.90 barrier
R@1 final0.6255top result correct >6 of 10
MRR0.7124peak ranking quality
ssu R@101.0000perfect on user-state queries
iterations100full bench-driven loop
metricanchor (iter00)retrieval ceiling (iter72)formation era (iter81)final champion (iter100)
R@10.50000.54260.58720.6255
R@50.68510.74040.80850.8426
R@100.73830.78940.85110.9000
MRR0.57770.63670.67430.7124
n_gradeable470 / 500470 / 500470 / 500470 / 500

Per-question-type — where the formation lever pays

question typenR@5 (iter72 ceiling)R@5 (iter100)R@10 (iter100)
knowledge-update720.87500.87500.9167
multi-session1210.72730.85950.9256
single-session-assistant560.91070.92860.9286
single-session-preference300.36670.66670.8000
single-session-user640.70310.93751.0000
temporal-reasoning1270.61420.75590.8110

single-session-user R@10 = 1.0000 persistent across iter78+ — every ssu query has its gold in the top-10. single-session-preference R@10 = 0.8000 (multi-recall era) — +56.7 pp above the iter72 ceiling.

// the lever that broke the plateau

Retrieval-side tuning — ColBERT weight, DAE weight, candidate pool, intent boost, temporal/salience weights — saturates at R@5 = 0.7404. Four structurally different stacks (iter67/68/69/72) all land at the same number. The gold session is in the candidate pool but doesn’t score high enough.

The fix isn’t a better reranker. It’s better formation: the targeted formation pass re-extracts user-side facts query-conditionally and inserts them so the recall pipeline can surface them. Each round lifts the targeted question type by 6–22 top-5 hits at one to two cents per session.

// the rerank-feedback discovery

The targeted formation pass changes the corpus density. Late-interaction reranking (ColBERT @ 2.5 was the peak before the formation rounds) shifts to a new peak at ColBERT @ 3.0. DAE rerank shifts from 2.0 to 3.5. Multi-recall becomes net-positive. None of these moves work before formation; together with formation they lift the engine another +3.4 pp R@5 on top of the formation-pass lift, breaking through 0.84.

// the dilution dance

Each type-targeted pass gains 6–22 hits on its type but trades 2–6 hits across others — the new facts compete with existing gold for top-5 slots. The dilution invariant holds at the per-session level: stay under ~6 facts per session per round to avoid the regression crossover. Several stable champions emerged across the 100 iterations, with iter95 / iter100 as the final aggregate-R@5 champion at 0.84 and iter97 as the precision champion (R@1 = 0.6255, R@10 = 0.9000).

The result JSON for every iteration is in benchmarks/external/results/ — the anchor at R@5 = 0.6851, the retrieval-tuning ceiling at 0.7404, the formation-era breakthrough at 0.8085, and the final rerank-feedback champion at 0.8426. The full iteration history is in benchmarks/INCEPTION_BENCH_GUIDE.md. Fully reproducible from one pg_restore + one benchmark command.

Bench Infrastructure · 2026-05-13

The bench corpus now lives in a single Postgres snapshot — mm_bench_raw. One pg_restore brings up four BEAM scales, all three LongMemEval variants, 2 000 probing questions, and 207 sha256-pinned file-level provenance rows. The dump is the corpus; if your sha256 matches ours, you have the bytes we did.

Snapshot size957 MBsingle pg_dump
Restore time~3 sroundtrip verified
Provenance rows207sha256 + n_rows
AFE bulk-write282×88 min → 75 s
corpusturns / msgsconvs / sessionsquestions
beam_10m208 696 turns10 convs200
beam_1m74 630 turns35 convs700
beam_500k38 058 turns35 convs700
beam_100k5 732 turns20 convs400
longmemeval_s247 K msgs25 K sessions500
longmemeval_m2 450 050 msgs250 948 sessions500
longmemeval_oracle10 960 msgs948 sessions500
bench_meta.sources207 rows · sha256 + n_rows
// what this unlocks

With the corpus pinned in Postgres and the remember_batch() bulk-write path landed in both stores, two studies that were previously out of reach are now in flight: LongMemEval-M (the 2.45M-message medium variant from Wu et al.) and the BEAM probing-questions scaling study across 100K–10M in one harness. Both are coming soon. We will publish numbers when the sweeps finish; we will not publish numbers before that.

// reproducibility shape

Single pg_dump, 957 MB, restore-roundtrip verified at ~3 s on a warm host. /tmp/BEAM is now vestigial. The harness reads the corpus through corpus_helpers.load_flat_turns_auto() with MM_LME_FROM_PG=1 set; same code path local or in CI. Full write-up: why our bench corpus now lives in Postgres.

mm_10m_eval — Full 10-Conv Matrix

Sonnet judge v2, multi-recall (RRF over 3 query angles, k=25), Postgres backend, 442 verified questions across 10 conversations, 770 rubric items. Aggregate mean 0.709 (95% CI [0.667, 0.750]). Abstention precision is the headline strength (0.912); topic-cluster recall on the long-tail conv-10 is the headline weakness (0.024) and the next concrete fix target. Single-shot conv-1 result from the prior stack (63.19%, see BEAM-10M write-up) is superseded by the multi-recall result below: 0.925.

convn_questionsscoreinfo_extractionmulti_sessionupdate_trackingabstention
conv-1200.9251.0000.7500.833
conv-2120.8330.833
conv-3320.8980.9580.6251.000
conv-4550.9051.0000.7611.000
conv-5580.7030.8540.3620.929
conv-6530.5090.1670.0001.000
conv-7440.6190.5420.6250.722
conv-8460.9570.9581.0001.0000.944
conv-9400.9121.0000.2500.857
conv-10820.3780.7080.2010.0001.000
aggregate4420.7090.7890.2670.5080.912

Note: per-conv "—" indicates that ability had no questions in that conversation. The topic_cluster ability (24 questions, all on conv-10) is excluded from the per-conv column above — the aggregate 0.024 is what makes conv-10 overall drop to 0.378 despite a perfect abstention column. Full per-question breakdown in results/matrix.json.

Eight Audit Rounds. No Residual Caveat.

We submitted the entire benchmark suite — including the negative controls that must fail when the mechanism is removed — to GPT-5.5 via the codex CLI. The first two rounds rejected the suite outright. By round eight, every concrete objection had been closed by code change — not argument. Every prompt and every verdict is committed verbatim in the repository at benchmarks/audit/.

v2NO

Lexical leakage; salience/dream/MMR never measured; broken dream suite.

v3NO

Topic-word leakage in 18–24% of queries; cross-instance anchor collisions; channel_ablation defaults wrong; no graph task that traversal could prove anything on.

v4QUALIFIED

Source fixes landed; would accept iff the real run produced graph_lift + shuffle_collapse + strict post-dream lift.

v5YES

All four conditions satisfied with cited numbers; four named caveats remained.

v6QUALIFIED

Real-text mode, lean preset, score_percentile shipped.

v7QUALIFIED

Lean BEAT skynet by +0.18 R@5 on real prose; only “dream lift on real text remains weak (+0.04)” stayed.

v8UNCONDITIONAL YES

Dream lift caveat closed: at n=75 / 600 distractors / k=5, dream lift jumps to +0.4267. The +0.04 was a sample-size artifact. No residual caveat.

// the v5 quote

“Yes. I would upgrade the v4 qualified-y to yes for this executed benchmark. A peer reviewer should accept that this run demonstrates mazemaker doing something a vanilla vector store cannot: explicit edge-following recovers hidden chain targets, shuffled edges collapse most of that gain, and dream-derived facts appear only after the dream phase under strict pre/post controls.”

We didn’t ask for that quote. We earned it by submitting first to v2, getting rejected, and shipping fixes.

Negative Controls. Structural, Not Just Better.

Every row comes from the v8 audit benchmark summary. The point of each row is the control: a knob we turn off that must collapse the result if the mechanism is what’s actually doing the work. If you can’t make the number drop on demand, you don’t have evidence — you have a coincidence.

Hop-2 graph reasoning — R@10 0.00 → 1.00

30 chains: A says “see B”, B says “see C”, only C contains the answer. No A→C shortcut, asserted by row count. Vanilla cosine: 0.0000. Mazemaker multihop: 1.0000.

Shuffled-edge control — R@10 1.00 → 0.27

Same chains, same edge count, but edges shuffled to random pairings. Multihop collapses to 0.2667. The lift is edge-driven, not embedding-driven.

Pre-dream baseline — R@10 0.00

25 (P1, P2) premise pairs, 300 distractor paraphrases. Pre-dream is structurally zero by template construction — no single memory contains both attribute tokens.

Post-dream synthesis — R@10 0.00 → 0.43

The Insight phase materialises cluster memories the user never wrote. Post-dream is dream-attributable because pre-dream is zero by construction. Cycle elapsed: 0.43s.

Conflict supersession — winner @ 1 0.03 → 0.33

With detect_conflicts=False: stale fact dominates winner in 60% of cases. With supersession on: dominance flips. The lift is from supersession itself, not recency or vector similarity.

Cross-session continuity — R@5 0.06 → 0.62

Two near-distractors per target carrying the query’s vocabulary on a fresh entity. Tier-1 (200 noise / 100 distractors): raw cosine collapses 0.46 → 0.06. Mazemaker holds 0.62.

Lean vs skynet — real prose R@5 0.60 vs 0.42

n=200 real prose: lean (semantic + entity + PPR) beats the everything-on skynet preset by +0.18 R@5. Default changed; both presets ship.

Lean latency — p50 ~85ms

Skynet p50 is 339ms on the synthetic anchor-paraphrase task. Lean is 4× faster at −0.02 recall. Engineering knob, not a benchmark issue.

Continuity Under Adversarial Distractors

Each tier injects two near-distractors per target carrying the query’s vocabulary on a fresh unrelated entity. Targets are stored in “session 1”; the query happens after N sessions of noise. Recency-only is the pathological control — anything beating “newest wins” is doing semantic work.

tiertotal noisedistractorsmazemakerraw cosinerecency-only
0000.660.460.10
12001000.620.060.00
212002000.580.060.00
362003000.200.060.00

Raw cosine collapses 0.46 → 0.06 once the design pulls the rare-token shortcut. Mazemaker wins at every tier. We have not yet benchmarked at million-memory scale.

Benchmark-Driven Defaults

A default is a benchmark result with consequences. Every default below is the surviving end of an ablation. Numbers above; reasoning here.

retrieval_mode: lean

Semantic + entity + PPR. Drops BM25, temporal, and salience because real-prose ablation showed they were dead weight or actively harmful at n=200. Skynet still ships for synthetic anchor-paraphrase tasks.

recall_score_percentile: 0.3

Rank-calibrated [0,1] floor replaces the old raw score floor that lived around 0..0.05 and could silently nuke recall. Percentile is comparable across embedding models, raw cosine is not.

think_engine: ppr

Personalized PageRank is the load-bearing ranking channel; semantic is the load-bearing recall channel. Removing PPR costs −0.13 MRR on the audit suite.

dream cadence: 600s idle / 50 new

Either trigger fires the cycle. Manual mazemaker_dream is always available; the daemon is opt-in. Pre/post controls run against the same memory set with cycle elapsed ~0.43s on the canonical suite.

The Honest Limitations

Five things we want to call out up front, because they will be called out by anyone who reads carefully.

// 01 — the Comparison Bench dataset is small

20 questions × 10 models = 200 trials. The five categories cover what LongMemEval-S tests, but the variance band is ±2–3 questions per model. Treat sub-5-point gaps as noise. ColBERT @ 1.5 vs no-ColBERT delta on this dataset is +1.0 pp aggregate (188/200 vs 186/200) — within variance. The LongMemEval-S 500-question run is the non-saturated dataset where ColBERT’s actual lift (R@1 +5.10 pp, MRR +3.81 pp) becomes visible.

// 02 — tier-3 continuity drops

At 6200 noise memories, continuity drops to 0.20. Mazemaker still beats raw cosine (0.06) and recency (0.00) at every tier, but the lift narrows. We have not yet benchmarked at million-memory scale. The Postgres + pgvector backend is the path there; SQLite WAL is the small-corpus default whether the pod runs on local, cloud, or dedicated hardware.

// 03 — lean beats skynet on real prose

Lean wins by +0.18 R@5 on real prose. Skynet wins on synthetic anchor-paraphrase. We ship both; lean is the default. We’re not pretending skynet is universally optimal — it isn’t.

// 04 — latency is real

Skynet p50 is 339ms on the synthetic suite. Lean is 4× faster at −0.02 recall. Engineering knob, not a benchmark issue, but a real cost in production. The Architect cockpit shows live p50/p95 over the last 5s tick.

// 05 — dream lift was +0.04 at v7

The dream engine’s real-text lift was +0.04 at v7 — an apparent caveat that closed at v8 only when n was scaled (n=75 / 600 distractors / k=5 → +0.4267). Sample- size artifacts cut both ways. We disclose the v7 number because hiding it would be the same kind of cherry-pick we’re objecting to elsewhere.

Reproduce Everything

The reproducibility script is the canonical artifact. The audit trail is in git. The numbers are real, the limitations are disclosed, the methodology is open, and the engine is AGPLv3 + PolyForm-NC.

01comparison bench
~12 min CUDA

Pulls the same 10 ollama models, clones the engine, runs the harness on a 20-question synthetic dataset, prints the markdown table.

one curl, ten models
bash <(curl -fsSL https://mazemaker.dev/bench.sh)
02audit transcripts
verbatim git

Every audit prompt and every verdict is committed in the repository, named by round and date. v2 NO → v8 UNCONDITIONAL YES, in seven files.

read the verdicts
git clone https://github.com/itsXactlY/mazemaker
ls mazemaker/benchmarks/audit/
03audit benchmark suite
all controls in one run

The full v8 suite: hop-2 + shuffle control + pre/post dream + supersession + continuity tiers + lean/skynet ablation. Reproduces the numbers cited above, with the exact preset names and seed.

the audit benchmark itself
cd mazemaker/benchmarks
python3 -m audit.run_audit --preset v8

If your numbers differ from ours, please open an issue with your environment + the result JSON. We want to know what we got wrong, in public.

What’s Next

// edge-type-aware PPR traversal

The current graph traversal treats every edge equivalently. The next frontier is edge-type-aware PPRBEFORE edges fire only on temporal queries, supersession edges fire only on contradiction queries, preference- affinity edges fire only on recommendation queries. The retrieval profile becomes a function of query intent, not a global setting. This is the structural change behind the next R@5 ceiling break.

// typed memory routing

Today the recall pipeline treats facts, preferences, episodes, synthesized memories, and assistant turns as equivalent semantic objects. The next layer is a query-intent classifier that selects a retrieval profile per query: factual → semantic exactness, preference → synthesized + emotional + repeated, temporal → chronological + episodic, identity → high-confidence persistent. Each profile re-weights the existing channels on top of the formation-rich corpus.

// the LongMemEval-oracle ceiling

iter100 closed at R@5 = 0.8404 / 0.8426 within per-question noise. The three honest frontiers above are how we credibly push toward R@5 0.90. We don’t pre-announce a number we haven’t measured — the same rule that produced the 100-iteration audit will produce the next deliverable.