Lab notes · 2026-05-18 · aLca, Mazemaker

Inception Benchmarking.

We did not set out to build a benchmark. We set out to make Mazemaker measurably better and discovered, three weeks in, that the published memory benchmarks we were trying to beat were not measuring memory. They were measuring JSON-schema compliance, judge mood, and rubric typos. So we wrote our own — the Inception Bench — against the hardest public memory haystack we could find, audited it line by line, removed the LLM judge entirely, pinned the corpus by sha256, and published every artifact. This is the methodology companion to the 100-iteration story. It is also a catalogue of the specific, named, replicated defects in the benchmarks the field has been quoting. We did not pick this fight. The receipts did.

I. The problem with memory benchmarks

“Memory” in an LLM context is a slippery word. Three things share the name. Context window is what the model can hold in a single forward pass. Retrieval is what gets pulled back from a store on demand. Formation is what gets crystallised into the store from raw conversation in the first place. The published benchmarks largely measure retrieval, ignore formation, and call the composite “memory.” That is the first slip.

The second slip is the scoring layer. Most published memory benchmarks grade a long natural-language answer against a long natural-language gold, using an LLM judge. The judge is itself an LLM with its own instruction-following profile, its own biases, and its own per-run variance. The published number is therefore a product of three things — the memory engine, the answering model, and the judge — and the public reporting usually pretends only the first one varies.

The third slip is corpus shape. A “25k-memory haystack” sounds adversarial until you discover the haystack has been pre-chunked into 50–200 short sessions per question, each session sized just-so to fit comfortably into a single retrieval call. The model never has to disambiguate across the whole haystack. It has to disambiguate across a friendly partition. That is a different problem — an easier one — and it is the one most published numbers actually score.

Consider the landscape we walked into. LongMemEval-S, as released, ships the friendly partition: 50–200 short sessions per question, each session sized for direct LLM ingestion. We hit R@5 = 0.9787 on that variant in the first week. That number is a feature of the corpus shape, not of memory quality. BEAM-10M ships an ambitious 10-million-token corpus and a thoughtful rubric — but the rubric has a measurable defect rate, and the published evaluator silently truncates partial credit. Hindsight’s grader scores answers JSON-shape-first; capable small models that produce correct prose but malformed JSON are recorded as 0/N. mm10m, our own earlier in-house attempt, used an LLM judge and produced a 16-percentage-point swing between Haiku-4-5 and gpt-5-nano on the same answers.

None of these are bad-faith efforts. The authors are smart, the corpora are real, and the public discussion has been improved by their existence. But once we replicated each grader and audited its outputs against the underlying corpus, the headline numbers stopped meaning what the headlines said they meant. We needed a benchmark whose number was a property of the engine and the corpus, not of the judge’s mood or the answering model’s JSON discipline. So we wrote one.

II. Specific defects we found in published benchmarks

This is the most important section of this post. Everything that follows (the Inception Bench design choices, the 100-iteration story, the claim that retrieval saturates and formation breaks through) rests on us being right about what the existing benchmarks measure. If we are wrong about the defects, we are wrong about the rest. So we are going to be specific.

The int() truncation in Hindsight’s ability evaluators. We replicated Hindsight’s published evaluator and ran it on its own gold/prediction pairs. Nine of the ten ability evaluators wrap their partial-credit sum in int(). Python’s int() on a float truncates toward zero. Half-credit answers (0.5) become 0. The headline 64.1% the field has been quoting is therefore artificially deflated by partial credits that exist in the rubric, are awarded by the rubric, and are then silently dropped one statement before the sum. Replace one int() with round() and Hindsight’s own published predictions score approximately 65%. The entire field has been comparing against the wrong number. We documented this on 2026-05-11; the upstream has not corrected it.
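A minimal sketch of the truncation effect, assuming a scorer that sums per-item rubric credits and wraps the sum in int(). The function names and the example credits are hypothetical; this is not Hindsight's code.

```python
def score_truncated(item_credits):
    """Sum the rubric credits, then truncate toward zero (the pattern described above)."""
    return int(sum(item_credits))


def score_exact(item_credits):
    """Sum the rubric credits and keep the fractional part."""
    return sum(item_credits)


# Hypothetical rubric outcomes: a lone half credit, and a full plus a half credit.
half_only = [0.5]
full_plus_half = [1.0, 0.5]

print(score_truncated(half_only), score_exact(half_only))            # 0 0.5
print(score_truncated(full_plus_half), score_exact(full_plus_half))  # 1 1.5

# Python 3's built-in round() rounds exact halves to the nearest even integer.
print(round(0.5), round(1.5))  # 0 2
```

One caveat worth checking when replicating: because round() sends exact halves to the nearest even integer, it recovers some half credits and not others. Keeping the float avoids both behaviours.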

The 20% rubric-defect rate on BEAM-10M conv-1. We took conv-1 of BEAM-10M (twenty questions), pulled each question’s gold and its rubric, and audited each one against the source corpus by hand. Then we re-audited blind through two independent judges — Sonnet 4.5 and Qwen 3.6 Plus via the Nous Research Router — with κ = 0.71 inter-judge agreement. Four of twenty questions are defective. The diagnoses are crisp: phantom rubric keyword (the rubric requires a word that does not appear anywhere in the gold session), question-corpus misquote (the question paraphrases a corpus fact incorrectly, so the “correct” answer is the wrong one), empty gold (the gold field is literally the empty string), and compositional hallucination (the rubric demands the engine assemble a fact that is not present in any of the sessions). A 20% defect rate on a single audited conversation is enough to invalidate sub-five-point comparisons on that conversation.

The ten models Hindsight published as “0/N · not viable.” Hindsight’s public table lists ten smaller open models as scoring zero against their bench, with a footnote that the models could not produce structured output reliably. We ran the same ten models against Mazemaker on our Comparison Bench — same memories, same questions, plain-text answers, deterministic substring scoring against the gold. Eight of the ten cleared 90%. The 270-million-parameter Gemma running on a Raspberry Pi cleared 90%. Hindsight’s grader was not measuring whether the model remembered. It was measuring whether the model could obey the grader’s JSON schema. Those are not the same axis.
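To make the two axes concrete, here is a hedged sketch of a JSON-gated grader next to a plain substring grader. Both functions, the "answer" key, and the example sentence are assumptions for illustration; neither is the Hindsight grader nor the Comparison Bench scorer.

```python
import json


def json_gated_score(raw_answer: str, gold: str) -> int:
    """Hypothetical grader: credit only if the answer parses as a JSON object
    with an 'answer' field that contains the gold."""
    try:
        parsed = json.loads(raw_answer)
    except json.JSONDecodeError:
        return 0  # correct prose is scored as a miss
    if not isinstance(parsed, dict):
        return 0
    return int(gold.lower() in str(parsed.get("answer", "")).lower())


def substring_score(raw_answer: str, gold: str) -> int:
    """Plain-text grader: case-insensitive substring match against the gold."""
    return int(gold.lower() in raw_answer.lower())


prose = "The user moved to Lisbon in March."
print(json_gated_score(prose, "Lisbon"))  # 0
print(substring_score(prose, "Lisbon"))   # 1
```

The same remembered fact scores 0 under one grader and 1 under the other, which is the gap between the two axes.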

The 16-point judge spread on identical answers. On the 2026-05-14 mm10m calibration run we scored the same set of engine answers with three different judges: gpt-5-nano-2025-08-07, Haiku-4-5, and Opus 4.7. Haiku and Opus agreed at κ = 0.93 and produced ~0.53 aggregate. gpt-5-nano produced 0.36 on the same answers — a 16-point swing driven entirely by the judge. We then surveyed whether any of the published memory benchmarks had reported this kind of judge-sensitivity audit. None had. Most do not even disclose which judge they used in their headline number.
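For readers who want to run the same kind of agreement check, the following is a small sketch of Cohen's κ for two judges labelling the same answers. The label lists are made up for illustration and are not the calibration data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail verdicts from two judges on four answers.
judge_a = [1, 1, 0, 0]
judge_b = [1, 1, 0, 1]
print(cohen_kappa(judge_a, judge_b))  # 0.5
```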

Haiku-4-5 unfit as rubric judge. Because we had Haiku in the loop for the calibration above, we also tried it as a primary rubric evaluator on BEAM-10M. Its instruction-following compliance on multi-clause rubric application was 0% — not low, zero. It would consistently award credit for clauses it had not actually checked. Models that are excellent at conversation are not automatically excellent at precise rubric application; treating one as a substitute for the other is how 16-point swings happen. We default any prompt-adherence role to Sonnet or higher.

Each defect above has a date, a replication recipe, and a saved artifact. None of them require trusting our taste. They require running the upstream evaluator on the upstream gold and checking what comes out.

III. What we decided to measure instead

Given the above, we built the Inception Bench around four non-negotiable design rules.

Six question types, named explicitly. knowledge-update (a fact the user states, then updates, then is asked about), multi-session (the answer requires bridging information across more than one session), single-session-assistant (the gold is something the assistant said), single-session-preference (the gold is a user preference embedded in conversation), single-session-user (the gold is a concrete user-state fact), temporal-reasoning (the answer requires reasoning about dates, durations, or sequences). Aggregate numbers without per-type breakdown lie. We always report all six.
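A per-type breakdown is cheap to compute. The sketch below assumes each scored question is available as a (question_type, hit) pair; that record shape and the example data are illustrative assumptions, not the bench's result schema.

```python
from collections import defaultdict


def recall_by_type(results):
    """results: iterable of (question_type, hit) pairs, where hit is True
    when the gold was found in the top-K for that question."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for question_type, hit in results:
        totals[question_type] += 1
        hits[question_type] += int(hit)
    if not totals:
        raise ValueError("no results to aggregate")
    per_type = {t: hits[t] / totals[t] for t in totals}
    aggregate = sum(hits.values()) / sum(totals.values())
    return per_type, aggregate


# Illustrative data: one type is perfect, another is at 50%.
results = (
    [("single-session-user", True)] * 8
    + [("multi-session", True), ("multi-session", False)]
)
per_type, aggregate = recall_by_type(results)
print(aggregate)                  # 0.9
print(per_type["multi-session"])  # 0.5
```

The aggregate of 0.9 looks healthy while one type sits at 0.5, which is why the per-type table has to be reported alongside it.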

The oracle variant. We score against LongMemEval-oracle 500q, not LongMemEval-S. The difference is the haystack: oracle hands the engine one ~25,000-memory corpus per question, undivided, no friendly partition. The engine has to find the gold inside the whole haystack on its own. This is the harder sibling of the corpus most benchmarks quote and the reason our oracle numbers (R@5 ≈ 0.84) sit visibly below our S numbers (R@5 ≈ 0.98). The S number is real. It just measures an easier problem.

Determinism as a hard requirement. A benchmark whose run-to-run variance approaches the size of the deltas you are claiming is not a benchmark; it is a coin flip. The Inception Bench is fully deterministic at the seed level: iter61 replicated iter58 to four decimal places on every gold-detection metric, after we had explicitly tried to break it with knob sweeps. iter77 replicated iter75 after we rolled back a regressing round. iter100 replicated iter95 within per-question noise. Any time we ship a delta below ~0.5pp we run two reps; anything ≥0.5pp on n=500 is signal at this seed.
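The two-rep rule can be written down as a small check. This is a sketch of the rule as stated above; the function names are hypothetical and the 0.5pp threshold is taken from the text.

```python
def granularity_pp(n_questions: int) -> float:
    """Percentage points that a single question is worth at this sample size."""
    return 100.0 / n_questions


def needs_second_rep(delta_pp: float, threshold_pp: float = 0.5) -> bool:
    """True when a delta is below the threshold and should be confirmed by a second run."""
    return abs(delta_pp) < threshold_pp


print(granularity_pp(500))     # 0.2
print(needs_second_rep(0.22))  # True: roughly one question's worth of movement
print(needs_second_rep(1.92))  # False
```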

Substring scoring against gold. No LLM in the loop. This is the design choice that took us the longest to commit to and is the one we are most confident about in retrospect. The runner emits one canonical value per question. The scorer does a case-insensitive substring match against the gold value. There is no rubric-applying language model anywhere in the scoring path. This removes the 16-point judge spread, removes the reasoning-budget panic that killed an entire iteration when nano emitted empty content (max_completion_tokens covering reasoning plus visible output, with medium effort and a 4k cap, gave us blank strings), and removes the prompt-coupling blast radius where changing the answering prompt by a sentence shifts the judge’s grading distribution by a measurable amount. The price is that some questions need a tighter gold definition than the original LongMemEval shipped with; we document each tightening in the gold-correction log.
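A minimal sketch of a deterministic top-K check in this spirit, assuming the engine returns a ranked list of memory texts. The function name and the input shape are assumptions; the actual scoring contract lives in benchmarks/inception_bench.py.

```python
def gold_in_top_k(gold: str, ranked_memories: list[str], k: int) -> bool:
    """Case-insensitive substring match of the gold against the top-k texts."""
    if not gold.strip():
        # An empty gold would match everything, so refuse to score it.
        raise ValueError("gold must be a non-empty string")
    needle = gold.casefold()
    return any(needle in memory.casefold() for memory in ranked_memories[:k])


ranked = ["Likes green tea", "Moved to LISBON in March"]
print(gold_in_top_k("Lisbon", ranked, k=5))  # True
print(gold_in_top_k("Lisbon", ranked, k=1))  # False
```

The same inputs always give the same verdict, so any run-to-run difference has to come from the engine or the corpus.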

What we explicitly do not measure: free-form answer fluency, JSON well-formedness, conversational politeness, or any axis a normal eval suite already covers. The Inception Bench measures memory: did the engine surface the right fact, in the top-K, given a query against the full haystack.

IV. Building the Inception Bench: the unflinching engineering history

The first version of the Inception Bench was a Python script that called the engine once per question, accumulated a list of recalls, and pickled them. It worked. It was also unreproducible, slow, and silently buggy in ways that took us three weeks to surface. The current version — benchmarks/inception_bench.py — is the survivor of an unflattering sequence of corrections. We are going to walk through each one because the receipts matter.

The Postgres snapshot path. Originally the corpus lived in SQLite, was rebuilt from a JSON dump each run, and took 88 minutes to populate at 500 sources. After one of us shipped the wrong dump and the other ran a 6-hour sweep against the wrong haystack, we cut over to a single pg_restore from a sha256-pinned dump. The dump is the corpus: 957 MB, restores in ~3 seconds, carries the full LongMemEval triplet plus BEAM scales, and ships with 207 sha256-pinned file-level provenance rows so we can prove which dump produced which JSON result file. If two runs disagree, the first thing we check is the dump hash. We have caught one corpus drift this way that would otherwise have looked like a regression.
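Checking a dump against its pinned hash takes a few lines. The sketch below uses only hashlib; the helper names are hypothetical, and the pinned value would come from wherever the hash is recorded.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large dump never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_matches_pin(path: str, pinned_hex: str) -> bool:
    """Compare the file's sha256 with a pinned hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == pinned_hex.strip().lower()
```

Running this before every restore turns "which haystack was that?" into a one-line check.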

The 282× bulk-write refactor. Per-fact remember() calls were the original ingest path. We profiled them and found 95% of the time was in embedding-model warmup and per-row INSERT round-trips. The replacement does one batched embedding pass and one executemany INSERT per source. 88 minutes → 75 seconds for 500 sources. That is not a tuning win, it is a category change — it is the difference between “rebuild the corpus only on cataclysms” and “rebuild the corpus before lunch.” The 100-iteration loop is impossible without this refactor.
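The shape of the batched path can be sketched as follows. To keep the example runnable without a server it uses the standard-library sqlite3 module as a stand-in for Postgres; embed_batch, the table, and the column names are all hypothetical.

```python
import sqlite3


def embed_batch(texts):
    """Stand-in embedder (hypothetical): one call covers the whole batch."""
    return [float(len(text)) for text in texts]


def ingest_source(conn, source_id, facts):
    """One batched embedding pass and one executemany INSERT per source."""
    vectors = embed_batch(facts)
    rows = [(source_id, fact, vec) for fact, vec in zip(facts, vectors)]
    conn.executemany(
        "INSERT INTO memories (source_id, fact, embedding) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (source_id TEXT, fact TEXT, embedding REAL)")
written = ingest_source(conn, "src-1", ["likes tea", "moved to Lisbon"])
print(written)  # 2
```

The point of the shape is that the per-row costs (model warmup, statement round-trip) are paid once per source instead of once per fact.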

The silent os NameError. While we were refactoring the AFE (Atomic Fact Extractor) pipeline, a stale import got dropped and one of the env-var reads went through an os.environ.get() that no longer had os in scope. Python raised NameError, the surrounding try/except swallowed it at log-level DEBUG, and every AFE configuration flag — including the one that controlled which extraction prompt got used — silently defaulted to its hard-coded fallback. Two weeks of A/B comparisons had been comparing the same configuration against itself. We caught it because two configurations that should have produced very different fact counts produced identical fact counts. The fix was one line. The lesson was “never except Exception at DEBUG level around a configuration read.”
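The failure mode is easy to reproduce in miniature. In this sketch the undefined name dropped_os stands in for os after its import was removed, and AFE_PROMPT_VARIANT is a hypothetical flag name; the real AFE code is not shown here.

```python
import logging
import os

log = logging.getLogger("afe_config")


def read_flag_buggy(name, default):
    """Broad except at DEBUG level: the NameError is swallowed and the default wins."""
    try:
        # `dropped_os` is deliberately undefined; it plays the role of the missing `os`.
        return dropped_os.environ.get(name, default)
    except Exception as exc:
        log.debug("config read failed: %s", exc)
        return default


def read_flag_fixed(name, default):
    """No blanket except: a missing import would fail loudly at the first call."""
    return os.environ.get(name, default)


os.environ["AFE_PROMPT_VARIANT"] = "b"
print(read_flag_buggy("AFE_PROMPT_VARIANT", "a"))  # a  (the flag is silently ignored)
print(read_flag_fixed("AFE_PROMPT_VARIANT", "a"))  # b
```

With the buggy reader, every configuration collapses to the fallback, which is how an A/B comparison ends up comparing a configuration against itself.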

The two pg_attribute namespace bugs. Postgres caches column metadata per-schema. When we ran the bench against two schemas in the same process (production vs. the isolated Inception Bench schema), the embedding-dimension constraint from schema A leaked into schema B’s column lookup, producing inserts that succeeded in development and threw at the boundary. Both bugs were the same shape: a function that took a table name and not a schema-qualified name. The receipt is two PRs both titled “qualify schema explicitly.”
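One way to make the schema impossible to forget is to have helpers accept only a (schema, table) pair. The sketch below is a hypothetical helper rather than the code from those PRs. It quotes identifiers the way Postgres expects, with double quotes and embedded double quotes doubled.

```python
def quote_ident(identifier: str) -> str:
    """Quote a Postgres identifier: wrap in double quotes, double any embedded quote."""
    if not identifier:
        raise ValueError("identifier must be non-empty")
    return '"' + identifier.replace('"', '""') + '"'


def qualified_name(schema: str, table: str) -> str:
    """Always return schema.table so lookups never depend on the search path."""
    return f"{quote_ident(schema)}.{quote_ident(table)}"


print(qualified_name("bench", "memories"))  # "bench"."memories"
```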

The fast_runner socket-collision race. We tried to parallelise the bench by running fast_runner --max-parallel N. N>1 workers raced on the embedding service’s Unix socket (embed.sock), each binding succeeded on a stale unlinked path, and four of ten convs silently dropped their writes. The aggregate number looked plausible. The per-type breakdown did not. We rolled back parallelism and made mm_10m_eval strictly sequential at the conv level. Any future parallelism on the Inception Bench must happen at the question level inside a single connection, not at the conv level across connections.

The DAE channel silently disabled on every PG bench until 2026-05-14. The engine_config.py helper we shipped to centralise quality-bench configuration was force-setting MM_DAE_ENABLED=0 on the Postgres backend, because at one point DAE had a Postgres-specific bug we had since fixed and forgotten to re-enable. The v5 0.7093 number was therefore produced by an engine with a major rerank channel turned off. After we caught it (by reading the actual env at engine boot, not the config the helper claimed to set) the channel came back online and the headline number moved as a consequence. Every published quality bench between the introduction of engine_config.py and 2026-05-14 was running a crippled engine. We re-ran them.
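A boot-time guard for this class of bug might look like the sketch below. It reads the environment the process actually sees, compares it with what the run claims to use, and refuses to start on a mismatch. The helper name is hypothetical; MM_DAE_ENABLED is the flag named above.

```python
import os


def assert_effective_flags(expected, environ=None):
    """Raise if any expected flag differs from the environment the process actually sees."""
    env = os.environ if environ is None else environ
    mismatched = {
        name: (wanted, env.get(name))
        for name, wanted in expected.items()
        if env.get(name) != wanted
    }
    if mismatched:
        details = ", ".join(
            f"{name}: expected {wanted!r}, got {actual!r}"
            for name, (wanted, actual) in sorted(mismatched.items())
        )
        raise RuntimeError(f"effective config differs from declared config ({details})")


# Passes silently when the effective value matches the declared one.
assert_effective_flags({"MM_DAE_ENABLED": "1"}, environ={"MM_DAE_ENABLED": "1"})
```

Passing an explicit environ mapping keeps the check testable without touching the real process environment.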

The judge-swap saga. mm10m originally used gpt-oss-120b as primary judge. It produced templated stub answers when forced into runner mode and graded sloppily as judge. We swapped to gpt-5-nano with reasoning_effort=minimal. This eliminated the templated-stub problem but exposed the reasoning-budget overrun (medium effort + 4k cap = empty content). v11 conv-3 on update_tracking collapsed 0.083 → 0.000 from this bug alone. The eventual answer was not a different judge but a different design: the v14 pivot (bare canonical answer plus deterministic substring match) that became the Inception Bench scoring contract.

This is not a hagiography. Most of the bugs above existed because we wrote them. The point is that none of the headline numbers the Inception Bench produces today have one of these bugs underneath them, because each one ate a week of our calendar and got pinned with a regression test. The current bench is what is left after the bugs were removed in public.

V. The 100-iteration loop, brutally honest

The Inception Bench is the substrate. The 100-iteration loop is what we did on it. We are going to summarise the trajectory here because the methodology only makes sense in light of what it produced; the long-form story lives in the formation post.

The anchor at iter00 sat at R@5 = 0.6851. The final iter100 sat at R@5 = 0.8404, with the champion iter95 at R@5 = 0.8426 (within per-question noise of iter100). R@10 broke the 0.9000 barrier at iter97. ssu R@10 hit a perfect 1.0000. Total OpenAI spend across the entire loop was under $0.10.

Four architectural eras, each only visible after the prior one finished:

Era 1: retrieval (iter00 → iter72). Twenty-four iterations sweeping the relevance-formula knob surface — channel weights, intent boost, temporal weighting, salience, ColBERT and DAE rerank multipliers, candidate-pool size, multi-angle recall. The era ended with a ceiling proof: four structurally different stacks — iter67, iter68, iter69, iter72 — all landing at exactly R@5 = 0.7404 to four decimal places. Determinism made the proof airtight. Retrieval-side tuning at this corpus state is saturated.

Era 2: formation (iter74 → iter81). The diagnostic question after iter72 was “where is the gold for the missed queries?” and the answer, by direct corpus inspection, was “in the session, but in a shape the embedder cannot find from the query.” The Stage C user-statement rebake on 18 ssp-missing sessions, via gpt-5-nano at roughly $0.001 per call, lifted ssp R@5 from 0.3667 to 0.7000 — +33pp on a single targeted lever. The same lever applied to ssu produced ssu R@10 = 1.0000. The same lever applied to temporal-reasoning produced +18.11pp. Total formation API spend: under five cents.

Era 3: rerank-feedback (iter83 → iter95). This is the era we did not predict. The rebake-enriched corpus had a different density profile, and the rerank weights that were optimal on the pre-formation corpus were no longer optimal. ColBERT shifted from 2.5 → 3.0 (+1.92pp). DAE shifted from 2.0 → 3.5 (+1.21pp). Multi-recall, which had been useless on the pre-formation corpus, became useful (+0.22pp). The composite landed at iter95 R@5 = 0.8426. None of those three moves worked before formation. All three worked after. This is what we mean by “the benchmark and the engine co-evolved” — the optimum moved with the corpus.

Era 4: top-K sharpness (iter96 → iter100). The last five iterations were about R@1 and R@10. iter97 broke the R@10 = 0.9000 barrier with a temporal-weight nudge to 0.9 and set R@1 = 0.6255. iter100 replicated the champion stack as a determinism sanity check and landed at R@5 = 0.8404, inside per-question noise of iter95. The loop closed.

The single most important number in this story is not the headline. It is the iter72 ceiling proof. Four different stacks at R@5 = 0.7404 to four decimals is what told us the retrieval-side knob surface had been mapped. Without that proof we would still be tuning weights. The benchmark, by being deterministic enough to produce identical numbers from non-identical configurations, told us when to stop.

The dilution dance is the other piece worth highlighting. Every type-targeted formation pass gains 6–22 hits on its targeted type and trades 2–6 hits across the others. The new facts compete for top-5 slots. A third permissive ssp pass (“at least 5–10 facts per session”) regressed ssp R@5 by 10pp and we rolled it back atomically. Per-session fact density has a real ceiling around six facts per round at this corpus shape; past that the new facts are too noisy and pull high-signal ones out of top-5. We measured this. We did not theorise it.

VI. What did NOT work

This section is equal in length to the wins, because the wins are not credible without the misses. A research log that records only the successes is a marketing document.

Naive canonicalization regressed -3pp. The hypothesis was that collapsing semantically near-duplicate facts into canonical representatives would help the retrieval signal concentrate. iter25 emitted 176 canonicals and the aggregate dropped three points. The diagnosis is that canonical content out-ranks per-session gold while canonical labels do not subsume all golds — you lose the discriminative identity of the underlying fact and replace it with a generic representative. Semantic compression destroys discriminative identity. Most systems silently absorb this loss because they have no instrument to see it. We had the Inception Bench, so we saw it, and rolled the canonicals back.

Salience 3.0 was a catastrophic over-boost. A salience-weight sweep took us from a working 0.5 to 1.0 to 2.0 to 3.0. The 3.0 setting reweighted the relevance formula enough that high-salience-but-off-topic facts started winning over on-topic but lower-salience ones. R@5 dropped substantially. The interesting part is that this regression is only visible on an engine that has a real salience channel doing real work. A cheap vector system, where the salience signal is either absent or floor-pinned, cannot exhibit this regression — which sounds like a feature until you realise it means the channel is not contributing signal in either direction. The presence of a measurable over-boost regression is itself evidence that the channel matters.

bge-reranker-v2-m3 multi-session collapse. We swapped ColBERT 2.0 for bge-reranker-v2-m3 in the v7 sweep, expecting a marginal lift. ms R@5 dropped from 0.7273 to 0.4955. The bge reranker scores cross-session evidence on length-asymmetric pairs in a way that systematically punishes the bridge sessions multi-session queries depend on. Caught by the per-type breakdown. The aggregate number would have hidden it because the gains on other types nearly cancelled the multi-session collapse. Always read the per-type table.

Keyword-strip multi-recall. An early version of multi-recall stripped stop-words and ran the result as a second-angle query. It produced shorter, less specific queries that competed with the primary recall for top-5 slots without contributing differentiating information. Regressed before we switched to a pattern-based deterministic rewriter that produces semantically distinct angles rather than lexically truncated ones. The deterministic rewriter is what eventually became useful in era 3.

before-edges PPR. Personalized PageRank with a temporal “before” edge type was the obvious mechanism for temporal-reasoning lift. We implemented it. It did not fire selectively on temporal queries because the edge_type column was being ignored at traversal time — a function-signature bug where the edge filter was passed but never applied. By the time we caught and fixed it, the temporal-reasoning lift was being delivered by the Stage C time-anchored rebake, so the PPR contribution overlapped almost entirely with the formation gain. PPR stayed in the engine but contributes a fraction of a point at most.
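The bug shape, a filter that is accepted but never applied, looks like this in miniature. The edge list and function names are illustrative and are not the engine's traversal code.

```python
# Illustrative edges: (source, destination, edge_type).
EDGES = [
    ("a", "b", "before"),
    ("a", "c", "similar"),
    ("b", "d", "before"),
]


def neighbours_buggy(node, edge_type=None):
    """The edge_type argument is accepted but never used, so every edge is traversed."""
    return [dst for src, dst, _ in EDGES if src == node]


def neighbours_fixed(node, edge_type=None):
    """Apply the filter when one is given; None means 'any edge type'."""
    return [
        dst
        for src, dst, etype in EDGES
        if src == node and (edge_type is None or etype == edge_type)
    ]


print(neighbours_buggy("a", edge_type="before"))  # ['b', 'c']
print(neighbours_fixed("a", edge_type="before"))  # ['b']
```

Nothing errors in the buggy version; the traversal just stops being selective, which is why it only showed up as a missing effect on temporal queries.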

KU rebake iter82. After the ssp/ssu/tr/ms rebakes worked, knowledge-update was the remaining type below aggregate. We ran a rebake. It produced 12 facts. Net aggregate moved -2pp because of cross-type dilution. We rolled back the 12 facts atomically and never re-tried — knowledge-update appears to be retrieval-saturated at the current corpus density and the next move is either a different question framing or a different store-side mechanism (provisional candidate: a dedicated update-chain index, not on the roadmap yet).

Three rounds of permissive ssp prompts. “Extract at least 5–10 facts per session” sounds like more-is-better. It produced 10pp regressions in three separate variants. We rolled back each time. The pattern is that loose extraction prompts produce facts whose embeddings are diffuse and pull high-signal facts out of top-5. The right ssp prompt is short, type-conditional, and aggressively rejective. We arrived at it by elimination.

Documenting six rolled-back changes this thoroughly is not a confession; it is the floor for an honest research log. Every measurement system that does not produce a list this long is either lying or not measuring.

VII. The co-evolution claim

The Inception Bench and Mazemaker formed each other. This is not a hand-wavy systems claim. It is a specific sequence.

The bench-corpus moving to Postgres enabled targeted SQL surgery on specific gold sessions. Targeted SQL surgery enabled the Stage C rebake to operate on a precisely named subset of sessions, with namespace tags that made rollback atomic. Atomic rollback meant we could try formation interventions without risking the rest of the corpus — cheap experiments instead of expensive cataclysms. Cheap formation experiments broke the retrieval ceiling at iter74. Breaking the retrieval ceiling revealed that the rerank weights optimal on the pre-formation corpus were no longer optimal on the post-formation corpus — the corpus had a different density profile, and the rerank optimum shifted with it. That became era 3.

Each iteration the bench got harder to beat and the engine got better at beating it. Iter72’s ceiling proof would have looked like a permanent wall in a system where the corpus could not be surgically modified; in our system it became the threshold between era 1 and era 2. Iter83’s rerank-feedback discovery would have been invisible in a system where the corpus could not change density between iterations; in our system it was the threshold between era 2 and era 3. The benchmark is part of the engine. The engine is part of the benchmark. We did not plan this. We noticed it around iter80 and named it.

The practical consequence is that the Inception Bench is not the kind of benchmark you train against once and then frame on the wall. It is the kind of benchmark you keep open in another terminal because the act of working with it is the work. The deterministic substrate makes that possible; the LLM-judge alternative would have made it impossible. Every move would have been contaminated by judge variance and you would never know if a 0.4pp lift was real.

VIII. Why others will break on this bench

This section earns the right to a measured arrogance. We are not going to name competitors. The receipts will name them by exclusion.

Systems gated on JSON output will keep scoring 0/N on capable small models. The Hindsight “not viable” ten became eight-of-ten ≥90% the moment we removed the JSON-shape gate. The 270M-param Gemma on a Raspberry Pi cleared 90% on Comparison Bench because all it had to do was emit a plain-text canonical value and let the bench do the substring match. That is not a memory difference between the engines. It is a UI difference between the graders. Any benchmark that scores JSON parsing as part of its memory score is measuring two things and reporting one. Don’t conflate them.

Systems with LLM judges will keep publishing numbers that swing 16 points depending on which judge you trust. We measured this directly on identical answers (κ = 0.93 between Haiku-4-5 and Opus 4.7, gpt-5-nano outlier at -16pp). The only durable way to remove judge variance is to remove the judge. Pick a judge or pick a benchmark; you cannot have both. Substring scoring against a tight gold is uglier to design but it produces numbers that are the same on Tuesday as they were on Monday.

Systems without graph structure cannot exhibit salience over-boost regressions. This is the most diagnostic line in the post. If your retrieval is pure vector with a flat top-K, and someone “increases salience weight” in your relevance formula, nothing measurable happens, because there is no salience channel doing differentiating work in the first place. We saw a real multi-point regression at salience 3.0 specifically because the channel was contributing real signal in both directions. It sounds like a flaw. It is the opposite: it is evidence the channel exists.

Systems that treat formation as a black box will keep hitting plateaus that look like “retrieval is hard” but are actually “your corpus does not have the fact in a shape your embedder can find.” The iter72 ceiling is the field’s ceiling. We climbed past it by acknowledging that the question is upstream of retrieval. A research program that does not own its formation pipeline cannot, even in principle, run the move that produced the +33pp ssp lift.

The Hop-2 reasoning lift is structurally impossible in vector-only stores. One of the lab side-experiments measured Hop-2 reasoning (the answer requires composing two facts that share no surface tokens but share a graph neighbour). Pre-dream-consolidation R@10 on Hop-2: 0.00. Post-dream-consolidation R@10 on Hop-2: 1.00. Vector databases cannot do this. Not “haven’t done it yet.” Cannot. Hop-2 requires a graph traversal across an edge type that emerged from consolidation; cosine similarity over a flat index has no mechanism to find it. This is not a benchmark trick. It is the load-bearing reason graph structure is in the engine.
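As a toy illustration of what a two-hop lookup involves, the sketch below links two facts through a shared entity node and finds one from the other by traversal. The node names and the graph are invented for the example; this is not Mazemaker's consolidation or traversal code.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from (node, node) pairs."""
    graph = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def two_hop(graph, start):
    """Nodes exactly two edges away from start (via any shared neighbour)."""
    one_hop = set(graph.get(start, set()))
    reached = set()
    for middle in one_hop:
        reached |= graph.get(middle, set())
    return reached - one_hop - {start}


# Two facts with no surface tokens in common, joined only by an entity node.
graph = build_graph([
    ("fact:sister-visits-in-june", "entity:anna"),
    ("fact:anna-lives-in-porto", "entity:anna"),
])
print(two_hop(graph, "fact:sister-visits-in-june"))  # {'fact:anna-lives-in-porto'}
```

The second fact is reached through the shared neighbour rather than through any similarity between the two fact texts.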

The honest part. We do not claim #1 on benchmarks we did not build. We claim our number on the benchmark we built, and we publish every artifact so anyone with the engine can rerun it. If a peer ships against the Inception Bench and beats our 0.8426, we will say so out loud. If a peer ships their own benchmark, we will run our engine on it. Mazemaker is the original in the category, the receipts are all in public repos, and the only reason we built the Inception Bench is that the existing public benchmarks are broken in ways their authors have not corrected. None of that is hostile. All of it is documented.

IX. Reproducibility & the open challenge

Everything in this post is reproducible. Every iteration’s configuration, every per-type metric, every rolled-back change is on disk in the public repo.

Result JSONs: benchmarks/external/results/loop-iter00.json through loop-iter100.json. Each file contains the full per-type breakdown, the per-question gold/prediction pairs, and the configuration that produced it.
Iteration history: benchmarks/external/history.tsv — one row per iteration, all 100, with the lever changed and the delta.
Reproduction recipe: benchmarks/INCEPTION_BENCH_GUIDE.md.
Corpus snapshot: the pinned pg_restore dump with sha256 in the guide.

The minimal recipe:

git clone https://github.com/itsXactlY/mazemaker
cd mazemaker
# 1. restore the pinned bench corpus
sha256sum -c benchmarks/external/corpus.sha256
pg_restore -d mazemaker_bench benchmarks/external/corpus.dump
# 2. run the Inception Bench against the champion stack
python benchmarks/inception_bench.py \
    --config benchmarks/external/configs/iter95.yaml \
    --out benchmarks/external/results/replication.json
# 3. compare to our published champion
python benchmarks/external/compare.py \
    --baseline benchmarks/external/results/loop-iter95.json \
    --candidate benchmarks/external/results/replication.json

A successful replication should land within ±0.5pp of R@5 = 0.8426 on n=500 (the per-question granularity of this corpus is ~0.2pp per question; sub-0.5pp deltas need two reps to be claimed). If you cannot replicate, file an issue with the diff between your result JSON and ours; we will treat it as a bug in our recipe until proven otherwise.

The open challenge is implicit in the publication. Every team that has shipped a memory benchmark has an engine. Our engine is on the Inception Bench at 0.8426. The recipe is above. If a peer engine beats 0.8426 on this corpus, the right move is to publish the result JSON and tell us. We will link it from this page.

The only benchmark worth claiming a #1 on is one you did not build. So we built one and did not claim a #1 on it. We just shipped the receipts.

Every empirical claim in this post traces to a result JSON in benchmarks/external/results/loop-iter*, a commit SHA, or a dated audit memory. The Inception Bench runs against LongMemEval-oracle 500q with one ~25,000-memory haystack per question, deterministic seed, substring scoring against gold, no LLM judge. Champion: iter95 / iter100 at R@5 = 0.84xx, R@10 = 0.9000 (iter97), R@1 = 0.6255 (iter97). The engine that produced these numbers ships as Mazemaker.