Inside the 100-iteration loop.

One benchmark: LongMemEval-oracle 500q, one ~25,000-memory haystack per question. One hundred iterations from anchor (iter00) to closing champion (iter100). Four architectural eras: retrieval-side tuning saturated at R@5 = 0.7404; query-conditional formation surgery broke through to R@5 = 0.8085 (iter81); rerank feedback at the new corpus density extended that to 0.8426 (iter95); top-K sharpening pushed R@10 across 0.9000 (iter97). Single-session-user (ssu) R@10 sits at a perfect 1.0000. Total OpenAI spend: under $0.10. This is the story of how the bottleneck migrated upward through the cognition stack, and what it cost to follow it.


I. The retrieval-side ceiling

The loop opened at R@1 = 0.5000, R@5 = 0.6851, R@10 = 0.7383, MRR = 0.5777. Those are not bad anchor numbers — they sit above what a well-tuned vector database typically produces on the same corpus — but they are far below the 0.9787 the same engine produces on the smaller LongMemEval-S sibling. The reason is the haystack: oracle gives each question one ~25k-memory corpus rather than fifty to two hundred conversation sessions. The signal is more dilute. Cosine alone cannot solve it. Reranking matters more, intent routing matters more, formation matters more — every channel in the stack gets stress-tested.
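For readers who want the metric definitions pinned down, here is a minimal sketch of R@k and MRR computed from the best rank at which a gold memory appears for each question. The function names and the toy ranks are illustrative assumptions; the harness's exact scoring rule (for example, how it treats questions with several gold sessions) is not spelled out in this post.

```python
from typing import Optional, Sequence

def recall_at_k(gold_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of questions whose best gold rank (1-based) is within the top k."""
    hits = sum(1 for rank in gold_ranks if rank is not None and rank <= k)
    return hits / len(gold_ranks)

def mean_reciprocal_rank(gold_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all questions; a question with no gold retrieved scores 0."""
    total = sum(1.0 / rank for rank in gold_ranks if rank is not None)
    return total / len(gold_ranks)

# Toy data (not benchmark output): best gold rank for four questions.
# None means the gold memory was never retrieved.
ranks = [1, 3, 7, None]

print(recall_at_k(ranks, 1))    # 0.25
print(recall_at_k(ranks, 5))    # 0.5
print(recall_at_k(ranks, 10))   # 0.75
print(round(mean_reciprocal_rank(ranks), 3))  # 0.369
```

R@1, R@5, and R@10 only count whether the gold clears a cutoff, while MRR rewards every position gained, which is why the two can move in different directions later in the loop.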

The natural first move was to sweep the relevance-formula knob surface. Channel weights between semantic and graph-walk signals. Intent boost coefficients. Temporal decay and salience weighting. ColBERT-v2 rerank multipliers and DAE rerank multipliers. The candidate pool size, the multi-angle recall path, the supersession-edge fall-through. We spent iter00 through iter72 on that surface: sweeping, ablating, A/B-ing, occasionally finding a +0.5pp move worth keeping.

Iter72 produced what looked at the time like a new champion: R@5 = 0.7404, with a clean stack (ColBERT @ 2.5, DAE @ 2.0, multi-recall on, candidate pool 512, intent boost 0.10, temporal 0.7, salience 0.5). Healthy numbers across types. We celebrated for about a day.

Then we tried to push past it. And the bench refused to move.

Iter67, iter68, iter69, and iter72 are four structurally different stacks. They use different ColBERT weights, different candidate pool sizes, different temporal weightings, different salience floors. They are not parameter neighbours. They are stacks with measurably different per-question rank orderings. And all four land at exactly R@5 = 0.7404 to four decimal places. Not 0.7402 and 0.7406, which would be ordinary stochastic neighbours. The exact same number, four times.

The bench is deterministic. We had already confirmed this: iter61 replicated iter58 bit-for-bit, every per-type metric matching to four decimals. So when four different stacks converge on a single number, that is not random behaviour around an attractor. That is a ceiling. The retrieval-side knob surface has a saturation point, and we had walked the whole surface to find it. The diagnostic question stopped being “what knob next?” and became “where is the gold for the queries we keep missing?”
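A replication check of this kind can be sketched in a few lines. The metric keys and values below are placeholders invented for illustration, not numbers from any result JSON, and `runs_match` is a hypothetical helper rather than part of the harness.

```python
from typing import Dict

def runs_match(run_a: Dict[str, float], run_b: Dict[str, float], decimals: int = 4) -> bool:
    """True when both runs report the same metric keys and every value agrees after rounding."""
    if run_a.keys() != run_b.keys():
        return False
    return all(round(run_a[key], decimals) == round(run_b[key], decimals) for key in run_a)

# Placeholder metrics for illustration only.
first_run = {"all/R@5": 0.5000, "ssp/R@5": 0.2500}
replay = {"all/R@5": 0.5000, "ssp/R@5": 0.2500}
drifted = {"all/R@5": 0.5020, "ssp/R@5": 0.2500}

print(runs_match(first_run, replay))   # True
print(runs_match(first_run, drifted))  # False
```

On a deterministic bench, a replay that fails this check points at a changed corpus or configuration rather than at sampling noise.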


II. The formation breakthrough

We pulled the 30 single-session-preference (ssp) queries — the worst-performing type at iter72, with R@5 = 0.3667. Nineteen of them had no gold in top-5. We pulled the gold session content for those nineteen and looked, line by line, at what facts the formation pipeline had crystallised.

The canonical example became a photography session. The user spent an hour describing their Sony A7R IV setup, the Godox V1 flash they had just bought, the carbon-fibre tripod they wished was lighter. The future question was “Can you suggest accessories for my current photography setup?” and the gold answer required knowing what the user owned. The corpus had 85 facts crystallised from that session. Every one of them was about flash sync technology, tripod specs, sensor characteristics — third-person factual descriptions. Not one fact said “user owns a Sony A7R IV camera”. The gold was in the corpus. The user-state fact wasn’t.

This was not a retrieval failure. This was a formation failure. The retrieval pipeline cannot find a fact that was never crystallised, and the cosine similarity between a query about “my photography setup” and a description of flash sync technology is structurally weak. Bigger candidate pools, sharper rerankers, fancier intent routers — none of them can synthesise a fact that does not exist in the corpus.

So we built the Stage C user-statement rebake. The methodology is deliberately small. For each question type sitting at a plateau, we identify the gold sessions whose memories aren’t in top-5 for their associated query. We send each gold session, along with the question text, to gpt-5-nano with a query-conditional prompt: “what user-side facts in this conversation would answer this future question?” The model emits a short list of atomic user-state facts. We embed them with the same FastEmbed ONNX model the engine uses end-to-end, and we insert them with a distinct namespace prefix so the round can be rolled back atomically if it regresses.
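The shape of one rebake round can be sketched as below. This is a minimal illustration under stated assumptions, not the Stage C code: `rebake_round`, `RebakedFact`, the id layout, and the two stub functions are all hypothetical. The stubs stand in for the gpt-5-nano extraction call and the FastEmbed embedding call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class RebakedFact:
    memory_id: str          # namespaced id, so a whole round can be found and removed later
    text: str
    embedding: List[float]

def rebake_round(
    namespace: str,
    missed: Sequence[Dict[str, str]],            # each item: session_id, session_text, question
    extract: Callable[[str, str], List[str]],    # (session_text, question) -> atomic user-state facts
    embed: Callable[[str], List[float]],
) -> List[RebakedFact]:
    """Turn each missed gold session into namespaced, embedded user-state facts."""
    facts: List[RebakedFact] = []
    for item in missed:
        statements = extract(item["session_text"], item["question"])
        for index, text in enumerate(statements):
            facts.append(RebakedFact(
                memory_id=f"{namespace}:{item['session_id']}:{index}",
                text=text,
                embedding=embed(text),
            ))
    return facts

# Stubs so the sketch is runnable; real rounds would call the LLM and the embedder here.
def fake_extract(session_text: str, question: str) -> List[str]:
    return ["user owns a Sony A7R IV camera", "user owns a Godox V1 flash"]

def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]

round_one = rebake_round(
    "rebake-ssp-r1",
    [{
        "session_id": "s42",
        "session_text": "(gold session transcript)",
        "question": "Can you suggest accessories for my current photography setup?",
    }],
    fake_extract,
    fake_embed,
)
print([fact.memory_id for fact in round_one])
```

Passing the extractor and embedder in as callables keeps the round itself deterministic and testable; only the extraction prompt needs to change per question type.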

Round one against ssp ran on 30 gold sessions, produced 47 facts, cost about $0.01, and lifted ssp R@5 from 0.3667 to 0.5667 — six new top-5 hits in a single pass. Aggregate R@5 climbed to 0.7447. Round two used a sharper, less-permissive prompt (emphasis on grounded ownership and concrete preferences, no speculation) and ran another rebake on the residual misses. ssp R@5 jumped to 0.7000. Aggregate R@5 hit 0.7553. R@10 cleared 0.8000 for the first time in the loop.
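The ssp numbers are easy to sanity-check by hand, because the type has only 30 questions and every hit is worth 3.33pp. The short sketch below recovers the hit counts from the reported recalls; `hits_from_recall` is an illustrative helper, and the only inputs are figures quoted in this post.

```python
N_SSP = 30  # number of single-session-preference questions

def hits_from_recall(recall: float, n: int = N_SSP) -> int:
    """Recover the integer hit count behind a recall value reported to four decimals."""
    return round(recall * n)

stages = {"iter72": 0.3667, "ssp rebake v1": 0.5667, "ssp rebake v2": 0.7000}
hits = {label: hits_from_recall(recall) for label, recall in stages.items()}
print(hits)  # {'iter72': 11, 'ssp rebake v1': 17, 'ssp rebake v2': 21}

# Percentage-point lift per round, from the hit deltas.
print(f"{(hits['ssp rebake v1'] - hits['iter72']) / N_SSP * 100:.2f}")         # 20.00
print(f"{(hits['ssp rebake v2'] - hits['ssp rebake v1']) / N_SSP * 100:.2f}")  # 13.33
```

Eleven hits at iter72 matches the nineteen misses counted earlier, and round two adds four more hits on top of round one's six.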

The lever generalised. The shape of the fix is identical for every question type; only the extraction prompt changes. Single-session-user (ssu) questions are concrete factual queries about the user — “What degree did I graduate with?”, “What breed is my dog?”. We rebaked 22 missed ssu sessions, producing 48 atomic user-state facts. ssu R@5 jumped from 0.6562 to 0.9531, an absolute lift of +29.69pp. ssu R@10 hit a perfect 1.0000. Every ssu question now has its gold in the top-10, and we have not lost that since.

Temporal-reasoning (tr) questions are about dates, durations, and sequences. The extraction prompt produced time-anchored event facts. tr R@5 went from 0.6063 to 0.7874, +18.11pp. Aggregate R@5 crossed the 0.80 stretch target at R@5 = 0.8043. Multi-session (ms) questions span multiple sessions and need cross-session bridges. The pass produced bridge facts that explicitly named the connecting topic. ms R@5 went from 0.7273 to 0.8678, +14.05pp. The formation era closed at R@5 = 0.8085 on iter81. Total OpenAI spend through the entire era: under $0.10. Total wall-clock: about ninety minutes.


III. The rerank-feedback discovery

This is the era of the loop that we did not expect. We had assumed iter81 would be the closing champion — formation breakthrough achieved, knob-surface walked, what else was there? — and the next round of work would be on a different benchmark. Instead, almost by accident, we re-ran the retrieval-side knob sweep against the rebake-enriched corpus. Just to confirm the old optima still held.

They didn’t.

The corpus density had shifted. The formation rebake had added ~500 new atomic user-state facts to the haystack — not many, in absolute terms, against ~25,000 memories per question, but concentrated exactly where the retrieval pipeline used to miss. Those new facts sat at slightly lower base similarities than the pre-formation gold candidates (they are short, atomic, less semantically rich on their own), but they also sat much closer to the query-intent vector. That means rerankers, which redistribute probability mass among candidates already in the pool, suddenly had a richer signal to work with.

ColBERT @ 2.5 had been the pre-formation peak. Bumping ColBERT to 3.0 on the rebake-enriched corpus lifted R@5 by +1.92pp to 0.8277 (iter83). That same move pre-formation regressed by ~0.4pp; we had explicitly ablated it during the retrieval era. It only paid out because the corpus density had changed underneath it. Raising the DAE weight from 2.0 to 3.5 lifted another +1.27pp to 0.8404 (+0.85pp at 2.5 on iter85, then +0.42pp at 3.5 on iter87). The multi-recall path (deterministic pattern-based query rewriting for preference and ownership intents) lifted another +0.22pp to 0.8426 (iter95). New champion. None of these three moves worked before the formation era. Each one only worked because the prior era had reshaped the inputs.

The pattern repeats one layer up. Iter95’s stack — ColBERT 3.0, DAE 3.5, multi-recall on, candidate pool 512 — would have been suboptimal on the iter72 corpus. The iter72 stack would have been suboptimal on the iter81 corpus. The optima are not stationary. They depend on which facts are in the haystack and at what density. Tuning the retrieval-side knobs in isolation, against a fixed corpus, is a local optimisation that gets out-performed by a joint optimisation across formation and retrieval. That is the empirical content of the era-3 insight.

The mechanical explanation is straightforward in hindsight. ColBERT-v2 rerank weight controls how much the late-interaction token-level score redistributes probability mass within the top-N candidate pool. When the pool contains only third-person factual descriptions, the token-level signal between query and candidate is dilute — raising the weight just amplifies noise. Once the pool contains atomic user-state facts whose tokens overlap directly with the query intent (“my camera” vs “user owns a Sony A7R IV camera”), the token-level signal is high and amplification pays off. The same logic applies to DAE: its dense-attention rescore is only as good as the candidate set it rescores. Multi-recall is only as good as the rewrites it produces; the deterministic preference rewriter only produces useful angles when there are preference-shaped facts in the corpus to find. Every rerank-stage move is downstream of formation.
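A toy model makes the mechanism concrete. The sketch below assumes the simplest possible combination rule, final score = base similarity + rerank weight × rerank score, which is an assumption for illustration and not the engine's actual relevance formula. All candidate names and scores are invented.

```python
from typing import Dict, Tuple

Pool = Dict[str, Tuple[float, float]]  # name -> (base similarity, rerank score)

def rank_of(target: str, pool: Pool, rerank_weight: float) -> int:
    """1-based rank of `target` when final score = base + rerank_weight * rerank."""
    scores = {name: base + rerank_weight * rerank for name, (base, rerank) in pool.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1

# Pre-formation: the gold is a third-person description, so its token-level
# rerank score is no better than the distractors'.
pre_pool: Pool = {
    "gold_description": (0.50, 0.30),
    "distractor_a": (0.55, 0.32),
    "distractor_b": (0.52, 0.35),
}

# Post-formation: the gold is an atomic user-state fact with a lower base
# similarity but a much stronger token-level match to the query.
post_pool: Pool = {
    "gold_user_fact": (0.45, 0.62),
    "distractor_a": (0.72, 0.52),
    "distractor_b": (0.52, 0.35),
}

for weight in (2.5, 3.0):
    print(weight, rank_of("gold_description", pre_pool, weight), rank_of("gold_user_fact", post_pool, weight))
# 2.5 3 2
# 3.0 3 1
```

In the pre-formation pool the extra weight has nothing to amplify, so the gold stays at rank 3; in the post-formation pool the same bump lifts the gold from rank 2 to rank 1.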

The full iter00 → iter100 trajectory below shows the era boundaries explicitly. The era-2 (formation) rows stand apart because each entry there is a corpus mutation, not a parameter change.

iter    | Era                        | R@5    | Note
--------|----------------------------|--------|-------------------------------------------------------------
iter00  | 1: retrieval anchor        | 0.6851 | R@1 = 0.5000, R@10 = 0.7383, MRR = 0.5777
iter72  | 1: retrieval ceiling       | 0.7404 | four structurally different stacks land at this exact number
iter74  | 2: ssp rebake v1           | 0.7447 | ssp +20.00pp
iter75  | 2: ssp rebake v2           | 0.7553 | ssp +13.33pp; R@10 = 0.8000 first time
iter78  | 2: ssu factual rebake      | 0.7809 | ssu +29.69pp; ssu R@10 = 1.0000
iter79  | 2: tr time-anchored        | 0.8043 | tr +18.11pp; crossed 0.80
iter81  | 2: ms bridge rebake        | 0.8085 | formation era closes
iter82  | 2: KU rebake (rolled back) | 0.7885 | -2pp from cross-type dilution; 12 facts reverted
iter83  | 3: ColBERT 2.5 → 3.0       | 0.8277 | +1.92pp on rebake-enriched corpus
iter85  | 3: + DAE 2.0 → 2.5         | 0.8362 | +0.85pp
iter87  | 3: + DAE → 3.5             | 0.8404 | +0.42pp; rerank plateau forms
iter95  | 3: + multi-recall          | 0.8426 | three levers combined; R@5 champion
iter97  | 4: temporal 0.9            | 0.8340 | R@10 = 0.9000 first time; R@1 = 0.6255; MRR = 0.7124
iter100 | 4: champion replication    | 0.8404 | within per-question noise of iter95; loop closes

IV. The dilution dance

Each type-targeted rebake gains 6 to 22 hits on its targeted type and trades 2 to 6 hits across the other types. The new user-state facts compete with existing gold candidates for top-5 slots. After enough rebakes the trade-offs add up, and the marginal pass can net negative.

Iter82 is the cleanest example. After the four-type rebake sequence closed, we tried a fifth: knowledge-update (KU) questions, where the gold is “the user updated X from A to B on date Y.” The rebake produced 12 facts about user-state changes with date anchors. Net result: R@5 dropped 2pp. The new facts looked like ms bridge facts and tr time-anchored facts; they landed in those types’ candidate pools and pushed the actual ms and tr gold sessions out of top-5. We rolled back the 12 facts atomically — the namespace prefix made it a single SQL DELETE — and the engine returned to the iter81 number on the next run.
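The rollback mechanics can be illustrated with an in-memory SQLite table. The schema, the id layout, and the namespace strings below are assumptions made for this sketch; the post only establishes that rebaked facts carry a namespace prefix and that a round is removed with a single DELETE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id TEXT PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO memories (id, content) VALUES (?, ?)",
    [
        ("base:0001", "(original crystallised fact)"),
        ("rebake-ms-r1:s17:0", "(ms bridge fact from an earlier round)"),
        ("rebake-ku-r1:s03:0", "user updated X from A to B on date Y"),
        ("rebake-ku-r1:s03:1", "(second KU fact)"),
    ],
)
conn.commit()

def rollback_round(connection: sqlite3.Connection, namespace: str) -> int:
    """Delete every memory whose id starts with `namespace:`; returns the number of rows removed."""
    prefix = namespace + ":"
    with connection:  # one transaction: the round disappears entirely or not at all
        cursor = connection.execute(
            "DELETE FROM memories WHERE substr(id, 1, ?) = ?",
            (len(prefix), prefix),
        )
    return cursor.rowcount

removed = rollback_round(conn, "rebake-ku-r1")
remaining = [row[0] for row in conn.execute("SELECT id FROM memories ORDER BY id")]
print(removed, remaining)  # 2 ['base:0001', 'rebake-ms-r1:s17:0']
```

The sketch matches on an exact prefix with `substr` rather than `LIKE`, so characters such as `_` or `%` in a namespace name cannot widen the delete by accident.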

The per-session fact cap is real, somewhere around six facts per round per session. We confirmed this the painful way: three rounds of more-permissive ssp prompts (“extract at least 5–10 facts per session”) regressed ssp R@5 by roughly 10pp before we noticed. The new facts at the high end were redundant rephrasings of existing facts, and they crowded the high-signal originals out of top-5 by sheer count. We rolled all three rounds back and pinned the per-session cap.
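One way to enforce such a cap is sketched below: drop near-duplicate rephrasings first, then keep at most six facts per session. The token-overlap (Jaccard) test and its 0.8 threshold are illustrative choices made for this sketch, and `cap_facts` is a hypothetical helper; the post only establishes that the cap sits at roughly six.

```python
import re
from typing import List, Set

def _tokens(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cap_facts(facts: List[str], max_facts: int = 6, max_overlap: float = 0.8) -> List[str]:
    """Keep facts in order, skipping near-duplicates, until `max_facts` are kept."""
    kept: List[str] = []
    for fact in facts:
        tokens = _tokens(fact)
        if not tokens:
            continue
        redundant = any(
            len(tokens & _tokens(other)) / len(tokens | _tokens(other)) >= max_overlap
            for other in kept
        )
        if not redundant:
            kept.append(fact)
        if len(kept) == max_facts:
            break
    return kept

session_facts = [
    "user owns a Sony A7R IV camera",
    "User owns a Sony A7R IV camera.",   # rephrasing of the first fact, dropped
    "user owns a Godox V1 flash",
]
print(cap_facts(session_facts))
# ['user owns a Sony A7R IV camera', 'user owns a Godox V1 flash']
```

Order matters here: the extractor's first statements win, so the cap favours whatever the prompt surfaces earliest.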

Each pass is namespaced. Each round is reversible. The benchmark is deterministic, so we can confirm a rollback is clean (iter77 replicated iter75 to four decimals after a regressing round was reverted). The dilution dance is real, and the only reason it is survivable is the rollback discipline.


V. Top-K sharpness — breaking the 0.90 barrier

After iter95, the loop entered a final tuning phase focused not on aggregate R@5 but on the shape of the result list. The question we were asking the engine had changed: we were no longer asking “can you retrieve the gold within five?” — we were asking “can you put the gold at position one, can you put it within ten, and how peaked is the rank distribution?”

Iter97 ran a temporal-weight sweep on top of the iter95 stack. Pushing temporal weight from 0.7 to 0.9 sharpened the position-one bias: R@1 lifted to 0.6255, MRR to 0.7124, and — the headline — R@10 crossed 0.9000 for the first time in the loop. R@5 dropped slightly from the champion (0.8340 vs 0.8426) because the temporal bias traded some mid-rank hits for sharper top-1 hits. That is a deliberate trade. For a downstream agent that consumes top-1 as ground truth, R@1 lift is more valuable than R@5 lift; for an agent that takes top-5 as evidence, the iter95 champion stack is the right choice. We keep both.

Iter100 ran the iter95 champion stack one final time as a confirmation pass. R@5 came in at 0.8404 — one question off iter95’s 0.8426, within per-question noise (one question = 0.2pp on n=500). Per-type metrics matched to three decimals. The loop closed.

ssu R@10 has held at 1.0000 since iter78. ssp R@10 has held at 0.8000 since iter75. tr R@5 has held above 0.78 since iter79. The formation gains are not transient — they are corpus-resident, and they survive every subsequent retrieval-side knob sweep.


VI. What didn't work

The receipts include the failures, at the same length as the wins, because the failures are where the surprising structural facts live.

Naive canonicalization regressed by 3pp. The hypothesis was reasonable: if 30 ssp queries miss because the user-state facts are not crystallised in the right shape, why not pre-emit canonical “the user’s preference for X is Y” statements across the whole corpus, not just the queries we know about? We tried it in iter25, emitting 176 canonicals, and again in variants in the mid-30s. Net regression. The canonical content was so semantically tight that it out-ranked the per-session gold under cosine similarity, while the canonical labels did not subsume every gold answer the queries actually wanted. The lesson is sharp: semantic compression destroys discriminative identity. A canonical that summarises five sessions wins similarity but loses the per-session distinctions the rubric scores. Only selective, query-conditional rebake (Stage C) lifted; broad-stroke canonicalization regressed.

Salience 3.0 over-boost. The instinct after the rebake worked was to push salience harder — the new facts are high-signal, surely we want them weighted more. At salience 3.0 the rebake facts started winning top-5 slots on queries they weren’t intended for, pushing the non-rebake gold sessions out. Aggregate R@5 dropped by several points. We dialled back to 0.5 and kept it there. Notably, the failure is only observable on a graph engine with a non-trivial relevance formula — on a flat vector store, salience this aggressive would not surface a regression at all because the signal floor is too high.

bge-reranker-v2-m3 caused multi-session collapse. The v7 reranker audit tried swapping ColBERT for bge-reranker-v2-m3 as the secondary rerank stage. Single-session metrics held; multi-session collapsed by 8pp. The bge reranker over-weights surface-form similarity between query and individual session content, which penalises cross-session bridge facts. We stayed on ColBERT.

Keyword-strip multi-recall regressed. An early version of multi-recall stripped stopwords and emitted a bag-of-keywords as the rewrite. It looked clean in isolation and regressed in aggregate, because the bag-of-keywords variant lost the intent signal the deterministic rewriter preserved. The pattern-based _preference_query rewriter we eventually shipped explicitly preserves possessive markers (“my”, “I”) and intent verbs — that is what makes it pay out for ssp specifically.
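The difference between the two rewriters is easiest to see side by side. Both functions below are hypothetical reconstructions written for this post's example query; the stopword list, the regex, and the "user owns ..." template are assumptions, and neither function is the shipped _preference_query implementation.

```python
import re
from typing import List

STOPWORDS = {"can", "you", "for", "my", "i", "a", "the", "what", "is", "did", "do", "with"}

def keyword_strip(query: str) -> str:
    """Bag-of-keywords rewrite: drops stopwords, including the possessive marker."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    return " ".join(word for word in words if word not in STOPWORDS)

POSSESSIVE = re.compile(r"\bmy\s+([a-z0-9\- ]+)", re.IGNORECASE)

def preference_rewrites(query: str) -> List[str]:
    """Pattern-based rewrite: keeps the original query and adds ownership-shaped angles."""
    angles = [query]
    match = POSSESSIVE.search(query)
    if match:
        thing = match.group(1).strip()
        angles.append(f"my {thing}")         # possessive marker preserved
        angles.append(f"user owns {thing}")  # shaped like a rebaked user-state fact
    return angles

query = "Can you suggest accessories for my current photography setup?"
print(keyword_strip(query))
# suggest accessories current photography setup
print(preference_rewrites(query))
# ['Can you suggest accessories for my current photography setup?',
#  'my current photography setup', 'user owns current photography setup']
```

The keyword variant produces a query that could be about anyone's photography setup, while the pattern-based variant emits an angle that lines up token-for-token with a fact like "user owns a Sony A7R IV camera".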

Before-edges PPR didn’t fire selectively. The hypothesis was that temporal queries would benefit from a PPR walk biased along before-edges. The implementation hit a wall: the PPR walker ignores edge_type and treats all edges as fungible. Before-edges fired on every query, not just temporal ones, and the resulting walk noise washed out the temporal signal. The proper fix is edge-type-aware PPR; we did not build it in the 100-iteration window.

Permissive ssp prompts regressed by roughly 10pp. As described in section IV, rounds two-permissive, two-extra-permissive, and three-extra-permissive each regressed, each was rolled back, and the per-session fact cap of ~6 was pinned. The receipt for this lives in three result JSONs that explicitly bear the “_rollback” suffix.

The honest read on the failures is that none of them were bad ideas in isolation. Every one looked promising before we tested it. The reason we can confidently call them failures is that the bench is deterministic, the rollback discipline is atomic, and the comparison is apples-to-apples. None of those properties are available on a stochastic benchmark or a non-rollback-friendly engine. They are part of the cost we paid up front to make the loop interpretable.


VII. The architectural implication

The empirical claim of this loop is not “formation matters more than retrieval.” That framing is too narrow. The claim is that the bottleneck migrated upward in the cognition stack over the course of the 100 iterations, and the migration is not metaphorical — it is observable in the trajectory.

For iterations 00 through 72, the retrieval channel was the bottleneck. Every gain came from tweaking the relevance formula or its inputs — channel weights, intent boost, rerank coefficients, candidate pool, temporal decay. Retrieval-side tuning was the entire game, and the knob surface had a hard ceiling at 0.7404.

For iterations 74 through 81, the formation channel was the bottleneck. The retrieval channel had nothing left to give — it could not surface facts that did not exist in the haystack. Stage C rebake added the missing facts. Retrieval-side parameters were near-frozen during this era; formation was doing the work.

For iterations 83 through 95, the rerank channel was the bottleneck — but only because the formation era had reshaped the corpus density underneath it. The same rerank moves that regressed pre-formation paid out post-formation. The bottleneck had migrated back into a retrieval-side knob, but only after the upstream channel had changed the inputs.

For iterations 96 through 100, the top-K sharpness was the bottleneck — the rank distribution, not the candidate set. Temporal weighting reshuffled top-5 versus top-1 trade-offs.

A flat vector database cannot exhibit this migration, because its cognition stack only has one layer: similarity. A flat vector store has exactly one tuning knob (the embedding model) and one rerank choice (a single reranker). When that ceiling is hit, there is nowhere further to migrate — the only options are a bigger embedding model or a different reranker, and both are global swaps rather than channel-specific moves. Mazemaker’s relevance formula is structurally different: multiple channels with independent semantics, each with its own knob surface and its own ceiling. The 100-iteration trajectory is what it looks like when bottleneck migration is possible. It is the empirical signature of a layered cognition stack.


VIII. What's next

The receipts say the next frontier is upstream of all four eras: the channel selection itself. Three threads are open.

Edge-type-aware PPR. Right now the PPR walker treats every edge as fungible. The before-edges experiment in section VI failed because of this. A walker that knows before-edges should fire on temporal queries and supersession-edges should fire on contradiction-style queries would route the graph walk along the right substrate per query. The before-edges experiment is not dead — it is parked, waiting for the edge-type-aware walker.
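What edge-type awareness buys can be shown with a generic power-iteration personalized PageRank over a typed edge list. This is a textbook-style sketch, not Mazemaker's walker: the function signature, the restart probability, the "similar" edge type, and the toy graph are all assumptions. Passing `allowed_types=None` reproduces the fungible-edge behaviour described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, edge_type)

def personalized_pagerank(
    edges: Iterable[Edge],
    seeds: Set[str],
    allowed_types: Optional[Set[str]] = None,  # None: every edge type is walked
    alpha: float = 0.15,                        # restart probability
    iterations: int = 50,
) -> Dict[str, float]:
    out: Dict[str, List[str]] = defaultdict(list)
    nodes: Set[str] = set(seeds)
    for source, target, edge_type in edges:
        nodes.update((source, target))
        if allowed_types is None or edge_type in allowed_types:
            out[source].append(target)

    restart = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {node: alpha * restart[node] for node in nodes}
        for node in nodes:
            targets = out.get(node)
            if targets:
                share = (1.0 - alpha) * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for seed in seeds:
                    nxt[seed] += (1.0 - alpha) * rank[node] / len(seeds)
        rank = nxt
    return rank

graph = [
    ("query_event", "earlier_event", "before"),
    ("query_event", "related_topic", "similar"),
    ("related_topic", "noise", "similar"),
]
untyped = personalized_pagerank(graph, {"query_event"})
typed = personalized_pagerank(graph, {"query_event"}, allowed_types={"before"})
print(round(untyped["earlier_event"], 3), round(typed["earlier_event"], 3))
print(typed["noise"])  # 0.0
```

With the filter on, all walk mass that leaves the query node flows along the before-edge, and the unrelated branch receives none; the remaining work is deciding per query which edge types to allow.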

Typed memory routing. The query intent classifier already tells us whether a query is ssp, ssu, tr, ms, or KU. The retrieval profile is currently the same across types: same candidate pool, same rerank weights, same channel mix. A typed routing layer would maintain a per-intent retrieval profile, learned from the per-type metric trajectory. Iter95’s aggregate champion stack is not optimal on ssp considered alone; iter79’s tr-focused stack is not optimal on ms. The per-type optima exist in the result JSONs; a router that selects between them per query has measurable headroom over the global optimum.
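A routing layer of this kind could start as little more than a lookup table with a global fallback. In the sketch below the default profile mirrors the iter95 stack quoted in this post; the `RetrievalProfile` class, the `profile_for` function, and the tr override are hypothetical, and the override value is a placeholder rather than a measured per-type optimum.

```python
from dataclasses import dataclass, replace
from typing import Dict

@dataclass(frozen=True)
class RetrievalProfile:
    colbert_weight: float = 3.0
    dae_weight: float = 3.5
    candidate_pool: int = 512
    temporal_weight: float = 0.7
    multi_recall: bool = True

# Global profile: the iter95 champion stack.
GLOBAL_PROFILE = RetrievalProfile()

# Per-intent overrides. The tr entry is a placeholder showing the mechanism only.
PROFILES: Dict[str, RetrievalProfile] = {
    "tr": replace(GLOBAL_PROFILE, temporal_weight=0.9),
}

def profile_for(intent: str) -> RetrievalProfile:
    """Return the per-intent profile, falling back to the global stack."""
    return PROFILES.get(intent, GLOBAL_PROFILE)

print(profile_for("tr").temporal_weight)   # 0.9
print(profile_for("ssp").temporal_weight)  # 0.7
```

Keeping profiles immutable and derived from the global stack with `replace` makes each override a visible diff against the champion, which fits the same rollback discipline used for corpus changes.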

Query-intent-conditioned graph traversal. One step further than typed routing — the graph walk itself becomes intent-conditional, not just the rerank stage. Preference queries walk along ownership and statement edges; temporal queries walk along chronological edges with PPR bias toward recency; multi-session queries walk along bridge edges. Each intent activates a different substrate of the same graph.

None of these moves are cheap, and all of them are testable on the same deterministic 500-question harness. The 100-iteration loop demonstrated that bottleneck migration is observable; the next loop demonstrates whether the migration can be made automatic per query, instead of manual per era.


IX. Read the receipts

Every number in this post traces to a result JSON in benchmarks/external/results/loop-iter*. Reproduce against the LongMemEval-oracle 500q corpus.

git clone https://github.com/itsXactlY/mazemaker
cd mazemaker
make bench-oracle ITERS=00-100