Lab Notes — the maze, walked

Operator commentary · 2026—

Five chapters. One labyrinth. Each post is a station deeper into the engine. Walk them top-down for the full arc, or jump in anywhere: each station stands alone. Hard facts only: every claim traces to a JSON in benchmarks/external/results/ or to an audit transcript in benchmarks/audit/.

// station 05 · the summit · 2026-05-18 · flagship

Inside the 100-iteration loop.

The full bench-driven engineering loop. iter00 anchor R@5 = 0.6851 → iter100 champion R@5 = 0.8426. Four architectural eras (retrieval → formation → rerank-feedback → top-K sharpness). The empirical proof that the bottleneck migrated upward in the cognition stack. Honest about the failed levers as much as the winning ones — naive canonicalization regression, salience 3.0 over-boost, the KU rebake we rolled back. The longform reference for anyone evaluating the system.

iter00 → iter100 · R@5 0.6851 → 0.8426 · R@10 = 0.9000 · R@1 = 0.6255 · 4 architectural eras · 1 breakthrough finding

enter station 05 →
// station 04 · the junction · 2026-05-18 · flagship · methodology

Inception Benchmarking: the benchmark that did not exist.

Why we built our own memory benchmark instead of trusting the published ones. The Hindsight int() truncation bug — 9 of 10 ability evaluators silently dropped 0.5 partial credits. The 4/20 rubric-defect rate on BEAM-10M conv-1, cross-confirmed in a blind audit. The 16pp judge spread on identical answers. The 10 models published as “0/N · not viable” that we ran at 90%. If your benchmark cannot survive its own audit, your #1 is decoration, not signal.

9/10 evaluators bugged · 4/20 rubric defects · 16pp judge spread · 0/N → 18/20 (gemma3:270m) · Hop-2 R@10: 0.00 → 1.00
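
For readers who have not hit this class of bug before, here is a minimal sketch of how an int() cast can silently erase 0.5 partial credits. The scores and function names are hypothetical illustrations, not taken from the Hindsight evaluators; only the failure mode (truncation toward zero) is the point.

```python
# Hypothetical per-criterion rubric scores for one answer.
# 0.5 means "partially correct"; these values are made up for illustration.
partial_credits = [1.0, 0.5, 0.5, 0.0, 1.0]


def total_truncating(scores):
    """Sum scores after an int() cast: int(0.5) == 0, so partial credit vanishes."""
    return sum(int(s) for s in scores)


def total_preserving(scores):
    """Sum scores as floats, keeping every 0.5 intact."""
    return sum(float(s) for s in scores)


truncated = total_truncating(partial_credits)
preserved = total_preserving(partial_credits)

print(f"with int():    {truncated} / {len(partial_credits)}")
print(f"without int(): {preserved} / {len(partial_credits)}")
```

No exception is raised and the result still looks like a valid score, which is why a bug of this shape can sit in an evaluator unnoticed.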

enter station 04 →
// station 03 · the climb · 2026-05-18 · benchmarks · formation

Formation beats retrieval-tuning.

100 iterations on the hard LongMemEval-oracle 500q across four architectural eras. R@5 climbed 0.6851 → 0.8426 (+15.75pp absolute, crossing the 0.84 stretch target). R@10 broke the 0.90 barrier (0.9000). ssu R@10 hit perfect 1.0000. The retrieval-side knob surface saturated at 0.7404; memory formation broke through; the rerank-feedback discovery pushed it further. The bottleneck migrated upward in the cognition stack.

R@5 = 0.8426 · R@10 = 0.9000 · R@1 = 0.6255 · MRR = 0.7124 · ssu R@10 = 1.0000 · 100 iterations · 4 eras
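
A minimal sketch of how metrics like R@K and MRR are commonly computed, for readers checking the numbers above. It assumes the hit-rate variant of recall (a query counts if any gold item lands in the top K); the post does not state which variant the bench uses, and the toy queries and function names here are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Hit-rate recall: 1.0 if any gold item is in the top k results, else 0.0."""
    return 1.0 if any(doc in gold_ids for doc in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in gold_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy data (made up): each entry is (ranked retrieval result, gold ids).
queries = [
    (["a", "b", "c"], {"b"}),  # gold item at rank 2
    (["x", "y", "z"], {"q"}),  # gold item never retrieved
]

r_at_1 = mean(recall_at_k(r, g, 1) for r, g in queries)
r_at_5 = mean(recall_at_k(r, g, 5) for r, g in queries)
mrr = mean(reciprocal_rank(r, g) for r, g in queries)

# The headline gain is a difference in percentage points, not a relative change.
gain_pp = round((0.8426 - 0.6851) * 100, 2)
```

On the toy data this gives R@1 = 0.0, R@5 = 0.5, and MRR = 0.25; `gain_pp` reproduces the +15.75pp quoted for R@5.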

enter station 03 →
// station 02 · the corridor · 2026-05-13 · infrastructure

The Mazemaker bench corpus now lives in Postgres.

One pg_restore, one sha256, four BEAM scales plus the full LongMemEval triplet. Plus a 282× bulk-write refactor and the two pg_attribute-namespace bugs that were silently leaking embedding dimensions across schemas.

1 pg_restore = full corpus · 957 MB snapshot · ~3 s restore · remember_batch: 88 min → 75 s (282×) · 207 sha256 provenance rows
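
A minimal sketch of the "one sha256" step: hashing a snapshot file in chunks and comparing it to a published digest before restoring. The function names are hypothetical and the expected digest must come from the provenance record; nothing here reflects the project's actual tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so a large snapshot never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_matches(path, expected_hex):
    """True if the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Run the check before pg_restore, and refuse to restore on a mismatch.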

enter station 02 →
// station 01 · the entrance · 2026-05-10 · audit

Memory benchmarks should measure memory.

The note that started the maze. Why we ran the same 10 models Hindsight published as “not viable”, and what 270 million parameters can do when you stop gating on JSON. Eight rounds of GPT-5.5 review, v2 NO → v8 UNCONDITIONAL YES, no residual caveat.

R@5 = 0.9787 · MRR = 0.9114 · 188/200 = 94.0% · 0 errors · deterministic · Hop-2 R@10: 0.00 → 1.00

enter station 01 →

More stations as the maze grows. Audit transcripts at benchmarks/audit/. Result JSONs at benchmarks/external/results/. Full claim-evidence bundle (1.6 GB — restorable pg_dump + all iter JSONs + 8-round audit + bench-loop logs): ProtonDrive · SHA-256 263e2494….
