Lab Notes — the maze, walked
Five chapters. One labyrinth. Each post is a station deeper into the engine. Walk them top-down for the full arc, or jump in anywhere; each station stands alone. Hard facts only: every claim traces to a JSON result file in benchmarks/external/results/ or to an audit transcript in benchmarks/audit/.
-
Inside the 100-iteration loop.
The full bench-driven engineering loop. iter00 anchor R@5 = 0.6851 → iter100 champion R@5 = 0.8426. Four architectural eras (retrieval → formation → rerank-feedback → top-K sharpness). The empirical proof that the bottleneck migrated upward in the cognition stack. As honest about the failed levers as the winning ones: the naive canonicalization regression, the salience 3.0 over-boost, the KU rebake we rolled back. The longform reference for anyone evaluating the system.
iter00 → iter100 | R@5 0.6851 → 0.8426 | R@10 = 0.9000 | R@1 = 0.6255 | 4 architectural eras | 1 breakthrough finding
-
Inception Benchmarking: the benchmark that did not exist.
Why we built our own memory benchmark instead of trusting the published ones. The Hindsight int() truncation bug: 9 of 10 ability evaluators silently dropped 0.5 partial credits. The 4/20 rubric-defect rate on BEAM-10M conv-1, cross-confirmed in a blind audit. The 16pp judge spread on identical answers. The 10 models published as “0/N · not viable” that we ran at 90%. If your benchmark cannot survive its own audit, your #1 is decoration, not signal.
9/10 evaluators bugged | 4/20 rubric defects | 16pp judge spread | 0/N → 18/20 (gemma3:270m) | Hop-2 R@10: 0.00 → 1.00
-
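The int() truncation failure mode named in the Inception Benchmarking note can be reproduced in a few lines. This is a minimal sketch of the general pattern only: the function names and the sample credits are illustrative assumptions, not Hindsight's actual evaluator code.

```python
def score_truncating(credits):
    """Buggy pattern: int() truncates toward zero, so a 0.5 partial credit becomes 0."""
    return sum(int(c) for c in credits)


def score_preserving(credits):
    """Keeps fractional credit intact by summing as floats."""
    return sum(float(c) for c in credits)


# Hypothetical per-item credits: one full credit, two partials, one miss.
credits = [1.0, 0.5, 0.5, 0.0]

print(score_truncating(credits))  # 1   (both partial credits silently dropped)
print(score_preserving(credits))  # 2.0
```

On this toy input the truncating scorer reports half of the score the preserving scorer does, and nothing raises or warns along the way.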
Formation beats retrieval-tuning.
100 iterations on the hard LongMemEval-oracle set (500 questions) across four architectural eras. R@5 climbed 0.6851 → 0.8426 (+15.75pp absolute, crossing the 0.84 stretch target). R@10 broke the 0.90 barrier (0.9000). ssu R@10 hit a perfect 1.0000. The retrieval-side knob surface saturated at 0.7404; memory formation broke through; the rerank-feedback discovery pushed it further. The bottleneck migrated upward in the cognition stack.
R@5 = 0.8426 | R@10 = 0.9000 | R@1 = 0.6255 | MRR = 0.7124 | ssu R@10 = 1.0000 | 100 iterations · 4 eras
-
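For readers unfamiliar with the R@k and MRR figures quoted in the Formation note, here is a minimal sketch of how such metrics are commonly computed. It assumes the hit-rate form of recall (a query counts as a hit if any gold item is in the top k); the bench's exact definition may differ, and all names and toy data below are hypothetical.

```python
from typing import Iterable, Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold id appears in the top-k results, else 0.0 (hit-rate form)."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1/rank of the first gold id in the ranking, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy queries: (ranked retrieval output, gold ids). Purely illustrative.
runs = [
    (["a", "b", "c"], {"a"}),        # first hit at rank 1
    (["x", "y", "z"], {"z"}),        # first hit at rank 3
    (["p", "q", "r"], {"missing"}),  # no hit
]

r_at_1 = mean(recall_at_k(r, g, 1) for r, g in runs)  # 1 of 3 queries hit
r_at_5 = mean(recall_at_k(r, g, 5) for r, g in runs)  # 2 of 3 queries hit
mrr = mean(reciprocal_rank(r, g) for r, g in runs)    # (1 + 1/3 + 0) / 3
```

Under this definition R@1 ≤ MRR ≤ R@k for any k large enough to cover every first hit, which matches the ordering of the figures reported above.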
The Mazemaker bench corpus now lives in Postgres.
One pg_restore, one sha256, four BEAM scales plus the full LongMemEval triplet. Plus a 282× bulk-write refactor and the two pg_attribute-namespace bugs that were silently leaking embedding dimensions across schemas.
1 pg_restore = full corpus | 957 MB snapshot | ~3 s restore | remember_batch: 88 min → 75 s (282×) | 207 sha256 provenance rows
-
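The "one sha256" check on the Postgres snapshot can be pictured as a digest comparison run before any restore. This is a sketch using Python's standard hashlib; the filename, the helper names, and the stand-in bytes are assumptions for illustration, not the project's actual tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large dump never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path: Path, expected_hex: str) -> bool:
    """True only if the file's digest matches the published provenance value."""
    return sha256_of_file(path) == expected_hex.lower()


# Demo with a stand-in file; a real run would point at the downloaded dump
# and compare against the published digest.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot.dump"  # hypothetical filename
    payload = b"stand-in bytes, not a real dump"
    snapshot.write_bytes(payload)
    expected = hashlib.sha256(payload).hexdigest()
    ok = verify_snapshot(snapshot, expected)
    print(ok)  # True
```

Verifying the digest first means a corrupted or mismatched download is rejected before the restore touches the database.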
Memory benchmarks should measure memory.
The note that started the maze. Why we ran the same 10 models Hindsight published as “not viable”, and what 270 million parameters can do when you stop gating on JSON. Eight rounds of GPT-5.5 review, v2 NO → v8 UNCONDITIONAL YES, no residual caveat.
R@5 = 0.9787 | MRR = 0.9114 | 188/200 = 94.0% | 0 errors | deterministic | Hop-2 R@10: 0.00 → 1.00