The Mazemaker bench corpus now lives in Postgres.

One pg_restore, one sha256, four BEAM scales plus the full LongMemEval triplet. Plus a 282× bulk-write refactor and the two pg_attribute-namespace bugs that were silently leaking embedding dimensions across schemas.


I. Why move the corpus into Postgres at all

Until this week, the canonical home of the bench corpora was /tmp/BEAM: a sprawl of JSON files, shard subdirectories, and ad-hoc conv_*.json blobs that a fresh checkout had to download, unpack, and verify by hand. Every clean-room run risked a different file ordering, a stale shard, or a quietly mutated probing-question file. The harness worked, but reproducibility was the operator’s problem.

That is the wrong shape for a benchmark whose entire point is “a peer reviewer should be able to re-run this without arguing with us about the dataset.”

So we collapsed the whole bench corpus into a single Postgres snapshot, mm_bench_raw, with one row per turn, one row per question, and one row per file-level source. Every source row carries a sha256 and an n_rows count. The dump is 957 MB. A pg_restore round trip restores the schema in roughly three seconds on a warm host. The dump file is the corpus; if your sha256 matches ours, you have the same bytes we did.
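A minimal sketch of the check described above, assuming the dump is an ordinary file on disk and the published digest is a hex string. The function names and the file name in the usage note are illustrative and not taken from the Mazemaker repo.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Stream the file so a ~1 GB dump never has to fit in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_matches(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would look like dump_matches(Path("mm_bench_raw.dump"), published_digest), where both the file name and published_digest are placeholders for whatever the release actually ships.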

What’s in the dump

beam_10m            : 208 696 turns ·  10 convs ·   200 probing-questions
beam_1m             :  74 630 turns ·  35 convs ·   700 probing-questions
beam_500k           :  38 058 turns ·  35 convs ·   700 probing-questions
beam_100k           :   5 732 turns ·  20 convs ·   400 probing-questions
longmemeval_s       :     500 questions ·    247 K msgs ·  25 K sessions
longmemeval_m       :     500 questions ·  2 450 050 msgs · 250 948 sessions
longmemeval_oracle  :     500 questions ·   10 960 msgs ·     948 sessions
bench_meta.sources  :     207 file-level provenance rows (sha256 + n_rows)

Four BEAM scales (10M / 1M / 500K / 100K), all three LongMemEval variants (S / M / Oracle), 2 000 probing questions across BEAM, full file-level provenance. One pg_restore and you have it.


II. remember_batch() — the bulk-write path

Loading a 500-source Atomic Fact Extraction (AFE) batch used to take 88 minutes of wall time, almost all of it spent in the per-fact remember() path: one embedding call, one transaction, one commit, per fact. The store was correct. It was also embarrassing.

We landed remember_batch() in both the SQLite and Postgres stores. The new path batches the embedding pass and then issues a single executemany insert per backend — one round-trip to FastEmbed, one transaction, one commit.
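The shape of that path can be sketched against SQLite as follows. This is an illustration under stated assumptions, not the shipped implementation: the table layout, the JSON-encoded embedding column, and toy_embed (a stand-in for the real FastEmbed call) are all hypothetical.

```python
import json
import sqlite3
from typing import Callable, List, Sequence

# Hypothetical embedder type: takes a batch of texts, returns one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def remember_batch(conn: sqlite3.Connection, facts: Sequence[str], embed: Embedder) -> int:
    """Embed all facts in one pass, then insert them in a single transaction."""
    if not facts:
        return 0
    vectors = embed(facts)  # one embedding call for the whole batch
    if len(vectors) != len(facts):
        raise ValueError("embedder returned a different number of vectors")
    rows = [(text, json.dumps(vec)) for text, vec in zip(facts, vectors)]
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO memories (content, embedding) VALUES (?, ?)", rows
        )
    return len(rows)


def toy_embed(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder for the sketch; a real store would call its embedding model here."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT NOT NULL, embedding TEXT)"
)
inserted = remember_batch(conn, ["alice likes tea", "bob moved to Oslo"], toy_embed)
```

The per-fact path pays for one embedding call and one commit per row; here both costs are paid once per batch, which is where the bulk of the saving would be expected to come from.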

AFE batch · 500 sources

before  remember() per fact  : 88 min   (~5 280 s)
after   remember_batch()      : ~75 s
speedup                        : 282×

The speedup is not a microbench. It is the time it now takes to load a real AFE corpus into a real store. We rebuild the corpus often enough that this changes what experiments are tractable in a single afternoon.


III. Schema-per-conv isolation (and the two bugs that almost shipped)

The new Postgres backend keeps each conversation in its own schema. conv_001, conv_002, …, each with its own memories table and its own pgvector column. HNSW indexes per schema, late-interaction tensors per schema, no cross-conv leakage on probing-question queries.

The DDL for a single conv schema is essentially this:

CREATE SCHEMA conv_001;
CREATE TABLE conv_001.memories (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    embedding   vector(384),            -- pgvector 0.8.2
    metadata    JSONB,
    created_at  TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON conv_001.memories
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
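Provisioning many conv schemas means generating that DDL per conversation. A hedged sketch of one way to do it: conv_schema_ddl and the index name memories_embedding_hnsw are hypothetical, and the strict name check is there because schema names cannot be passed as bind parameters.

```python
import re
from typing import List

# Only names like conv_001 are accepted, since the schema name is interpolated into DDL.
_SCHEMA_RE = re.compile(r"conv_\d{3,}")


def conv_schema_ddl(schema: str, dim: int = 384) -> List[str]:
    """Return idempotent DDL statements for one conversation schema."""
    if not _SCHEMA_RE.fullmatch(schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError(f"embedding dimension must be a positive int, got {dim!r}")
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"""CREATE TABLE IF NOT EXISTS {schema}.memories (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    embedding   vector({dim}),
    metadata    JSONB,
    created_at  TIMESTAMPTZ DEFAULT now()
)""",
        f"""CREATE INDEX IF NOT EXISTS memories_embedding_hnsw
    ON {schema}.memories
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)""",
    ]
```

Each statement can then be executed in order on a connection where the pgvector extension is already installed.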

Setting that up tripped over two real bugs in _ensure_embedding_column that are worth naming, because both were silent — they didn’t raise; they just produced wrong answers:

  1. pg_attribute queries without a namespace filter. The helper looked up the existing embedding dimension via pg_attribute but joined only on attrelid, so it picked up the first table named memories Postgres found — not the one in the conv schema being initialised. Result: the second conv inherited the first conv’s dim. If they differed, the insert path silently mismatched the vector type.
  2. Same shape in the “does column exist” check. The existence probe also crossed schemas, so a conv that should have provisioned a fresh column saw a sibling’s column and skipped the ALTER TABLE. The conv silently inherited a dimension it did not own.

The fix is mundane — add AND n.nspname = %s, scope every pg_attribute probe by namespace. The lesson is the usual one: every cross-schema query in Postgres is one missing nspname filter away from a wrong-answer bug, and these bugs do not throw. They just quietly converge on something that almost looks right.
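A namespace-scoped probe might look like the sketch below. It is not the actual _ensure_embedding_column code: the query text, the embedding_dim helper, and the choice to parse format_type output are assumptions, and cur is any DB-API cursor that uses %s placeholders (psycopg, for example).

```python
import re
from typing import Optional

# Every catalog lookup is pinned to one schema through pg_namespace.
DIM_QUERY = """
SELECT format_type(a.atttypid, a.atttypmod)
FROM pg_attribute a
JOIN pg_class c     ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = %s
  AND c.relname = %s
  AND a.attname = %s
  AND a.attnum > 0
  AND NOT a.attisdropped
"""

# Matches 'vector(384)' and a schema-qualified form such as 'public.vector(384)'.
_VECTOR_RE = re.compile(r"(?:^|\.)vector\((\d+)\)$")


def embedding_dim(cur, schema: str, table: str = "memories",
                  column: str = "embedding") -> Optional[int]:
    """Return the vector dimension of schema.table.column, or None if the column is absent."""
    cur.execute(DIM_QUERY, (schema, table, column))
    row = cur.fetchone()
    if row is None:
        return None  # no such column in THIS schema; siblings are never consulted
    match = _VECTOR_RE.search(row[0])
    if match is None:
        raise ValueError(f"{schema}.{table}.{column} is not a sized vector column: {row[0]}")
    return int(match.group(1))
```

Because the same query answers both "does the column exist" and "what dimension does it have", the two probes cannot drift apart, and a None result is the only case that should trigger ALTER TABLE.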


IV. The payoff — LongMemEval-S, late-interaction on, iteration speed up

The reason any of this matters is iteration speed on real benchmarks. With the corpus pinned in Postgres and the bulk-write path landed, the LongMemEval-S retrieval sweep now reads from a single source of truth and finishes in the time the embedding model takes to run. The numbers below are from the most recent sweep on the new stack (results files committed in-repo):

Metric       no-ColBERT   ColBERT @ 1.5   Δ
R@1          0.8064       0.8574          +5.10 pp
R@5          0.9596       0.9787          +1.91 pp
R@10         0.9830       0.9894          +0.64 pp
MRR          0.8733       0.9114          +3.81 pp
n_gradeable  470 / 500    470 / 500

The headline numbers are the same as in the prior post, and that’s the point. The corpus moved, the store changed, the bulk-write path landed, and the retrieval numbers didn’t shift. The benchmark is invariant under the infrastructure swap, which is exactly what you want from an infrastructure swap.
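For readers who want to recompute metrics of this kind from a results file, here is a small sketch of R@k and MRR. It assumes the hit-rate form of recall (a query counts as recalled at k if any relevant item appears in the top k); the harness may define recall differently, and the function names and toy data are illustrative.

```python
from typing import Hashable, List, Optional, Sequence, Set, Tuple

# One run per query: (ranked result ids, set of relevant ids).
Run = Tuple[Sequence[Hashable], Set[Hashable]]


def first_hit_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> Optional[int]:
    """1-based rank of the first relevant result, or None if nothing relevant was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return rank
    return None


def recall_at_k(runs: List[Run], k: int) -> float:
    """Fraction of queries whose first relevant result sits at rank <= k."""
    hits = 0
    for ranked, relevant in runs:
        rank = first_hit_rank(ranked, relevant)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(runs)


def mrr(runs: List[Run]) -> float:
    """Mean reciprocal rank; a query with no relevant result contributes 0."""
    total = 0.0
    for ranked, relevant in runs:
        rank = first_hit_rank(ranked, relevant)
        if rank is not None:
            total += 1.0 / rank
    return total / len(runs)


# Toy data: a hit at rank 1, a hit at rank 3, and a miss.
runs: List[Run] = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
    (["p", "q"], {"r"}),
]
```

The Δ column in the table is a plain difference in percentage points, for example (0.8574 - 0.8064) × 100 = 5.10 pp for R@1.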


V. What this unlocks next

Three things become tractable that weren’t last week:

  1. LongMemEval-M. 2.45M messages, 250 948 sessions. The medium variant has been in the public dataset since the original Wu et al. release. We have never run it. With the bulk-write path and the corpus pinned in Postgres, the load step is no longer prohibitive. Numbers when the sweep finishes.
  2. BEAM probing-questions scaling study. Same probing-question methodology across BEAM-100K, 500K, 1M, 10M, in one run, against the same engine config. We get a scaling curve, not a single dot. Infra is in place; sweep is running.
  3. Full mm_10m_eval 10-conv matrix on the new PG stack. Done. 442 verified questions, 770 rubric items, Sonnet judge v2. Aggregate mean 0.709 (95% CI [0.667, 0.750]); abstention precision 0.912; information_extraction 0.789; update_tracking 0.508. The long tail is topic-cluster on conv-10 at 0.024 — a real recall miss that goes onto the fix list, not a result we hide. Full per-conv breakdown on the research page.

The first two are not claims yet. They are work in progress with the infrastructure to make them possible, and that is the entire point of this post.