2026-05-10 · QUEUED

Letta vs Mazemaker

Letta (formerly MemGPT) ships an OS-style memory hierarchy (main context plus archival and recall channels) and reports public LongMemEval results in their paper. We will run it on the same harness as our master baseline and publish the JSON here. We report no number until the run lands.

Their architecture

Letta’s thesis is "memory as an OS": the LLM context window is treated as RAM, and the archival and recall channels swap content into and out of context via tool calls. The agent reasons about when to fetch its own memory rather than retrieving on every turn. Public benchmarks live in their paper and the letta repo.
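To make the RAM analogy concrete, here is a minimal toy sketch of a bounded main context that pages old messages out to an archival store and only reads them back on an explicit call. It is not Letta’s implementation or API: the class ToyMemory, its fields, and the archival_search method are hypothetical names invented for illustration, and the substring scan stands in for whatever search the real system uses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Toy model: a bounded main context backed by an archival store."""
    context_limit: int = 3                                # max messages kept "in RAM"
    main_context: deque = field(default_factory=deque)    # what the LLM would see
    archival: list = field(default_factory=list)          # paged-out messages

    def append(self, message: str) -> None:
        # New messages enter main context; the oldest are paged out when full.
        self.main_context.append(message)
        while len(self.main_context) > self.context_limit:
            self.archival.append(self.main_context.popleft())

    def archival_search(self, query: str) -> list:
        # Stand-in for a tool call the agent chooses to make.
        # Assumption: a case-insensitive substring scan, purely for illustration.
        q = query.lower()
        return [m for m in self.archival if q in m.lower()]


mem = ToyMemory(context_limit=2)
for msg in ["My dog is named Biscuit.", "I moved to Oslo.", "I like tea."]:
    mem.append(msg)

print(list(mem.main_context))      # the two most recent messages
print(mem.archival_search("dog"))  # the paged-out message, fetched on request
```

The point of the sketch is the control flow: the oldest fact is no longer visible in main context, so it is only recovered if the agent decides to issue the search.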

Methodology — locked

Dataset: LongMemEval-S, all 500 questions, identical haystack split.
Retrieval: top-k=10. Letta gets its native main+archival+recall channels; Mazemaker gets hybrid + ColBERT @ 1.5.
Judge: identical for both systems, substring_match on the LongMemEval gold span.
Metrics: R@1, R@5, R@10, MRR, p50/p95 retrieval latency, total tokens spent on tool-call overhead.
Hardware: 16 GB VRAM, identical embedding backend (BGE-M3 1024d) where the system supports an external embedder; otherwise we document the engine’s default embedder.
Reproducibility: a single shell script in benchmarks/external/letta_run.sh, committed before the run.
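The judge and metrics listed above can be sketched as follows. This is an illustrative sketch, not the harness code: the normalization inside substring_match (lowercasing, collapsing whitespace) is an assumption, the percentile uses the nearest-rank definition, and all function names and the toy data are hypothetical.

```python
import math


def substring_match(gold_span: str, text: str) -> bool:
    """True if the gold span occurs in text.

    Assumed normalization: lowercase and collapsed whitespace.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(gold_span) in norm(text)


def first_hit_rank(gold_span, retrieved, k=10):
    """1-based rank of the first top-k passage containing the gold span, else None."""
    for rank, passage in enumerate(retrieved[:k], start=1):
        if substring_match(gold_span, passage):
            return rank
    return None


def recall_at_k(ranks, k):
    """Fraction of questions whose first hit is at rank <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a question with no hit contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[idx]


# Toy data, invented for illustration: first-hit ranks for four questions
# and their retrieval latencies in milliseconds.
ranks = [1, 3, None, 1]
latencies_ms = [40.0, 50.0, 60.0, 200.0]

print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), mrr(ranks))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

Because both systems are scored from the same per-question ranks, any difference in the reported numbers comes from what was retrieved, not from how it was judged.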

Mazemaker reference (same harness)

Config	R@1	R@5	R@10	MRR	p50 latency
master baseline (hybrid)	0.8064	0.9596	0.983	0.8733	—
+ ColBERT @ 1.5	0.8574	0.9787	0.9894	0.9114	56.9 ms

Mazemaker JSON: longmemeval_s_colbert-on-master_20260510T034308Z.json
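As a reading aid for the reference table, the short sketch below computes the absolute gain from adding ColBERT and, for the recall metrics, the share of baseline misses it removes. The values are copied from the table; the function names are hypothetical and nothing here reads the JSON file, whose schema is not described on this page.

```python
# Values copied from the reference table above.
baseline = {"R@1": 0.8064, "R@5": 0.9596, "R@10": 0.983, "MRR": 0.8733}
colbert = {"R@1": 0.8574, "R@5": 0.9787, "R@10": 0.9894, "MRR": 0.9114}


def absolute_gain(metric: str) -> float:
    """Difference between the ColBERT row and the baseline row."""
    return colbert[metric] - baseline[metric]


def relative_miss_reduction(metric: str) -> float:
    """Share of baseline misses (1 - recall) removed by the ColBERT row."""
    base_miss = 1.0 - baseline[metric]
    return absolute_gain(metric) / base_miss


for metric in baseline:
    line = f"{metric}: +{absolute_gain(metric):.4f} absolute"
    if metric.startswith("R@"):
        # Miss reduction is only meaningful for the recall metrics.
        line += f", {relative_miss_reduction(metric):.1%} of baseline misses removed"
    print(line)
```

At R@1 the gain is 0.0510 absolute, which is about 26% of the baseline’s misses (0.0510 / 0.1936).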

What “queued” means

Harness exists. Methodology is locked. Run script is in the repo or staged.
The run hasn’t been executed yet, or has been executed but the JSON isn’t in benchmarks/external/results/ yet.
When the JSON lands, this page updates with the verified table — same shape as the Hindsight page.
If Letta’s numbers reverse our finding, we publish them here, unedited, with our re-run in the same JSON.

Why this comparison matters

Letta and Mazemaker disagree about where the memory work should happen. Letta puts a tool-calling LLM in the loop — the agent decides when to consult memory. Mazemaker puts memory in the retrieval path — the agent always sees the top-k and never has to ask. Both can be the right call. The benchmark answers which one retrieves the fact more often, at what cost, on the same questions.

The cost axis matters because Letta’s tool-call loop is not free: every "consult archival" decision is an extra LLM round-trip. We’ll report total tokens spent so the cost-vs-quality tradeoff is legible.
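One way the token overhead could be tallied is sketched below. The record shape (ToolCall with prompt and completion token counts), the function names, and the numbers are all hypothetical and chosen only for illustration; they are not measurements and not the harness schema.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str               # hypothetical label, e.g. "archival_search"
    prompt_tokens: int      # tokens sent to the LLM for this round-trip
    completion_tokens: int  # tokens the LLM produced for this round-trip


def overhead_tokens(calls) -> int:
    """Total tokens spent on memory tool calls for one question."""
    return sum(c.prompt_tokens + c.completion_tokens for c in calls)


def summarize(per_question) -> dict:
    """Aggregate overhead across questions; each item is a list of ToolCall."""
    totals = [overhead_tokens(calls) for calls in per_question]
    n = len(totals)
    return {
        "total_tokens": sum(totals),
        "mean_tokens_per_question": sum(totals) / n if n else 0.0,
        "mean_round_trips": sum(len(calls) for calls in per_question) / n if n else 0.0,
    }


# Invented example: one question with two memory round-trips, and one
# question with none (as in a system that retrieves without asking the LLM).
per_question = [
    [ToolCall("archival_search", 900, 40), ToolCall("archival_search", 950, 35)],
    [],
]
print(summarize(per_question))
```

Reporting round-trips next to tokens keeps the two costs separate: round-trips drive latency, tokens drive spend.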

When this page updates: GitHub issues · blog feed. License: AGPLv3 + PolyForm-NC. Contact: info@mazemaker.dev.