Letta vs Mazemaker
Letta (formerly MemGPT) ships an OS-style memory hierarchy — main context + archival + recall channels — with public LongMemEval results in their paper. We will run it on the same harness as our master baseline and publish the JSON here. No number until the run lands.
Their architecture
Letta’s thesis is "memory as an OS": the LLM-context window is treated as RAM, with archival and recall channels swapping into and out of context via tool calls. The agent reasons about when to fetch its own memory rather than retrieving on every turn. Public benchmarks live in their paper and the letta repo.
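To make the paging idea concrete, here is a toy sketch of a bounded main context with two backing stores. It is not Letta's API: the class `ToyPagedMemory`, its method names, and the keyword search are all hypothetical stand-ins, and a real system would use embedding search and an LLM deciding when to call the tool.

```python
from collections import deque


class ToyPagedMemory:
    """Illustrative only: bounded 'RAM' plus recall and archival stores."""

    def __init__(self, context_limit: int):
        self.context_limit = context_limit
        self.main_context = deque()  # what the model sees on each turn
        self.recall = []             # conversation history evicted from context
        self.archival = []           # long-term facts, reachable via a tool call

    def append_message(self, message: str) -> None:
        """Add to main context, evicting the oldest entries to recall."""
        self.main_context.append(message)
        while len(self.main_context) > self.context_limit:
            self.recall.append(self.main_context.popleft())

    def archival_insert(self, fact: str) -> None:
        self.archival.append(fact)

    def archival_search(self, query: str, k: int = 3) -> list:
        """Stand-in for a tool call; naive substring match replaces real search."""
        hits = [f for f in self.archival if query.lower() in f.lower()]
        return hits[:k]

    def page_in(self, facts: list) -> None:
        """Swap search results into main context, as the agent would after a fetch."""
        for fact in facts:
            self.append_message(fact)
```

The point of the sketch is the control flow: nothing from `archival` reaches `main_context` unless something calls `archival_search` and `page_in`, which in the agent-driven design is the model's own decision.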
Methodology — locked
- Dataset: LongMemEval-S, all 500 questions, identical haystack split.
- Retrieval: top-k=10. Letta gets its native main+archival+recall channels; Mazemaker gets hybrid + ColBERT @ 1.5.
- Judge: identical for both systems: `substring_match` on the LongMemEval gold span.
- Metrics: R@1, R@5, R@10, MRR, p50/p95 retrieval latency, total tokens spent on tool-call overhead.
- Hardware: 16 GB VRAM, identical embedding backend (BGE-M3 1024d) where the system supports an external embedder; otherwise document the engine’s default.
- Reproducibility: a single shell script in `benchmarks/external/letta_run.sh`, committed before the run.
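As a rough sketch of how the judge and the ranking metrics above fit together, the following computes R@k and MRR from the rank of the first retrieved passage that contains the gold span. The function names, the case-insensitive normalization, and the input shapes are assumptions for illustration; the harness's actual `substring_match` may normalize differently.

```python
from typing import Dict, Optional, Sequence


def substring_match(gold_span: str, passage: str) -> bool:
    """Assumed judge: case-insensitive containment of the gold span."""
    return gold_span.strip().lower() in passage.lower()


def first_hit_rank(gold_span: str, retrieved: Sequence[str]) -> Optional[int]:
    """1-based rank of the first passage containing the gold span, or None."""
    for rank, passage in enumerate(retrieved, start=1):
        if substring_match(gold_span, passage):
            return rank
    return None


def score(ranks: Sequence[Optional[int]], ks=(1, 5, 10)) -> Dict[str, float]:
    """Recall@k and mean reciprocal rank over per-question first-hit ranks."""
    if not ranks:
        raise ValueError("need at least one question")
    n = len(ranks)
    out = {
        f"R@{k}": sum(1 for r in ranks if r is not None and r <= k) / n
        for k in ks
    }
    # A miss (None) contributes 0 to the reciprocal-rank sum.
    out["MRR"] = sum(1.0 / r for r in ranks if r is not None) / n
    return out
```

With top-k=10, each question contributes one rank in 1..10 or a miss, so R@10 is simply the fraction of questions with any hit.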
Mazemaker reference (same harness)
| Config | R@1 | R@5 | R@10 | MRR | p50 latency |
|---|---|---|---|---|---|
| master baseline (hybrid) | 0.8064 | 0.9596 | 0.983 | 0.8733 | — |
| + ColBERT @ 1.5 | 0.8574 | 0.9787 | 0.9894 | 0.9114 | 56.9 ms |
What “queued” means
- Harness exists. Methodology is locked. Run script is in the repo or staged.
- The run hasn’t been executed yet, or has been executed but the JSON isn’t in `benchmarks/external/results/` yet.
- When the JSON lands, this page updates with the verified table, in the same shape as the Hindsight page.
- If Letta’s numbers reverse our finding, we publish them here, unedited, with our re-run in the same JSON.
Why this comparison matters
Letta and Mazemaker disagree about where the memory work should happen. Letta puts a tool-calling LLM in the loop — the agent decides when to consult memory. Mazemaker puts memory in the retrieval path — the agent always sees the top-k and never has to ask. Both can be the right call. The benchmark answers which one retrieves the fact more often, at what cost, on the same questions.
The cost axis matters because Letta’s tool-call loop is not free: every "consult archival" decision is an extra LLM round-trip. We’ll report total tokens spent so the cost-vs-quality tradeoff is legible.
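A minimal sketch of the cost accounting, assuming each tool call is logged with its prompt and completion token counts and each retrieval with a latency in milliseconds. The `ToolCall` record and its field names are hypothetical, not the harness's schema or Letta's API, and the percentile uses the nearest-rank method, which may differ from what the harness reports.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class ToolCall:
    """Hypothetical log record for one memory-consultation round-trip."""
    prompt_tokens: int
    completion_tokens: int


def tool_call_overhead(calls: Iterable[ToolCall]) -> int:
    """Total tokens spent on tool-call round-trips."""
    return sum(c.prompt_tokens + c.completion_tokens for c in calls)


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For a system that retrieves on every turn without a tool-call loop, the overhead sum is zero by construction, which is why the token column is reported alongside the recall metrics rather than on its own.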