The comparison matrix
Four projects often discussed under the “AI memory” umbrella, all benched on the same retrieval harness. Verified numbers where we have them. Queued runs where we don’t. No fabricated numbers.
Reference points from our own 100-iteration loop on the harder LongMemEval-oracle 500q (25,000-memory haystack per question): R@5 = 0.8426 (iter95), R@10 = 0.9000 (iter97), R@1 = 0.6255, MRR = 0.7124, and R@10 = 1.0000 on the single-session-user (ssu) subset. Methodology and per-iteration audit at /research; the full story at /blog/inside-the-100-iteration-loop/; the benchmark engine itself documented at /blog/inception-benchmarking/.
Every row in the table below links to a focused page. If a number appears, the JSON it came from is in benchmarks/external/results/ and the run script is reproducible from a single curl. If “queued” appears, the harness exists and the methodology is locked — we’re still scheduling the run. We don’t publish numbers we haven’t measured.
This is comparison work, not a competition. Different memory systems make different tradeoffs — tool-call cost, ingest cost, graph quality, latency. We surface those tradeoffs on the same dataset so the choice is legible.
| Project | Architecture (their framing) | Our run | Status | |
|---|---|---|---|---|
| Hindsight | Published an evaluation of 10 small LLMs against a structured JSON-output protocol; scored 0/N across all ten. The published note describes JSON-schema conformance, not retrieval ability. | Same 10 models, same haystack, plain-text prompting through Mazemaker retrieval. | VERIFIED 188/200 = 94.0% | read → |
| Letta (formerly MemGPT) | OS-style memory hierarchy with main + archival + recall channels. Public LongMemEval results in their paper. | Identical LongMemEval-S 500-question harness against Mazemaker hybrid + ColBERT @ 1.5. | QUEUED | read → |
| A-MEM | Agentic memory framework, paper-published. Zettelkasten-style note linking with LLM-driven evolution. | Same LongMemEval-S 500 against Mazemaker; ablate the LLM-evolution stage to isolate retrieval signal. | QUEUED | read → |
| Cognee | Knowledge-graph + vector hybrid. Self-described as “AI memory at scale”. | Same LongMemEval-S 500. Cognee’s LLM-graph-construction step is included — cost + latency are part of the comparison. | QUEUED | read → |
Mazemaker baseline (same harness)
Every "our run" cell above measures against this:
| Bench | Config | R@1 | R@5 | R@10 | MRR | p50 |
|---|---|---|---|---|---|---|
| LongMemEval-S 500q | hybrid, master baseline | 0.8064 | 0.9596 | 0.983 | 0.8733 | — |
| LongMemEval-S 500q | hybrid + ColBERT @ 1.5 | 0.8574 | 0.9787 | 0.9894 | 0.9114 | 56.9 ms |
| Comparison Bench (10 models, n=20) | hybrid + ColBERT @ 1.5, plain-text | 188/200 = 94.0% (overall score, not R@k) | | | | < 1 s end-to-end |
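
For readers unfamiliar with the column headers, here is a minimal sketch of how R@k and MRR are commonly computed for a retrieval run. It assumes R@k means "at least one relevant memory appears in the top k results" and that MRR averages the reciprocal rank of the first relevant hit; the page does not spell out either definition, and the function names and toy data below are illustrative, not taken from the Mazemaker harness.

```python
from typing import Dict, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked result ids, relevant ids)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant id is in the top k results, else 0.0 (assumed definition)."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def summarize(runs: Sequence[Run], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Average per-question scores over all questions in a run."""
    if not runs:
        raise ValueError("need at least one question")
    n = len(runs)
    scores = {f"R@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n for k in ks}
    scores["MRR"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return scores


# Toy example with two questions (made-up ids, not benchmark data).
toy_runs = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2
    (["x", "y", "z"], {"x"}),  # first relevant hit at rank 1
]
toy_scores = summarize(toy_runs)
```

On the toy data, R@1 is 0.5 because only the second question has a relevant hit at rank 1, and MRR is the mean of 1/2 and 1/1.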
Reading rules
- VERIFIED = JSON exists in benchmarks/external/results/; run script reproduces from a single command; numbers cited match the JSON.
- QUEUED = methodology is locked; harness exists; no number cited until a JSON exists.
- If you find a discrepancy between the page and the JSON, that’s a bug — not a marketing decision.
- If another team produces a result that reverses our finding, we publish it on this page, unedited, with our re-run.
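
The "numbers cited match the JSON" rule can be checked mechanically. The sketch below is one possible way to do it, under stated assumptions: the result file is a flat JSON object mapping metric names to numbers, and the metric keys and file name are hypothetical, since the actual schema of the files in benchmarks/external/results/ is not described on this page. The tolerance of 5e-5 reflects that the page rounds to four decimals.

```python
import json
import math
from pathlib import Path
from typing import Dict, List


def find_discrepancies(json_path: str, cited: Dict[str, float], tol: float = 5e-5) -> List[str]:
    """Compare numbers cited on the page with a result JSON.

    Assumes a flat {"metric": number} layout, which is a guess at the schema.
    Returns a list of human-readable problems; an empty list means the page matches.
    """
    measured = json.loads(Path(json_path).read_text(encoding="utf-8"))
    problems = []
    for key, page_value in cited.items():
        value = measured.get(key)
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(f"{key}: missing or non-numeric in JSON")
        elif not math.isclose(value, page_value, abs_tol=tol):
            problems.append(f"{key}: page says {page_value}, JSON says {value}")
    return problems
```

A hypothetical call would look like `find_discrepancies("benchmarks/external/results/<run>.json", {"R@5": 0.9787})`, with `<run>` standing in for whichever result file backs the row being checked. Any non-empty return value is the kind of discrepancy the third rule above treats as a bug.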