The comparison matrix

Lab notes · 2026-05-10 · aLca, Mazemaker

Four projects often discussed under the “AI memory” umbrella, all benched on the same retrieval harness. Verified numbers where we have them. Queued runs where we don’t. No fabricated numbers.

Reference points from our own 100-iteration loop on the harder LongMemEval-oracle 500q (25,000-memory haystack per question): R@5 = 0.8426 (iter95), R@10 = 0.9000 (iter97), R@1 = 0.6255, MRR = 0.7124, ssu R@10 = 1.0000. Methodology and per-iteration audit at /research; the full story at /blog/inside-the-100-iteration-loop/; the benchmark engine itself documented at /blog/inception-benchmarking/.
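For readers unfamiliar with the metrics, here is a minimal sketch of how R@k and MRR are commonly computed over ranked retrieval results. It assumes "any-hit" recall, where a question counts as recalled if at least one gold memory appears in the top k. The harness may define recall differently (for example, as the fraction of gold items retrieved), so treat the function names and the toy data as illustrative assumptions, not as the Mazemaker implementation.

```python
from typing import Dict, List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked memory ids, gold memory ids)


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Any-hit recall: 1.0 if a gold id appears in the top k, else 0.0."""
    return 1.0 if any(mid in gold_ids for mid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """1 / rank of the first gold id, or 0.0 if none was retrieved."""
    for rank, mid in enumerate(ranked_ids, start=1):
        if mid in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Run], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Average R@k and MRR over all questions."""
    n = len(runs)
    metrics = {f"R@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n for k in ks}
    metrics["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return metrics


# Toy data (made up for illustration, not benchmark output).
toy_runs: List[Run] = [
    (["m3", "m1", "m7"], {"m1"}),  # gold at rank 2
    (["m2", "m9", "m4"], {"m4"}),  # gold at rank 3
    (["m5", "m6", "m8"], {"m0"}),  # gold never retrieved
]
toy_metrics = evaluate(toy_runs)
print(toy_metrics)
```

MRR rewards placing the first relevant memory high in the ranking, which is why it can sit well below R@10 even when R@10 is high.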

Every row in the table below links to a focused page. If a number appears, the JSON it came from is in benchmarks/external/results/ and the run can be reproduced from a single curl command. If “queued” appears, the harness exists and the methodology is locked; we are still scheduling the run. We don’t publish numbers we haven’t measured.

This is comparison work, not a competition. Different memory systems make different tradeoffs — tool-call cost, ingest cost, graph quality, latency. We surface those tradeoffs on the same dataset so the choice is legible.

Project	Architecture (their framing)	Our run	Status	Details
Hindsight	Published an evaluation of 10 small LLMs against a structured JSON-output protocol; scored 0/N across all ten. The published note describes JSON-schema conformance, not retrieval ability.	Same 10 models, same haystack, plain-text prompting through Mazemaker retrieval.	VERIFIED 188/200 = 94.0%	read →
Letta (formerly MemGPT)	OS-style memory hierarchy with main + archival + recall channels. Public LongMemEval results in their paper.	Identical LongMemEval-S 500-question harness against Mazemaker hybrid + ColBERT @ 1.5.	QUEUED	read →
A-MEM	Agentic memory framework, paper-published. Zettelkasten-style note linking with LLM-driven evolution.	Same LongMemEval-S 500 against Mazemaker; ablate the LLM-evolution stage to isolate retrieval signal.	QUEUED	read →
Cognee	Knowledge-graph + vector hybrid. Self-described as “AI memory at scale”.	Same LongMemEval-S 500. Cognee’s LLM-graph-construction step is included — cost + latency are part of the comparison.	QUEUED	read →

Mazemaker baseline (same harness)

Every “our run” cell above is measured against this baseline:

Bench	Config	R@1	R@5	R@10	MRR	p50
LongMemEval-S 500q	hybrid, master baseline	0.8064	0.9596	0.983	0.8733	—
LongMemEval-S 500q	hybrid + ColBERT @ 1.5	0.8574	0.9787	0.9894	0.9114	56.9 ms
Comparison Bench (10 models, n=20)	hybrid + ColBERT @ 1.5, plain-text	188/200 = 94.0%				< 1 s end-to-end

Result JSONs: benchmarks/external/results/. Audit trail: benchmarks/audit/.
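The arithmetic behind the Comparison Bench row and the p50 column can be sketched as follows. The pooled score uses only the figures stated above (10 models, n=20 each, 188 correct). The helper names and the latency samples are hypothetical and made up for illustration, and p50 is assumed to mean the median per-query latency.

```python
from statistics import median
from typing import Sequence


def format_score(correct: int, total: int) -> str:
    """Render a pooled score the way the table does, e.g. '188/200 = 94.0%'."""
    return f"{correct}/{total} = {correct / total:.1%}"


def p50_ms(latencies_ms: Sequence[float]) -> float:
    """p50 latency, taken here to be the median of per-query latencies."""
    return median(latencies_ms)


# Pooled total: 10 models, 20 questions each.
models, questions_per_model = 10, 20
total_questions = models * questions_per_model
pooled = format_score(188, total_questions)
print(pooled)

# Made-up latency samples, only to show the p50 calculation.
illustrative_latencies_ms = [40.0, 50.0, 60.0, 70.0]
print(p50_ms(illustrative_latencies_ms))
```

Pooling sums correct answers over all models before dividing, so every question carries equal weight regardless of which model answered it.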

Reading rules

VERIFIED = JSON exists in benchmarks/external/results/; the run reproduces from a single command; numbers cited match the JSON.
QUEUED = methodology is locked; harness exists; no number cited until a JSON exists.
If you find a discrepancy between the page and the JSON, that’s a bug — not a marketing decision.
If another team produces a result that reverses our finding, we publish it on this page, unedited, with our re-run.
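The "numbers cited match the JSON" rule lends itself to a mechanical check. The sketch below is one way to do it, under loud assumptions: the layout of the result files is not documented on this page, so the `{"metrics": {...}}` shape, the key names, and the function name are hypothetical, and the inline JSON string is a stand-in rather than the contents of any real file.

```python
import json
from typing import Dict, List


def find_discrepancies(cited: Dict[str, float], result: dict, places: int = 4) -> List[str]:
    """Compare numbers cited on the page with a result JSON, at fixed precision."""
    measured = result.get("metrics", {})  # hypothetical schema
    problems = []
    for name, value in cited.items():
        if name not in measured:
            problems.append(f"{name}: cited on page but missing from JSON")
        elif f"{measured[name]:.{places}f}" != f"{value:.{places}f}":
            problems.append(f"{name}: page says {value}, JSON says {measured[name]}")
    return problems


# Stand-in for a result file; the real schema may differ.
stand_in_json = '{"metrics": {"R@1": 0.8574, "R@5": 0.9787, "R@10": 0.9894, "MRR": 0.9114}}'
result = json.loads(stand_in_json)

# Values as cited in the hybrid + ColBERT @ 1.5 row of the baseline table.
cited_ok = {"R@1": 0.8574, "R@5": 0.9787, "R@10": 0.9894, "MRR": 0.9114}
print(find_discrepancies(cited_ok, result))

# A deliberately wrong citation, to show what a reported bug looks like.
cited_bad = {"R@5": 0.98, "p50": 56.9}
print(find_discrepancies(cited_bad, result))
```

Comparing at a fixed number of decimal places avoids false alarms from float formatting while still catching a rounded-up or mistyped figure.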

Code: github.com/itsXactlY/mazemaker. License: AGPLv3 + PolyForm-NC. Contact: info@mazemaker.dev.