LongMemEval — April 20, 2026
Engramia, measured honestly.
99.8% across 500 long-term-memory tasks under pre-registered thresholds — no post-hoc tuning, no self-referential calibration, and a seeded random-recall floor published alongside every run.
99.8% on the full 500-task suite
Run: April 20, 2026 · Engramia 0.6.6.dev0+gf2158879f.d20260404 · Embedding: text-embedding-3-small.
Per-dimension performance
Five dimensions that collectively define long-term memory quality for execution-memory systems.
How the benchmark works
Dataset
500 tasks across 12 agent domains: code generation, bug diagnosis, test generation, refactoring, data pipelines, API integration, infrastructure, database migration, security hardening, documentation, performance, and CI/CD.
Five dimensions
Single-hop recall (120), multi-hop reasoning (100), temporal reasoning (100), knowledge updates (100), and absent-memory detection (80). Each dimension isolates a distinct aspect of long-term memory quality.
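As a quick arithmetic check on the split above, the per-dimension counts can be summed and turned into shares of the suite. This is an illustrative sketch only: the dictionary keys and the helper name are hypothetical labels, not identifiers from the Engramia harness.

```python
# Task counts per dimension, as listed in the text. Keys are illustrative labels.
DIMENSION_TASKS = {
    "single_hop_recall": 120,
    "multi_hop_reasoning": 100,
    "temporal_reasoning": 100,
    "knowledge_updates": 100,
    "absent_memory_detection": 80,
}


def dimension_shares(counts):
    """Return each dimension's share of the total task count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL_TASKS = sum(DIMENSION_TASKS.values())
SHARES = dimension_shares(DIMENSION_TASKS)
```

The counts add up to the 500-task total, with single-hop recall making up 24% of the suite and absent-memory detection 16%, so any task-weighted aggregate does not weight the five dimensions equally.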
Pre-registered thresholds
Single-hop's pass bar is a single model-agnostic constant (0.50), frozen in source control before any run. If a different embedding model produces a different score, that difference is the signal this benchmark is meant to surface, not a reason to retune the bar. Absent-memory detection auto-calibrates its noise threshold on a held-out pool of queries that never appear in the graded evaluation set.
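The two threshold styles described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the harness's code: the constant's name, the inclusive comparison, the nearest-rank quantile rule, and every function name are hypothetical. The text specifies only the 0.50 value and that calibration uses held-out queries.

```python
import math
from typing import Sequence

# Fixed before any run. The name is hypothetical; the value 0.50 is from the text.
SINGLE_HOP_PASS_THRESHOLD = 0.50


def single_hop_passes(score: float) -> bool:
    """Compare against the fixed bar, whatever the embedding model.

    Assumption: the bar is inclusive (a score of exactly 0.50 passes).
    """
    return score >= SINGLE_HOP_PASS_THRESHOLD


def calibrate_noise_threshold(held_out_top_similarities: Sequence[float],
                              quantile: float = 0.95) -> float:
    """Derive a noise threshold from held-out queries only.

    Assumption: the threshold is a high nearest-rank quantile of the best
    similarity returned for each held-out query. The actual calibration
    rule is not described in the text.
    """
    if not held_out_top_similarities:
        raise ValueError("need at least one held-out similarity")
    ordered = sorted(held_out_top_similarities)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def is_absent(top_similarity: float, noise_threshold: float) -> bool:
    """Report 'no matching memory' when the best match is within the noise."""
    return top_similarity <= noise_threshold
```

The point of the split is that the graded queries never influence either number: one is a constant, the other is computed from a disjoint pool.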
Random-recall baseline
Every run can be invoked with --include-random-baseline. The same 500 tasks are scored against a seeded random-recall stub (36 synthetic patterns, random picks with random similarities), and per-dimension numbers are written into comparison.random_baseline in the output JSON. A dimension whose real score sits near its baseline is not measuring retrieval quality, only surface rules that a coin flip can also satisfy.
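A seeded random-recall stub of the kind described above could be sketched as follows. The class name, the recall signature, the top_k default, and the similarity range of [0, 1) are assumptions for illustration; only the 36 synthetic patterns, the random picks, the random similarities, and the seeding come from the text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RecallHit:
    pattern_id: str
    similarity: float


class RandomRecallStub:
    """Hypothetical baseline: returns random patterns and ignores the query."""

    def __init__(self, seed: int, n_patterns: int = 36):
        # A private RNG keeps the stub reproducible without touching global state.
        self._rng = random.Random(seed)
        self._patterns = [f"synthetic-{i:02d}" for i in range(n_patterns)]

    def recall(self, query: str, top_k: int = 3) -> list[RecallHit]:
        # Random picks with random similarities, sorted best-first like a real recall.
        picks = self._rng.sample(self._patterns, k=top_k)
        hits = [RecallHit(p, self._rng.random()) for p in picks]
        return sorted(hits, key=lambda h: h.similarity, reverse=True)
```

Because the stub never looks at the query, any score it earns on a dimension reflects what the grading rules give away for free, which is what makes it useful as a floor.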
Isolation
Each dimension runs against its own isolated Memory instance. No cross-contamination between dimensions. Temporary JSON storage is cleaned up after each run.
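One way to get this kind of isolation is to give each dimension a fresh temporary directory that is deleted when the dimension finishes. The sketch below assumes a caller-supplied run_dimension(name, storage_dir) callback, a hypothetical name; how the harness actually constructs its Memory instances is not shown in the text.

```python
import tempfile
from pathlib import Path


def run_isolated(dimensions, run_dimension):
    """Run each dimension against its own storage directory.

    `run_dimension(name, storage_dir)` is a hypothetical callback that builds
    a memory store under `storage_dir`, evaluates the dimension, and returns
    its result.
    """
    results = {}
    for name in dimensions:
        with tempfile.TemporaryDirectory(prefix=f"bench-{name}-") as tmp:
            results[name] = run_dimension(name, Path(tmp))
        # Leaving the `with` block removes the directory and any JSON files in it.
    return results
```

Since no directory outlives its dimension, nothing stored while evaluating one dimension can be recalled while evaluating another.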
Deterministic
Every benchmark recall call passes readonly=True so that mark_reused does not mutate success_score between queries. Two back-to-back runs with the same embedding model and random-baseline seed are bit-identical.
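The toy class below illustrates why a read-only recall matters for repeatability. It is a stand-in written for this example, not Engramia's Memory class: the ranking rule, the boost size, and the method bodies are assumptions. Only the readonly flag and the idea that reuse marking changes success_score come from the text.

```python
class ToyMemory:
    """Stand-in store whose recall can bump a reuse-based score."""

    def __init__(self, patterns):
        # Maps pattern name -> success_score.
        self.success_score = dict(patterns)

    def recall(self, query, readonly=False):
        # Rank by success_score; iterating names in sorted order makes ties stable.
        best = max(sorted(self.success_score), key=self.success_score.get)
        score = self.success_score[best]
        if not readonly:
            self._mark_reused(best)  # side effect: changes what later queries see
        return best, score

    def _mark_reused(self, pattern, boost=0.1):
        self.success_score[pattern] += boost
```

With readonly=True, repeating the same query returns the same pair every time. Without it, the second call already sees a different success_score, so results depend on query order and run history.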
Competitor comparison
Withheld. Prior pre-release numbers for Hindsight / Mem0 / Zep were produced under a different methodology that we no longer consider an apples-to-apples comparison. Once each system can be reproduced on this exact harness with its stated default configuration, its numbers will reappear here with a pointer to the raw run.
Download the full results JSON
Per-dimension breakdowns, evaluation config, dataset summary, and the exact Engramia version + embedding model used. The page above is rendered directly from this file — no curation between harness and marketing copy.
Try the memory that earns these scores.
Engramia's execution-memory layer is available today — hosted or self-hosted.