LongMemEval — April 20, 2026

Engramia, measured honestly.

99.8% across 500 long-term-memory tasks under pre-registered thresholds: no post-hoc tuning, no self-referential calibration, and a seeded random-recall floor that can be published alongside any run.


Overall accuracy: 99.8%

Random baseline: n/a

Tasks evaluated: 500

Deterministic: Yes

Overall score

99.8% on the full 500-task suite

Run: April 20, 2026 · Engramia 0.6.6.dev0+gf2158879f.d20260404 · Embedding: text-embedding-3-small.

Latest run

Engramia

99.8%

overall accuracy

499/500 tasks passing

Single-hop threshold (pre-registered, model-agnostic): 0.4

Dimension breakdown

Per-dimension performance

Five dimensions that collectively define long-term memory quality for execution-memory systems.

Single-hop recall

Direct retrieval of a previously stored execution pattern. The query closely mirrors the stored pattern's task description. Uses a single pre-registered cosine-similarity threshold (0.50), frozen across embedding models so the number cannot be tuned per run.

120 tasks

99.2%

119/120
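A minimal sketch of what a fixed-threshold pass check could look like, in plain Python. Only the 0.50 constant comes from the description above; the function names and the choice of >= rather than > are assumptions for illustration, not Engramia's harness code.

```python
import math

# Pre-registered constant quoted in the single-hop description.
SINGLE_HOP_THRESHOLD = 0.50


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_hop_passes(query_vec, pattern_vec, threshold=SINGLE_HOP_THRESHOLD):
    """Hypothetical pass rule: similarity must reach the frozen threshold."""
    return cosine_similarity(query_vec, pattern_vec) >= threshold
```

Because the threshold is a module-level constant rather than a per-model setting, swapping the embedding model changes the score and leaves the bar where it was.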

Multi-hop reasoning

Tasks requiring the agent to combine two distinct stored patterns from different domains. Both must appear in the top-5 recall results.

100 tasks

100.0%

100/100
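The multi-hop rule (both patterns in the top-5) can be sketched as a set check. The function name and the idea of identifying patterns by string IDs are assumptions made for this example.

```python
def multi_hop_passes(ranked_ids, required_ids, k=5):
    """True when every required pattern ID appears among the first k results.

    ranked_ids: recall results, best match first (hypothetical string IDs).
    required_ids: the two stored patterns the task needs to combine.
    """
    top_k = set(ranked_ids[:k])
    return set(required_ids) <= top_k
```

A task where one pattern ranks second and the other ranks sixth fails under this rule, even though both were recalled somewhere in the list.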

Temporal reasoning

Queries asking for the most recent version of a pattern. Runs with eval_weighted=False — the earlier harness conflated 'most recent' with 'highest success_score', which reduced to a tautology. Pass rule: top-1 contains the v3 marker AND has the maximum stored timestamp among the returned matches.

100 tasks

100.0%

100/100
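The temporal pass rule has two conditions that must both hold, which a short sketch makes concrete. The Match type and its field names are hypothetical stand-ins for whatever the harness actually returns.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical shape of one recall result.
    text: str
    timestamp: float


def temporal_passes(matches, marker="v3"):
    """Top-1 must carry the marker AND hold the newest timestamp returned."""
    if not matches:
        return False
    top = matches[0]
    newest = max(m.timestamp for m in matches)
    return marker in top.text and top.timestamp == newest
```

Checking the timestamp separately from the marker is what keeps the rule from collapsing back into "highest score wins".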

Knowledge updates

Memory contains three quality tiers per domain (eval scores 6.2 / 7.8 / 9.1). Tests whether eval_weighted=True reliably surfaces the best known version. The random-recall baseline shows a ~40% floor here by construction; discrimination is the gap above that floor.

100 tasks

100.0%

100/100

Absent-memory detection

Tasks outside every stored domain. The noise-similarity threshold is auto-calibrated on a held-out pool (NOISE_CALIBRATION_POOL) strictly disjoint from the evaluation queries — an earlier version sampled both from the same list and produced a trivial 100%.

80 tasks

100.0%

80/80
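One way to keep calibration honest is to refuse to run when the two query pools overlap. The sketch below assumes the threshold is the highest similarity seen on the calibration pool plus a small margin; the text does not say how Engramia derives it, so the rule, the margin, and all function names here are illustrative only.

```python
def assert_disjoint(calibration_queries, eval_queries):
    """Fail loudly if any query appears in both pools."""
    overlap = set(calibration_queries) & set(eval_queries)
    if overlap:
        raise ValueError(f"calibration and evaluation pools overlap: {sorted(overlap)}")


def calibrate_noise_threshold(calibration_similarities, margin=0.02):
    """Assumed rule: highest noise similarity on the held-out pool, plus a margin."""
    if not calibration_similarities:
        raise ValueError("calibration pool produced no similarities")
    return max(calibration_similarities) + margin


def is_absent(best_similarity, noise_threshold):
    """A query counts as 'no memory' when its best match stays within the noise band."""
    return best_similarity <= noise_threshold
```

Sampling both pools from the same list would make every evaluation query pass by construction, which is the trivial 100% described above.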

Methodology

How the benchmark works

Dataset

500 tasks across 12 agent domains: code generation, bug diagnosis, test generation, refactoring, data pipelines, API integration, infrastructure, database migration, security hardening, documentation, performance, and CI/CD.

Five dimensions

Single-hop recall (120), multi-hop reasoning (100), temporal reasoning (100), knowledge updates (100), and absent-memory detection (80). Each dimension isolates a distinct aspect of long-term memory quality.
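The headline figure follows directly from the per-dimension counts on this page, as a few lines of arithmetic show. The dictionary keys are illustrative labels, not the field names of the results JSON.

```python
# (passed, total) per dimension, copied from the breakdown on this page.
RESULTS = {
    "single_hop_recall": (119, 120),
    "multi_hop_reasoning": (100, 100),
    "temporal_reasoning": (100, 100),
    "knowledge_updates": (100, 100),
    "absent_memory_detection": (80, 80),
}


def accuracy(passed, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


total_passed = sum(passed for passed, _ in RESULTS.values())
total_tasks = sum(total for _, total in RESULTS.values())
overall = accuracy(total_passed, total_tasks)
```

The totals come to 499 of 500 tasks, and the single failing task sits in single-hop recall.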

Pre-registered thresholds

Single-hop's pass bar is a single model-agnostic constant (0.50) frozen in source control before any run — a different embedding model producing a different score is the signal this benchmark is trying to surface, not a knob to tune. Absent-memory detection auto-calibrates its noise threshold on a held-out pool of queries that never appears in the graded evaluation set.

Random-recall baseline

Every run can be invoked with --include-random-baseline. The same 500 tasks are scored against a seeded random-recall stub (36 synthetic patterns, random picks with random similarities), and per-dimension numbers are written into comparison.random_baseline in the output JSON. A dimension whose real score is near the baseline is not measuring retrieval quality — just surface rules a coin-flip can also satisfy.
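A seeded random-recall stub of the kind described could be sketched as follows. The 36-pattern pool size comes from the text; the function name, the result shape, and the top-5 limit are assumptions for this example.

```python
import random


def random_recall(rng, n_patterns=36, limit=5):
    """Return `limit` random patterns with random similarities, best first.

    rng: a random.Random instance, so the seed fully determines the output.
    """
    picks = rng.sample(range(n_patterns), limit)
    results = [(f"pattern-{i}", rng.random()) for i in picks]
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling random_recall(random.Random(42)) twice returns the same list, which is what lets the baseline be rerun and compared against the real score for each dimension.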

Isolation

Each dimension runs against its own isolated Memory instance. No cross-contamination between dimensions. Temporary JSON storage is cleaned up after each run.

Deterministic

Every benchmark recall call passes readonly=True so that mark_reused does not mutate success_score between queries. Two back-to-back runs with the same embedding model and random-baseline seed are bit-identical.
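Anyone rerunning the harness can check the bit-identical claim by hashing the two output files. This is a generic sketch using only the standard library; the function names are not part of Engramia.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def runs_bit_identical(run_a: bytes, run_b: bytes) -> bool:
    """True when two result payloads are byte-for-byte the same."""
    return sha256_hex(run_a) == sha256_hex(run_b)
```

In practice the two arguments would be the raw bytes of each results file, for example read with pathlib.Path(...).read_bytes().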

Competitor comparison

Withheld. Prior pre-release numbers for Hindsight / Mem0 / Zep were produced under a different methodology that we no longer consider an apples-to-apples comparison. Once each system can be reproduced on this exact harness, with its stated default configuration, its numbers will reappear here with a pointer to the raw run.

Raw data

Download the full results JSON

Per-dimension breakdowns, evaluation config, dataset summary, and the exact Engramia version + embedding model used. The page above is rendered directly from this file — no curation between harness and marketing copy.


Try the memory that earns these scores.

Engramia's execution-memory layer is available today — hosted or self-hosted.
