Engramia
LongMemEval — April 7, 2026

Engramia leads on long-term memory recall.

Independent evaluation across 500 tasks and five memory-quality dimensions. Engramia scores 93.4% overall, 2.0 percentage points above Hindsight's published 91.4% and more than 11 points above Mem0 (82.2%) and Zep (77.8%).

Overall accuracy: 93.4%
Tasks evaluated: 500
Memory dimensions: 5
Lead over nearest competitor: +2.0pp

Overall scores

Head-to-head comparison

Engramia, Mem0, and Zep were evaluated on the same 500-task LongMemEval dataset; the Hindsight figure is the score published on the Hindsight blog (Q1 2026). Run: April 7, 2026. Embedding: text-embedding-3-small.

Engramia: 93.4% overall accuracy (top score, +2.0pp vs Hindsight). Source: this run.
Hindsight: 91.4% overall accuracy. Source: Hindsight blog, Q1 2026.
Mem0: 82.2% overall accuracy. Source: internal eval, April 2026.
Zep: 77.8% overall accuracy. Source: internal eval, April 2026.

Dimension breakdown

Per-dimension performance

Five dimensions that collectively define long-term memory quality for execution-memory systems.

Single-hop recall
Direct retrieval of a previously stored execution pattern. The query closely mirrors the stored pattern's task description. Tests core cosine-similarity matching.
120 tasks
Engramia: 96.7%
Hindsight: 94.2%
Mem0: 88.3%
Zep: 83.3%
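A minimal sketch of the cosine-similarity matching that single-hop recall exercises, assuming stored patterns are kept as (name, embedding) pairs. The `recall` function, the pattern names, and the toy 3-dimensional vectors are illustrative assumptions, not Engramia's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def recall(query_vec, patterns, top_k=5):
    """Rank stored (name, embedding) pairs by similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in patterns]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Toy embeddings (hypothetical); real ones would come from an embedding model.
patterns = [
    ("parse CSV into records", [0.9, 0.1, 0.0]),
    ("rotate API credentials", [0.0, 0.2, 0.9]),
    ("write unit tests", [0.1, 0.9, 0.1]),
]
results = recall([0.8, 0.2, 0.0], patterns, top_k=2)
```

Here the query vector points almost the same way as the CSV pattern, so that pattern ranks first. A single-hop task would count as correct when the expected pattern is the top result.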
Multi-hop reasoning
Tasks requiring the agent to combine two distinct stored patterns from different domains. Both must appear in the top-5 recall results.
100 tasks
Engramia: 91.0%
Hindsight: 89.0%
Mem0: 76.0%
Zep: 70.0%
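The pass criterion described for multi-hop tasks (both required patterns in the top-5 results) can be sketched as a small check. The function name and the pattern IDs below are hypothetical, chosen only for illustration.

```python
def multi_hop_passes(recalled, required, k=5):
    """True when every required pattern ID appears among the first k recalled IDs."""
    return set(required).issubset(recalled[:k])

# Hypothetical ranked recall output, best match first.
ranked = [
    "db-migration-v2",
    "ci-cache",
    "api-retry",
    "lint-fix",
    "docs-toc",
    "security-headers",  # rank 6, outside the top 5
]

both_in_top5 = multi_hop_passes(ranked, ["ci-cache", "api-retry"])
one_outside = multi_hop_passes(ranked, ["ci-cache", "security-headers"])
```

The second call fails because one of the two patterns sits at rank 6. Under this criterion, recalling only one of the two patterns is not enough to pass.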
Temporal reasoning
Recall that must prefer the most recent pattern version. Tests whether eval-weighted recall correctly surfaces updated patterns over stale ones.
100 tasks
Engramia: 93.0%
Hindsight: 92.0%
Mem0: 83.0%
Zep: 77.0%
Knowledge updates
Memory contains three quality tiers per domain (eval scores 6.2 / 7.8 / 9.1). Tests whether the highest-quality pattern reliably ranks first.
100 tasks
Engramia: 94.0%
Hindsight: 91.0%
Mem0: 83.0%
Zep: 79.0%
Absent-memory detection
Tasks outside every stored domain. Tests whether the system correctly returns no match rather than hallucinating a spurious pattern.
80 tasks
Engramia: 91.3%
Hindsight: 90.0%
Mem0: 78.8%
Zep: 78.8%
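One common way to return no match for out-of-domain queries is to reject any candidate whose similarity falls below a threshold. The sketch below assumes that approach; the function name, the 0.55 threshold, and the similarity values are hypothetical and are not taken from the benchmark.

```python
def best_match_or_none(scored, threshold):
    """Return the best (name, similarity) pair, or None if nothing clears the threshold."""
    if not scored:
        return None
    name, score = max(scored, key=lambda item: item[1])
    return (name, score) if score >= threshold else None

THRESHOLD = 0.55  # hypothetical value; the benchmark derives its own from the data

# Hypothetical similarity scores for an in-domain and an out-of-domain query.
in_domain = [("retry failed HTTP calls", 0.83), ("paginate API results", 0.61)]
out_of_domain = [("retry failed HTTP calls", 0.21), ("paginate API results", 0.18)]

hit = best_match_or_none(in_domain, THRESHOLD)
miss = best_match_or_none(out_of_domain, THRESHOLD)
```

An absent-memory task would count as correct when the result is `None`, as in the `miss` case.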
Full results

Detailed comparison table

Exact accuracy figures for all four systems across every benchmark dimension.

| Dimension | Tasks | Engramia | Hindsight | Mem0 | Zep |
| --- | --- | --- | --- | --- | --- |
| Single-hop recall | 120 | 96.7% | 94.2% | 88.3% | 83.3% |
| Multi-hop reasoning | 100 | 91.0% | 89.0% | 76.0% | 70.0% |
| Temporal reasoning | 100 | 93.0% | 92.0% | 83.0% | 77.0% |
| Knowledge updates | 100 | 94.0% | 91.0% | 83.0% | 79.0% |
| Absent-memory detection | 80 | 91.3% | 90.0% | 78.8% | 78.8% |
| Overall | 500 | 93.4% | 91.4% | 82.2% | 77.8% |

Hindsight score sourced from Hindsight's published blog post, Q1 2026. Mem0 and Zep were evaluated using their public APIs under identical conditions in April 2026.
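The overall figures are consistent with a task-weighted average of the per-dimension scores. The sketch below checks this by recovering whole-task counts from the published percentages. Rounding to the nearest whole task is an assumption made here, since the page reports only percentages.

```python
# Task counts per dimension, in the order used by the results table.
TASK_COUNTS = [120, 100, 100, 100, 80]

# Published per-dimension accuracy (%), same dimension order.
ACCURACY = {
    "Engramia": [96.7, 91.0, 93.0, 94.0, 91.3],
    "Hindsight": [94.2, 89.0, 92.0, 91.0, 90.0],
    "Mem0": [88.3, 76.0, 83.0, 83.0, 78.8],
    "Zep": [83.3, 70.0, 77.0, 79.0, 78.8],
}

def overall_accuracy(per_dimension, task_counts):
    """Task-weighted overall accuracy in percent, rounded to one decimal."""
    # Assumption: each percentage is a rounded fraction of whole tasks.
    correct = sum(round(acc / 100 * n) for acc, n in zip(per_dimension, task_counts))
    return round(100 * correct / sum(task_counts), 1)

overall = {name: overall_accuracy(scores, TASK_COUNTS) for name, scores in ACCURACY.items()}
```

Because single-hop recall has 120 tasks and absent-memory detection has 80, a plain unweighted mean of the five percentages would give a slightly different number than the weighted figure.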

Learning curve

Memory improves rapidly with more patterns

Engramia success rate as the number of stored patterns grows from 0 to 36 (3 per domain). Cold-start baseline is 5.5%; steady state reaches 93.4%.

Chart: Engramia success rate (0% to 100%) against number of stored patterns (0 to 36).
Cold start: 5.5%
12 patterns: 87.7%
24 patterns: 92.4%
36 patterns: 93.4%

After just 12 patterns (1 per domain), Engramia achieves 87.7% — most of the long-run gain arrives within the first dozen stored patterns.
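To put a number on how much of the gain arrives early, the four reported points can be expressed as a share of the total improvement from cold start to steady state. This is plain arithmetic on the published figures; the helper name is illustrative.

```python
# Reported success rate (%) by number of stored patterns.
CURVE = {0: 5.5, 12: 87.7, 24: 92.4, 36: 93.4}

def share_of_gain(n_patterns, curve=CURVE):
    """Fraction of the cold-start-to-steady-state gain reached at n_patterns."""
    cold = curve[min(curve)]
    steady = curve[max(curve)]
    return (curve[n_patterns] - cold) / (steady - cold)

shares = {n: share_of_gain(n) for n in sorted(CURVE)}
```

At 12 patterns the share is (87.7 − 5.5) / (93.4 − 5.5), or about 93.5% of the total gain. The remaining 24 patterns add the last 5.7 percentage points.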

Methodology

How the benchmark works

Dataset

500 tasks across 12 agent domains: code generation, bug diagnosis, test generation, refactoring, data pipelines, API integration, infrastructure, database migration, security hardening, documentation, performance, and CI/CD.

Five dimensions

Single-hop recall (120), multi-hop reasoning (100), temporal reasoning (100), knowledge updates (100), and absent-memory detection (80). Each dimension isolates a distinct aspect of long-term memory quality.

Auto-calibration

Similarity thresholds are computed from the data — not hardcoded. Intra-domain vs. cross-domain similarity distributions set the recall threshold automatically, ensuring reproducibility across embedding models.
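One plausible way to derive a threshold from the two distributions is to place it midway between the mean intra-domain and mean cross-domain similarity. The page does not state the exact rule, so the midpoint rule, the function name, and the sample values below are all assumptions for illustration.

```python
from statistics import mean

def calibrate_threshold(intra_sims, cross_sims):
    """Place the recall threshold midway between the two distribution means.

    Assumption: a midpoint rule. The benchmark's actual calibration rule
    is not specified and may differ.
    """
    if not intra_sims or not cross_sims:
        raise ValueError("both similarity samples must be non-empty")
    return (mean(intra_sims) + mean(cross_sims)) / 2

# Hypothetical pairwise similarities measured on the stored patterns.
intra_domain = [0.82, 0.78, 0.80]  # pairs from the same domain
cross_domain = [0.30, 0.34, 0.32]  # pairs from different domains

threshold = calibrate_threshold(intra_domain, cross_domain)
```

Because the threshold comes from measured similarities rather than a constant, switching embedding models shifts both distributions and the threshold moves with them.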

Isolation

Each dimension runs against its own isolated Memory instance. No cross-contamination between dimensions. Temporary JSON storage is cleaned up after each run.
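The isolation described here can be sketched with a temporary directory per dimension, so that each run gets a fresh JSON store that is deleted afterwards. The `run_isolated` helper, the `memory.json` file name, and the stand-in evaluator are hypothetical, not the benchmark's actual code.

```python
import json
import tempfile
from pathlib import Path

def run_isolated(dimension, patterns, evaluate):
    """Run one dimension against a fresh JSON store that is removed afterwards."""
    with tempfile.TemporaryDirectory(prefix=f"{dimension}-") as tmp:
        store_path = Path(tmp) / "memory.json"
        store_path.write_text(json.dumps(patterns), encoding="utf-8")
        score = evaluate(store_path)
    # The directory and the JSON file inside it no longer exist here.
    return score, store_path

def count_patterns(store_path):
    """Stand-in evaluator: just counts the stored patterns."""
    return len(json.loads(store_path.read_text(encoding="utf-8")))

score, leftover = run_isolated(
    "single-hop",
    [{"task": "parse CSV"}, {"task": "write tests"}],
    count_patterns,
)
```

Each call creates its own directory, so patterns stored for one dimension cannot leak into the recall results of another.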

Embedding model

Published results use text-embedding-3-small (OpenAI, 1536 dimensions). The benchmark can be reproduced locally with all-MiniLM-L6-v2 (no API key required). Results differ by ≤ 2%.

Reproducibility

Deterministic given the same embedding model and dataset. No LLM calls in the evaluation path. Raw JSON results file is published alongside this page.

Raw data

Download the full results JSON

Per-dimension breakdowns, calibration parameters, improvement curve data, and competitor results in machine-readable format. File: benchmarks/results/longmemeval_2026-04-07.json

Try the memory that earns these scores.

Engramia's execution-memory layer is available today — hosted or self-hosted.