Engramia leads on long-term memory recall.
Independent evaluation across 500 tasks and five memory-quality dimensions: 93.4% overall, ahead of Hindsight's published 91.4% and well ahead of other alternatives.
Head-to-head comparison
All systems evaluated on the same 500-task LongMemEval dataset. Run: April 7, 2026. Embedding: text-embedding-3-small.
Per-dimension performance
Five dimensions that collectively define long-term memory quality for execution-memory systems.
Detailed comparison table
Exact accuracy figures for all four systems across every benchmark dimension.
| Dimension | Engramia | Hindsight | Mem0 | Zep |
|---|---|---|---|---|
| Single-hop recall (120 tasks) | 96.7% | 94.2% | 88.3% | 83.3% |
| Multi-hop reasoning (100 tasks) | 91.0% | 89.0% | 76.0% | 70.0% |
| Temporal reasoning (100 tasks) | 93.0% | 92.0% | 83.0% | 77.0% |
| Knowledge updates (100 tasks) | 94.0% | 91.0% | 83.0% | 79.0% |
| Absent-memory detection (80 tasks) | 91.3% | 90.0% | 78.8% | 78.8% |
| Overall (500 tasks) | 93.4% | 91.4% | 82.2% | 77.8% |
Hindsight's score is sourced from its published blog post (Q1 2026). Mem0 and Zep were evaluated via their public APIs under identical conditions in April 2026.
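The overall row is simply the task-weighted average of the per-dimension accuracies. A quick sanity check in Python, with the figures copied from the table above:

```python
# Per-dimension (task count, accuracy %) for Engramia, from the table above.
results = {
    "single_hop": (120, 96.7),
    "multi_hop": (100, 91.0),
    "temporal": (100, 93.0),
    "updates": (100, 94.0),
    "absent": (80, 91.3),
}

total_tasks = sum(n for n, _ in results.values())
overall = sum(n * acc for n, acc in results.values()) / total_tasks
print(f"{total_tasks} tasks, {overall:.1f}% overall")  # 500 tasks, 93.4% overall
```

The same weighting reproduces the Hindsight, Mem0, and Zep totals from their rows.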
Memory improves rapidly with more patterns
Engramia success rate as the number of stored patterns grows from 0 to 36 (3 per domain). Cold-start baseline is 5.5%; steady state reaches 93.4%.
After just 12 patterns (1 per domain), Engramia achieves 87.7% — most of the long-run gain arrives within the first dozen stored patterns.
How the benchmark works
Dataset
500 tasks across 12 agent domains: code generation, bug diagnosis, test generation, refactoring, data pipelines, API integration, infrastructure, database migration, security hardening, documentation, performance, and CI/CD.
Five dimensions
Single-hop recall (120), multi-hop reasoning (100), temporal reasoning (100), knowledge updates (100), and absent-memory detection (80). Each dimension isolates a distinct aspect of long-term memory quality.
Auto-calibration
Similarity thresholds are computed from the data — not hardcoded. Intra-domain vs. cross-domain similarity distributions set the recall threshold automatically, ensuring reproducibility across embedding models.
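An illustrative sketch of the calibration idea (not Engramia's actual code): derive the recall threshold from the observed similarity distributions rather than hardcoding it. Here the threshold is the midpoint between the mean intra-domain and mean cross-domain similarity; the sample values are hypothetical.

```python
from statistics import mean

def calibrate_threshold(intra_sims, cross_sims):
    """Pick a recall threshold separating intra- from cross-domain pairs.

    Midpoint between the two distribution means -- a deliberately simple
    stand-in for whatever separation criterion the benchmark actually uses.
    """
    return (mean(intra_sims) + mean(cross_sims)) / 2

# Hypothetical cosine-similarity samples from a given embedding model.
intra = [0.82, 0.78, 0.85, 0.80]   # same-domain pairs
cross = [0.31, 0.40, 0.28, 0.35]   # cross-domain pairs

threshold = calibrate_threshold(intra, cross)
print(round(threshold, 3))
```

Because the threshold is recomputed from each embedding model's own similarity distributions, swapping models shifts the threshold automatically instead of invalidating a hardcoded constant.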
Isolation
Each dimension runs against its own isolated Memory instance. No cross-contamination between dimensions. Temporary JSON storage is cleaned up after each run.
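A minimal sketch of this isolation pattern (the class and method names are illustrative, not Engramia's API): each dimension gets its own store backed by a temporary JSON file that is deleted when the run finishes.

```python
import json
import tempfile
from pathlib import Path

class IsolatedMemory:
    """A throwaway memory instance backed by a temporary JSON file."""

    def __enter__(self):
        self._dir = tempfile.TemporaryDirectory()
        self.path = Path(self._dir.name) / "memory.json"
        self.path.write_text("{}")
        return self

    def store(self, key, value):
        data = json.loads(self.path.read_text())
        data[key] = value
        self.path.write_text(json.dumps(data))

    def recall(self, key):
        return json.loads(self.path.read_text()).get(key)

    def __exit__(self, *exc):
        # Deletes the temp directory and its JSON file, so nothing
        # stored during one dimension can leak into the next.
        self._dir.cleanup()

# One isolated instance per benchmark dimension:
with IsolatedMemory() as mem:
    mem.store("fact", "user prefers pytest over unittest")
    print(mem.recall("fact"))
```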
Embedding model
Published results use text-embedding-3-small (OpenAI, 1536 dimensions). The benchmark can be reproduced locally with all-MiniLM-L6-v2 (no API key required). Results differ by ≤ 2%.
Reproducibility
Deterministic given the same embedding model and dataset. No LLM calls in the evaluation path. Raw JSON results file is published alongside this page.
Download the full results JSON
Per-dimension breakdowns, calibration parameters, improvement curve data, and competitor results in machine-readable format. File: benchmarks/results/longmemeval_2026-04-07.json
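A hypothetical reader for the published file, using only the standard library. The schema shown here is an assumption for illustration; adjust the keys to match the JSON you download.

```python
import json

# Stand-in for the contents of benchmarks/results/longmemeval_2026-04-07.json;
# the real file's schema may differ.
raw = """
{
  "run_date": "2026-04-07",
  "dimensions": {
    "single_hop": {"tasks": 120, "accuracy": 0.967},
    "multi_hop": {"tasks": 100, "accuracy": 0.910}
  }
}
"""

results = json.loads(raw)
print(results["run_date"])
for name, d in results["dimensions"].items():
    print(f"{name}: {d['accuracy']:.1%} over {d['tasks']} tasks")
```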
Try the memory that earns these scores.
Engramia's execution-memory layer is available today — hosted or self-hosted.