MEME: Multi-entity & Evolving Memory Evaluation

Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh

Abstract

LLM-based agents in persistent environments must store, update, and reason over information across sessions. Prior benchmarks evaluate only single-entity updates. MEME defines six tasks spanning the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Across six memory systems on 100 controlled episodes, all systems collapse on dependency reasoning (average accuracy of 3% on Cascade and 1% on Absence) despite adequate static retrieval. Prompt optimisation, deeper retrieval, and stronger LLMs do not close this gap. Only a file-based agent paired with Claude Opus 4.7 partially closes it, and at roughly 70x the baseline cost.

Publication

arXiv 2026

Links

arXiv PDF Project