Research Models & Releases · arXiv cs.CL · 6d ago

MEME: Multi-entity & Evolving Memory Evaluation

A new benchmark reveals a critical failure mode in LLM-based agents that operate across multiple sessions. MEME shows that memory systems collapse when they must reason about dependencies between stored facts: they score near-zero accuracy on cascade and absence tasks despite strong static retrieval. The finding matters because production agents increasingly need persistent memory across conversations, yet current architectures and optimization strategies do not close this gap, which suggests that memory design needs fundamental rethinking before deployment in real-world multi-turn environments.

Modelwire context

Explainer

The critical detail buried in the benchmark design is that MEME specifically isolates cascade and absence reasoning, meaning agents must track how one stored fact invalidates or depends on another across sessions. This is categorically different from retrieval quality, and the near-zero accuracy scores suggest current memory architectures were never actually designed to handle this class of problem.
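
To make the distinction concrete, here is a minimal sketch of what dependency-aware memory could look like, next to plain retrieval. It is an illustration only: the `ToyMemory` class, the `depends_on` field, and the example facts are all hypothetical, and the reading of "cascade" (an update to one fact makes facts derived from it stale) and "absence" (the agent should recognize it holds no valid answer) is an assumption drawn from the summary above, not from MEME's actual task format or from any memory framework's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Fact:
    key: str
    value: str
    depends_on: list[str] = field(default_factory=list)  # keys this fact relies on
    valid: bool = True


class ToyMemory:
    """Hypothetical dependency-aware store (not MEME's format, not a real API)."""

    def __init__(self) -> None:
        self.facts: dict[str, Fact] = {}

    def write(self, key: str, value: str, depends_on: tuple[str, ...] = ()) -> None:
        self.facts[key] = Fact(key, value, list(depends_on))
        # Assumed "cascade" behavior: facts derived from this key become stale.
        self._invalidate_dependents(key)

    def _invalidate_dependents(self, key: str) -> None:
        for fact in self.facts.values():
            if key in fact.depends_on and fact.valid:
                fact.valid = False
                self._invalidate_dependents(fact.key)

    def read(self, key: str) -> str | None:
        fact = self.facts.get(key)
        # Assumed "absence" behavior: unknown or stale facts yield no answer.
        if fact is None or not fact.valid:
            return None
        return fact.value


mem = ToyMemory()

# Session 1: the user states two facts, the second derived from the first.
mem.write("office", "Berlin")
mem.write("commute", "20 minutes by tram", depends_on=("office",))

# Session 2: the office changes, so the stored commute is no longer trustworthy.
mem.write("office", "Lisbon")

print(mem.read("commute"))           # None: the stale fact is withheld
print(mem.facts["commute"].value)    # "20 minutes by tram": what flat retrieval would return
print(mem.read("pet"))               # None: nothing was ever stored
```

A store that only retrieves the best-matching text would still surface the old commute with full confidence, which is the kind of gap between static retrieval and relational reasoning that the summary describes.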

Two related benchmarks published the same day, MEME and LongMemEval-V2, make this cluster worth treating as a signal rather than a coincidence. LongMemEval-V2 targets a similar gap, asking whether agents internalize environment-specific knowledge over extended interactions rather than just retrieving it on demand. Together, these two efforts suggest the evaluation community is converging on relational and temporal reasoning, not storage capacity or retrieval speed, as the next frontier for agent memory. The 'Learning, Fast and Slow' paper from the same day adds relevant architectural context: its dual-timescale framing directly addresses how parameter updates and in-context adaptation interact, which is precisely the tension MEME exposes in memory systems.

Watch whether any of the major agent memory frameworks (MemGPT, Zep, or comparable open projects) publish MEME scores within the next two quarters. If none do, that absence may itself signal that the benchmark found something practitioners would rather not measure publicly.

Coverage we drew on

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

