Modelwire

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering


Researchers have identified a critical vulnerability in LLM-based legal systems: models fail when statutory law evolves beyond their training data, either by applying outdated rules or by over-weighting recent provisions regardless of temporal relevance. A new benchmark of 312 German statutory QA pairs tests how GPT, Claude, and DeepSeek handle temporal reasoning across vanilla, web-search, and retrieval-augmented inference modes. The work exposes a fundamental mismatch between static parametric knowledge and dynamic legal systems, and it pushes practitioners to rethink deployment strategies for high-stakes domains where legal accuracy depends on knowing which version of a rule applies to a given fact pattern.

Modelwire context

Explainer

The paper's sharpest finding is not only that models get outdated law wrong but that they also fail in the opposite direction, over-weighting recent statutory text even when an older version governs the facts at hand. Because the failure runs in both directions, retrieval augmentation alone cannot fix the problem; it needs a layer of temporal reasoning about which version is legally operative on a given date.
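To make the idea of "legally operative on a given date" concrete, here is a minimal Python sketch of a point-in-time lookup over versioned statute text. It is an illustration under stated assumptions, not the paper's method: the `StatuteVersion` record, the `operative_version` helper, the section label, and the placeholder wording are all hypothetical, and real statutes add complications (transitional provisions, retroactive amendments) that this ignores.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class StatuteVersion:
    """One dated version of a statutory section (hypothetical schema)."""
    section: str
    valid_from: date           # first day this wording is in force
    valid_to: Optional[date]   # last day in force; None means still in force
    text: str


def operative_version(
    versions: Iterable[StatuteVersion], section: str, fact_date: date
) -> Optional[StatuteVersion]:
    """Return the version of `section` in force on `fact_date`, or None."""
    for v in versions:
        if v.section != section:
            continue
        still_valid = v.valid_to is None or fact_date <= v.valid_to
        if v.valid_from <= fact_date and still_valid:
            return v
    return None


# Placeholder data: two versions of an invented section, not real law.
versions = [
    StatuteVersion("§ 1", date(2015, 1, 1), date(2019, 12, 31), "old wording (placeholder)"),
    StatuteVersion("§ 1", date(2020, 1, 1), None, "new wording (placeholder)"),
]

# Facts dated 2018 fall under the older version even though a newer one exists.
old = operative_version(versions, "§ 1", date(2018, 6, 1))
# Facts dated 2021 fall under the current version.
new = operative_version(versions, "§ 1", date(2021, 6, 1))
```

In a retrieval pipeline, a filter of this kind would sit between retrieval and generation, so the model only sees the version that matches the date of the facts rather than whichever version ranks highest or is most recent.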

The compliance and high-stakes data angle connects loosely to the Structure-Guided Entity Resolution work published the same day, which flagged a parallel theme: general LLM capability breaks down when domain constraints are rigid and errors are consequential. Both papers are essentially arguing that production deployment in regulated contexts requires more than prompting a capable base model. Where SGER addressed structural ambiguity in identity data, this paper addresses temporal ambiguity in legal corpora. Neither problem yields to off-the-shelf RAG pipelines without domain-specific scaffolding.

Watch whether the labs behind any of the three tested model families (GPT, Claude, DeepSeek) release legal-domain system cards or retrieval configurations that explicitly address version-dated statutory lookup. If none do within the next two quarters, third-party legal AI auditors will likely adopt this benchmark before the labs respond directly.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OpenAI · Anthropic · DeepSeek · GPT · Claude


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
