
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents


Research agents built on LLMs routinely cite sources in synthesized reports, but those citations go largely unverified, creating a credibility gap between apparent rigor and actual accuracy. This paper introduces the first systematic framework for extracting and validating inline citations from model-generated markdown at scale: AST parsing pulls each inline citation and its surrounding claim from the report, the cited source content is then retrieved, and consistency between the claim and its reference is measured. The work addresses a critical blind spot in production AI systems: while RAG improves factuality, it doesn't guarantee that cited sources are accessible, relevant, or actually support the claims attributed to them. For teams deploying research agents or evaluating LLM outputs, this framework offers a reproducible method to audit citation integrity and expose hallucinated or mismatched attributions.
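To make the pipeline concrete, here is a minimal sketch of the extraction and accessibility-check steps, assuming a markdown-it-py token stream for AST parsing and requests for URL checks. The library choice, function names, and the stubbed claim/context pairing are illustrative assumptions, not the paper's actual toolchain, and the consistency-scoring step is omitted.

```python
# Sketch: extract inline citation links from model-generated markdown via an
# AST/token parse, then run a cheap accessibility check on each cited URL.
# Illustrative only; the paper's own pipeline and scoring are not reproduced here.
import requests
from markdown_it import MarkdownIt


def extract_citations(markdown_text: str) -> list[dict]:
    """Return cited URLs with their anchor text and surrounding inline context."""
    md = MarkdownIt()
    citations = []
    for token in md.parse(markdown_text):
        if token.type != "inline" or not token.children:
            continue
        current_url = None
        anchor_parts = []
        for child in token.children:
            if child.type == "link_open":
                current_url = child.attrGet("href")
                anchor_parts = []
            elif child.type == "text" and current_url is not None:
                anchor_parts.append(child.content)
            elif child.type == "link_close" and current_url is not None:
                citations.append({
                    "url": current_url,
                    "anchor": "".join(anchor_parts),
                    # The enclosing inline content stands in for the "claim"
                    # that a consistency check would compare against the source.
                    "context": token.content,
                })
                current_url = None
    return citations


def is_accessible(url: str, timeout: float = 10.0) -> bool:
    """Does the cited source still resolve? (HEAD request, follows redirects.)"""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False


if __name__ == "__main__":
    report = (
        "Attention scales quadratically with sequence length "
        "([Vaswani et al.](https://arxiv.org/abs/1706.03762))."
    )
    for cite in extract_citations(report):
        print(cite["url"], "accessible:", is_accessible(cite["url"]))
```

In a full audit, each extracted claim/URL pair would be followed by fetching the source text and scoring claim support, which is where the paper's consistency measurement comes in.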

Modelwire context

Explainer

The paper's sharpest contribution isn't the citation-checking itself but the scale argument: it establishes that inline citation validation can be automated across entire report corpora, not just spot-checked by humans, which changes what 'auditing' an agent's output actually means operationally.

This connects directly to two threads in recent coverage. The SIRA retrieval agent piece from May 7 showed how agents can improve what they retrieve, but retrieving better sources doesn't fix the downstream problem this paper targets: whether claims are actually supported by whatever was retrieved. Earlier, the RAG medical chatbot security audit from May 1 surfaced how production RAG systems carry risks that builders underestimate. Citation integrity is a quieter version of the same gap: the system looks rigorous because it produces references, but the references may not hold up. Together these three stories sketch a pattern where RAG's surface credibility consistently outpaces its verified reliability.

Watch whether any of the major research agent platforms (Perplexity, OpenAI Deep Research, or similar) adopt or respond to this framework's methodology within the next two quarters. Adoption would signal the industry treating citation integrity as a product requirement rather than an academic concern.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

Mentions: LLMs · RAG · AST parser · deep research agents


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
