Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Researchers have built the first unified benchmark for detecting hallucinations at the span level across code, tool outputs, and structured documents, moving beyond the natural-language-only focus of prior RAG evaluation work. A fine-tuned Qwen 3.5-2B model achieves 0.689 span-F1 on the combined test set and substantially outperforms existing baselines on code-agent tasks. This matters because production AI systems increasingly ground reasoning in heterogeneous sources like repositories and CLI output, yet hallucination detection methods remain calibrated for prose. The benchmark and detector provide a foundation for building more reliable code-aware retrieval systems.
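For readers unfamiliar with the metric, here is a minimal sketch of how a span-level F1 score can be computed. The summary does not say how the paper defines span-F1, so this assumes one common convention: character-level overlap between predicted and annotated hallucination spans. The function names and the sample tool-output sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def covered_chars(spans: List[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1(predicted: List[Span], gold: List[Span]) -> float:
    """Character-overlap F1 (an assumed definition, not confirmed by the source)."""
    pred = covered_chars(predicted)
    ref = covered_chars(gold)
    if not pred and not ref:
        return 1.0  # nothing annotated and nothing flagged
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model answer that misreports a tool's output.
answer = "The test suite ran 42 tests and 3 failed."

# Gold annotation: the unsupported claim "3 failed".
gold_start = answer.index("3 failed")
gold = [(gold_start, gold_start + len("3 failed"))]

# Detector prediction: the same claim plus the trailing period.
predicted = [(gold_start, len(answer))]

score = span_f1(predicted, gold)
print(f"span-F1 = {score:.3f}")
```

Under this convention a detector is rewarded for localizing the unsupported text precisely, not merely for flagging the whole response. That distinction matters more in code and CLI output, where a single wrong token can change the meaning of a line.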
Modelwire context
Explainer
The key gap this fills is not just accuracy but scope: prior hallucination detection benchmarks treat grounding as a prose-to-prose comparison problem, but production code agents routinely reason over CLI output, stack traces, and repository diffs, where token-level fidelity means something structurally different from what it means in natural language.
This connects directly to the FinKG-News work covered the same day, which found that automated hallucination detection remains unreliable even when inputs are structured and grounded. That paper flagged the problem in financial documents; this one attacks the same failure mode in code and tool output. Together they sketch a pattern: as AI systems move into higher-stakes, heterogeneous-source environments, the hallucination detection tooling built for prose RAG does not transfer. The MAGNET story from the same period is also relevant: there, the ATLAS verifier handles scene-level narrative consistency, which reinforces that domain-specific detection architectures are becoming a recurring design choice rather than an edge case.
Watch whether the benchmark is adopted by code-agent evaluation suites such as SWE-bench within the next two quarters. If it is, the Qwen 3.5-2B detector's results will face real stress-testing against diverse agent trajectories, and the 0.689 span-F1 figure will either hold or reveal the limits of the training distribution.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Qwen 3.5-2B · arXiv
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.