Research Tools & Code·arXiv cs.CL·1d ago

Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Researchers have built the first unified benchmark for detecting hallucinations at the span level across code, tool outputs, and structured documents, moving beyond the natural-language-only focus of prior RAG evaluation work. A fine-tuned Qwen 3.5-2B model achieves 0.689 span-F1 on the combined test set and substantially outperforms existing baselines on code-agent tasks. This matters because production AI systems increasingly ground reasoning in heterogeneous sources like repositories and CLI output, yet hallucination detection methods remain calibrated for prose. The benchmark and detector provide a foundation for building more reliable code-aware retrieval systems.
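For readers unfamiliar with the metric, here is a minimal sketch of how a span-level F1 score can be computed. The summary does not say how the paper defines span-F1, so this sketch assumes a character-overlap definition, and the names `span_f1` and `span_chars` and the example sentence are hypothetical illustrations rather than anything taken from the benchmark.

```python
from typing import List, Set, Tuple

# A span is a half-open range of character offsets [start, end) in the answer.
Span = Tuple[int, int]


def span_chars(spans: List[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_f1(predicted: List[Span], gold: List[Span]) -> float:
    """Character-overlap F1 between predicted and gold hallucination spans.

    Assumption: overlap is counted per character. The paper may use a
    different (e.g. token-level or exact-match) definition.
    """
    pred_chars = span_chars(predicted)
    gold_chars = span_chars(gold)
    if not pred_chars and not gold_chars:
        return 1.0  # nothing to flag and nothing flagged
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: the tool output reported failures, but the answer
# claims "42" tests ran and "all passed".
answer = "The test suite ran 42 tests and all passed."
gold = [(19, 21), (32, 42)]       # "42" and "all passed"
predicted = [(19, 21), (36, 42)]  # "42" and "passed" only
print(round(span_f1(predicted, gold), 3))  # 0.8
```

In this example the detector flags nothing outside the gold spans (precision 1.0) but misses "all " (recall 8/12), which is the kind of partial credit a span-level metric captures and a response-level label cannot.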

Modelwire context

Explainer

The key gap this fills is not just accuracy but scope. Prior hallucination detection benchmarks treat grounding as a prose-to-prose comparison problem, but production code agents routinely reason over CLI output, stack traces, and repository diffs, where token-level fidelity means something structurally different from what it means in natural language.

This connects directly to the FinKG-News work covered the same day, which found that automated hallucination detection remains unreliable even when inputs are structured and grounded. That paper flagged the problem in financial documents; this one attacks the same failure mode in code and tool output. Together they sketch a pattern: as AI systems move into higher-stakes, heterogeneous-source environments, the hallucination detection tooling built for prose RAG does not transfer. The MAGNET story from the same period is also relevant: its ATLAS verifier handles scene-level narrative consistency, which reinforces that domain-specific detection architectures are becoming a recurring design choice rather than an edge case.

Watch whether the benchmark gets adopted by code-agent evaluation suites such as SWE-bench within the next two quarters. If it does, the Qwen 3.5-2B detector will face real stress-testing against diverse agent trajectories, and the 0.689 span-F1 figure will either hold or reveal the limits of the training distribution.

Coverage we drew on

Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Qwen 3.5-2B · arXiv

Read full story at arXiv cs.CL (arxiv.org)
