Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

Retrieval-augmented generation (RAG) systems face a critical vulnerability: poisoned retrieval corpora can steer model outputs toward attacker-specified answers without obvious traces. TRACE, a new detection framework, flips the problem by using token influence attribution to identify which retrieved documents are steering predictions, then validates their actual impact on model behavior. The approach is computationally lean compared to existing defenses that layer on auxiliary classifiers or extra LLM calls. Testing across six LLMs and three QA benchmarks shows the method catches poisoning while exposing the attacker's intended target answers. This matters because RAG is becoming standard infrastructure for production LLMs, making corpus integrity a supply-chain security concern.
Modelwire context
Most RAG security work focuses on blocking malicious queries at the input layer. TRACE inverts that posture entirely, working backward from model predictions to identify which documents in the retrieval corpus are doing the steering and, crucially, what answer the attacker wanted the model to produce.
The mechanistic inspection angle here rhymes closely with the RAS paper covered the same day, which proposed SafeVec as a way to evaluate LLM safety by reading internal model representations rather than judging outputs. Both papers are pushing the same underlying argument: behavioral testing at the surface is brittle, and the more reliable signal lives inside the model's processing. TRACE applies that logic to a supply-chain threat rather than a refusal-alignment problem, but the methodological kinship is real. Where SafeVec asks whether a model's hidden states align with safe refusal directions, TRACE asks whether token influence patterns reveal a document that is quietly pulling predictions toward an attacker's target. Together they sketch a broader shift toward mechanistic auditing as a practical security primitive.
The real test is whether TRACE's attribution approach holds when attackers know the defense exists and craft poisoned documents specifically to distribute influence across many tokens. If the authors or independent groups publish adversarial robustness results against adaptive attackers within the next six months, that will determine whether this is a durable detection method or a first-mover that gets bypassed quickly.
Coverage we drew on
- RAS: Measuring LLM Safety Through Refusal Alignment · arXiv cs.CL
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.