Tool Calling is Linearly Readable and Steerable in Language Models

Researchers have discovered that tool selection in language models operates through linearly separable activation patterns, enabling both prediction and intervention. By measuring the difference in internal activations between tools, they can steer model behavior to switch tool choices at 77-100% accuracy across multiple architectures, with downstream JSON arguments automatically conforming to the new tool's schema. This finding has immediate practical value for safety: activation gaps between top tool candidates correlate with error likelihood, potentially allowing systems to flag uncertain decisions before execution. The work spans 12 instruction-tuned models from 270M to 27B parameters, suggesting the phenomenon is robust across scale.
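To make the "difference in internal activations" idea concrete, here is a minimal sketch on toy data. It is not the paper's code: the three-dimensional vectors, the tool names, the nearest-class-mean readout, and the `alpha` scale are all assumptions chosen for illustration. It shows how a difference-of-means direction between two tools can both support a linear readout of the chosen tool and shift an activation toward the other tool.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steering_direction(acts_source, acts_target):
    """Difference of class means, pointing from the source tool to the target tool."""
    mu_s = mean_vector(acts_source)
    mu_t = mean_vector(acts_target)
    return [t - s for s, t in zip(mu_s, mu_t)]

def tool_scores(activation, tool_means):
    """Linear score per tool; the highest score is the nearest class mean."""
    return {
        name: dot(activation, mu) - 0.5 * dot(mu, mu)
        for name, mu in tool_means.items()
    }

def predict_tool(activation, tool_means):
    scores = tool_scores(activation, tool_means)
    return max(scores, key=scores.get)

def steer(activation, direction, alpha=1.0):
    """Add the scaled steering direction to an activation vector."""
    return [h + alpha * d for h, d in zip(activation, direction)]

# Toy activations (hypothetical): three samples per tool, three dimensions each.
search_acts = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.2], [1.1, 0.0, -0.2]]
calc_acts = [[-1.0, 0.2, 0.1], [-0.9, 0.0, -0.1], [-1.1, -0.2, 0.0]]

tool_means = {
    "search": mean_vector(search_acts),
    "calculator": mean_vector(calc_acts),
}
direction = steering_direction(search_acts, calc_acts)

h = [0.95, 0.05, 0.1]  # an activation the readout assigns to "search"
before = predict_tool(h, tool_means)
after = predict_tool(steer(h, direction), tool_means)
print(before, "->", after)
```

In a real model the vectors would be hidden states captured at a chosen layer and token position, and the steered activation would be written back during the forward pass. Which layer, position, and scale work best is an empirical question this sketch does not address.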

Modelwire context

Explainer

The more consequential finding may not be the steering accuracy itself but the downstream conformance: when researchers redirect a model to a different tool mid-generation, the JSON arguments it produces automatically reshape to fit the new tool's schema, without any additional instruction. That suggests tool selection and argument generation are more tightly coupled in representational space than the modular framing of most agentic frameworks assumes.

This sits directly alongside two threads in recent Modelwire coverage. The piece on 'Mechanistic Interpretability Must Disclose Identification Assumptions' (also from May 8) is immediately relevant: that paper warns against treating activation-level interventions as causal evidence without stating identification assumptions, and this tool-calling work is precisely the kind of study that disclosure norm targets. The susceptibilities paper from the same day adds another angle, showing that activation steering can validate interpretability claims in RL settings. Together, the three papers form an informal cluster around a shared question: when we intervene on model internals and behavior changes, what have we actually proven?

Watch whether any of the four model families tested (Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1) ship a production safety feature citing activation gap monitoring within the next two release cycles. If that happens, this moves from interpretability research into deployment infrastructure; if it doesn't, the gap between lab finding and engineering adoption remains the real story.
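For a sense of what activation gap monitoring could look like in practice, the sketch below flags a tool decision for review when the gap between the top two candidate scores is small. It is a hypothetical illustration, not a shipped feature or the paper's method: the scores are assumed to come from some linear readout of the model's activations, and the `min_gap` threshold is an arbitrary placeholder that would need calibration against observed error rates.

```python
def review_decision(scores, min_gap=0.5):
    """Rank tool scores and flag the decision when the top-two gap is small.

    scores:  dict mapping tool name -> readout score (assumed to be given)
    min_gap: hypothetical threshold below which the decision is flagged
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidate tools to compute a gap")
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    gap = scores[top] - scores[runner_up]
    return {
        "tool": top,
        "runner_up": runner_up,
        "gap": gap,
        "flagged": gap < min_gap,
    }

# Toy scores (hypothetical): one clear decision and one close call.
confident = review_decision({"search": 2.0, "calculator": 0.25, "calendar": -1.0})
uncertain = review_decision({"search": 0.75, "calculator": 0.5, "calendar": -1.0})

print(confident)
print(uncertain)
```

A flagged decision could be routed to a confirmation step or a fallback before the tool call executes. The summary above reports only that the gap correlates with error likelihood, so any fixed threshold is a design choice rather than a result.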

Coverage we drew on

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Gemma 3 · Qwen 3 · Qwen 2.5 · Llama 3.1

Read the full story at arXiv cs.LG (arxiv.org)
