How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Researchers analyzed token consumption across eight frontier LLMs running agentic coding tasks on SWE-bench Verified, finding agentic workflows burn 1000x more tokens than traditional code reasoning. The study also evaluates whether models can predict their own token costs before execution, offering practical insights for teams deploying cost-sensitive AI agents.
Modelwire context
Explainer
The more underreported finding is the self-prediction angle: the paper tests whether models can estimate their own token consumption before a task runs, which is a prerequisite for any meaningful cost-gating or budget-aware orchestration layer. That capability gap, not just the raw consumption numbers, is what determines whether agentic systems can be deployed responsibly at scale.
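To make the cost-gating idea concrete, here is a minimal sketch of a pre-execution budget check. It is not taken from the paper: the names `gate_task` and `GateDecision`, the `safety_factor` padding, and the token figures in the usage note are all hypothetical, and the sketch assumes some predictor (the model itself or an external estimator) has already produced `predicted_tokens`.

```python
import math
from dataclasses import dataclass


@dataclass
class GateDecision:
    allowed: bool          # whether the agent run may start
    padded_estimate: int   # prediction after applying the safety margin
    reason: str


def gate_task(predicted_tokens: int, budget_tokens: int,
              safety_factor: float = 1.5) -> GateDecision:
    """Decide whether to launch an agent run from a pre-execution estimate.

    safety_factor is a hypothetical margin that pads the prediction,
    since self-predicted token counts may be unreliable.
    """
    if predicted_tokens < 0 or budget_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be at least 1.0")

    # Round the padded estimate up so the gate errs on the cautious side.
    padded = math.ceil(predicted_tokens * safety_factor)
    if padded <= budget_tokens:
        return GateDecision(True, padded, "within budget")
    return GateDecision(False, padded, "padded estimate exceeds budget")
```

With these illustrative numbers, `gate_task(100_000, 200_000)` allows the run, while `gate_task(100_000, 120_000)` blocks it because the padded estimate of 150,000 tokens exceeds the budget. The usefulness of such a gate depends entirely on how accurate the upstream prediction is, which is the capability the paper evaluates.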
This is largely disconnected from recent activity in our archive, as we have no prior coverage to anchor it to. It belongs to a growing body of work on agentic infrastructure costs, sitting alongside practical concerns about SWE-bench Verified as a proxy for real-world deployment. The benchmark itself has attracted scrutiny for rewarding brute-force context use, and this paper implicitly confirms that concern by showing how aggressively frontier models consume tokens when given agentic freedom. The 1000x multiplier reframes cost not as a pricing footnote but as an architectural constraint teams need to design around from the start.
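A rough back-of-the-envelope calculation shows why a 1000x multiplier behaves like a design constraint rather than a rounding error. Only the 1000x figure comes from the summary above; the baseline token count and the per-million-token price below are hypothetical placeholders, not numbers from the paper or from any provider's price list.

```python
def run_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a single run at a flat per-token price (a simplifying assumption)."""
    return tokens * usd_per_million_tokens / 1_000_000


# Hypothetical inputs, chosen only to illustrate the scaling.
BASELINE_TOKENS = 2_000        # assumed size of a non-agentic code-reasoning call
AGENTIC_MULTIPLIER = 1_000     # the multiplier reported in the summary
PRICE_USD_PER_MILLION = 3.0    # assumed flat price per million tokens

baseline_cost = run_cost_usd(BASELINE_TOKENS, PRICE_USD_PER_MILLION)
agentic_cost = run_cost_usd(BASELINE_TOKENS * AGENTIC_MULTIPLIER,
                            PRICE_USD_PER_MILLION)

print(f"baseline run: ${baseline_cost:.3f}")
print(f"agentic run:  ${agentic_cost:.2f}")
```

Under these assumptions a call that costs well under a cent becomes a run that costs several dollars, so per-task budgets, retries, and concurrency limits start to dominate the bill once tasks are run at volume.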
Watch whether any of the labs behind the eight frontier models studied respond with model-level cost-prediction APIs or pre-execution token estimation features within the next two quarters. If they do, this paper will have functioned as direct product pressure; if not, the cost-opacity problem stays with the orchestration layer, and third-party tooling will have to fill the gap.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: SWE-bench Verified
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.