Research Tools & Code · arXiv cs.LG · 15h ago

TraceLab: Characterizing Coding Agent Workloads for LLM Serving

Researchers have released TraceLab, a dataset of 4,300 real coding-agent sessions capturing 350,000 LLM steps and 430,000 tool calls from production use of Claude Code and Codex. The trace reveals structural patterns in agentic workloads: extended autonomous reasoning loops, long context windows paired with sparse outputs, and heavily skewed tool-call distributions. The release addresses a gap in LLM serving infrastructure research, where public benchmarks have lacked authentic multi-agent, multi-model usage data. For infrastructure teams and inference-optimization researchers, the dataset enables workload-aware system design and exposes why generic serving assumptions fail for coding agents.
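To make two of those patterns concrete, here is a minimal Python sketch of how one might quantify "long context, sparse output" and tool-call skew over a session. Everything in it is an assumption for illustration: the record fields (`input_tokens`, `output_tokens`, `tool_calls`), the tool names, and the numbers are made up and do not reflect TraceLab's actual schema or results.

```python
from collections import Counter

# Hypothetical step records. Field names, tool names, and values are
# illustrative only; they are NOT taken from the TraceLab dataset.
steps = [
    {"input_tokens": 42000, "output_tokens": 180, "tool_calls": ["read_file", "grep"]},
    {"input_tokens": 43500, "output_tokens": 95, "tool_calls": ["read_file"]},
    {"input_tokens": 45000, "output_tokens": 310, "tool_calls": ["bash", "read_file"]},
    {"input_tokens": 46200, "output_tokens": 60, "tool_calls": []},
]


def input_output_ratio(steps):
    """Total input tokens divided by total output tokens across all steps."""
    total_in = sum(s["input_tokens"] for s in steps)
    total_out = sum(s["output_tokens"] for s in steps)
    # No output at all means the ratio is unbounded.
    return total_in / total_out if total_out else float("inf")


def top_tool_share(steps, k=1):
    """Fraction of all tool calls accounted for by the k most-used tools."""
    counts = Counter(tool for s in steps for tool in s["tool_calls"])
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


ratio = input_output_ratio(steps)
share = top_tool_share(steps, k=1)
print(f"input/output token ratio: {ratio:.1f}")
print(f"share of calls going to the most-used tool: {share:.0%}")
```

A high input/output ratio is one way to express the prefill-heavy shape the summary describes, and a large top-k share is one simple measure of skew. Which metrics the TraceLab authors actually report is not stated in the summary above.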

Modelwire context

Analyst take

The more consequential detail in the methodology is that the 4,300 sessions came from production use, not controlled lab conditions, meaning the skewed tool-call distributions and sparse output patterns reflect actual developer behavior rather than benchmark-optimized trajectories. That distinction matters for anyone trying to build serving infrastructure that holds up under real load.

TraceLab lands alongside a cluster of work this week that collectively reframes what 'coding agent' means in practice. The SWE-INTERACT benchmark piece from the same day argues that existing evaluations miss iterative, user-driven workflows entirely, and TraceLab's production traces would be a natural complement to that testbed: one supplies realistic workload structure, the other supplies realistic task dynamics. Meanwhile, the Agents-A1 scaling paper showed that 45K-token trajectories are becoming operationally relevant, which makes TraceLab's finding about extended autonomous reasoning loops and long context windows look less like an edge case and more like the new baseline that infrastructure teams need to plan around.

Watch whether serving infrastructure teams at major cloud providers cite TraceLab in updated latency or batching benchmarks within the next two quarters. If they do, it signals the dataset has become a shared reference point rather than a one-off academic contribution.

Coverage we drew on

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Claude Code · Codex · TraceLab · Anthropic
