From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Researchers have developed a dataset-agnostic method to convert text-based tool-calling benchmarks into audio evaluations by applying text-to-speech, speaker variation, and noise injection while preserving the original annotations. Testing across seven multimodal models reveals significant performance divergence: Gemini 3.1 Flash Live leads on Confetti (70.4%) while GPT Realtime 1.5 dominates When2Call (71.9%). This work addresses a critical gap in voice agent evaluation, where real-world deployment demands reliable tool use from speech but existing benchmarks remain text-centric. The framework's model- and task-dependent results suggest voice agents require specialized tuning beyond text capabilities, signaling that the audio modality introduces distinct failure modes practitioners must account for in production systems.
Modelwire context
Explainer
The headline finding isn't that voice agents underperform text agents (that's expected) but that no single model leads across both benchmarks, which suggests the audio modality exposes task-specific weaknesses rather than a uniform capability deficit. The framework's value is reproducibility: by preserving the original annotations through the TTS conversion pipeline, other teams can audit their own models without building new datasets from scratch.
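The paper's pipeline is only described at a high level above, so the sketch below is an illustration rather than the authors' implementation: `synthesize` is a placeholder for whatever TTS engine gets plugged in, and the voice names, SNR levels, and field names (`user_text`, `expected_tool_calls`) are invented for the example. The point of the structure is that the tool-call annotations pass through the conversion untouched.

```python
import numpy as np

# Illustrative speaker and noise settings; the paper's actual values are not given here.
SPEAKER_VOICES = ["voice_a", "voice_b", "voice_c"]
SNR_DB_LEVELS = [20.0, 10.0, 5.0]

def synthesize(text: str, voice: str, sr: int = 16000) -> np.ndarray:
    """Placeholder TTS: swap in a real engine here.
    Returns a mono float waveform; this stub returns silence so the sketch runs."""
    return np.zeros(sr * max(1, len(text) // 15), dtype=np.float32)

def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Inject white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return (wave + noise).astype(np.float32)

def convert_example(example: dict, rng: np.random.Generator) -> list[dict]:
    """Turn one text benchmark item into audio variants.
    The tool-call annotations are copied through verbatim, which is what
    keeps the audio benchmark auditable against the original text one."""
    variants = []
    for voice in SPEAKER_VOICES:
        clean = synthesize(example["user_text"], voice)
        for snr in SNR_DB_LEVELS:
            variants.append({
                "audio": add_noise(clean, snr, rng),
                "voice": voice,
                "snr_db": snr,
                "expected_tool_calls": example["expected_tool_calls"],  # preserved unchanged
            })
    return variants

rng = np.random.default_rng(0)
item = {"user_text": "Book a table for two at 7pm tonight",
        "expected_tool_calls": [{"name": "book_table",
                                 "args": {"party_size": 2, "time": "19:00"}}]}
audio_variants = convert_example(item, rng)
print(len(audio_variants))  # 3 voices x 3 SNR levels = 9 variants
```

Keeping the annotations verbatim is what makes scores on the audio variants directly comparable to the original text benchmark, which is the reproducibility claim the Explainer above is pointing at.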
This connects directly to the MemEye coverage from the same day, which flagged a parallel problem in multimodal evaluation: benchmarks that let systems route around the modality being tested, answering visual questions through text shortcuts. Both papers are essentially arguing that evaluation infrastructure has lagged behind deployment reality. The 'Talk is Not Cheap' taxonomy audit reinforces the same structural concern from a security angle, showing that benchmark coverage gaps create false confidence. Taken together, these three papers from a single day suggest the field is in an active reckoning with whether its evaluation stack actually measures what production systems need to do.
Watch whether the audio versions of Confetti and When2Call get adopted by model providers in their own technical reports over the next two release cycles. If Google or OpenAI cite these specific numbers in product announcements, that signals the framework has achieved the standardization its authors were aiming for.
Coverage we drew on
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Gemini 3.1 Flash Live · GPT Realtime 1.5 · Confetti · When2Call
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.