Research · Products & Apps · arXiv cs.CL · Jun 24

Real-Time Voice AI Hears but Does Not Listen

A systematic evaluation of four production voice AI systems reveals a critical gap between perception and decision-making. GPT Realtime 2, Gemini 3.1 Flash Live, Qwen3.5 Omni Plus, and Qwen3.5 Omni Flash all detect emotional subtext such as distress, fear, and sarcasm when directly queried, yet consistently ignore those signals when executing consequential actions such as call termination, fund transfers, and enrollment. The disconnect points to an architectural flaw in how real-time voice systems weight linguistic content over paralinguistic cues, and it raises urgent questions about safety guardrails in production systems that handle sensitive transactions and vulnerable users.

Modelwire context

Explainer

The finding isn't simply that voice AI ignores emotion. These systems demonstrably possess the relevant perceptual capacity and still fail to route it into decision logic, which means the problem is architectural prioritization, not missing capability. That distinction shapes the fix: the work lies in connecting signals the system already detects to its action logic, not in adding a new perceptual capability.
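To make the routing idea concrete, here is a minimal sketch of a gate that feeds detected paralinguistic signals into the decision about a consequential action. Everything in it is an assumption for illustration: the action names, the signal labels, the `Perception` type, the `gate_action` function, and the 0.6 threshold are hypothetical and come from neither the paper nor any vendor API.

```python
from dataclasses import dataclass

# Hypothetical labels, chosen to mirror the actions and cues named in the summary.
CONSEQUENTIAL_ACTIONS = {"end_call", "transfer_funds", "enroll"}
RISK_SIGNALS = {"distress", "fear", "sarcasm"}


@dataclass(frozen=True)
class Perception:
    """What the system perceived on one turn (assumed structure)."""
    transcript: str
    signals: dict[str, float]  # signal name -> confidence in [0, 1]


def gate_action(action: str, perception: Perception, threshold: float = 0.6) -> str:
    """Return "execute" or "confirm" for a requested action.

    Non-consequential actions always execute. A consequential action is
    downgraded to "confirm" when any risk signal meets the threshold, so the
    perceived cue actually influences what the system does.
    """
    if action not in CONSEQUENTIAL_ACTIONS:
        return "execute"
    flagged = [
        name
        for name, score in perception.signals.items()
        if name in RISK_SIGNALS and score >= threshold
    ]
    return "confirm" if flagged else "execute"
```

The design choice here is deliberately conservative: the gate never blocks an action outright, it only asks for confirmation when the words and the perceived tone disagree. A real system would need a calibrated threshold and a policy for what confirmation looks like.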

This connects loosely to the RevengeBench work published the same day (arXiv cs.LG, 2026-06-24), which frames the challenge of reconstructing hidden decision logic from observed behavior. That paper treats opaque decision-making as an inverse problem worth solving from the outside. The voice AI paper essentially confirms why that kind of external probing is necessary: even when you can query a system directly and get accurate self-reports about what it perceives, those perceptions may not be wired into the actions the system actually takes. The gap between what a model 'knows' and what it 'does' is a recurring structural problem across both papers, even if their domains differ.
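The gap between self-report and action can be stated as a simple rate: among trials where the model named the signal when asked, how often did it still take the consequential action? The sketch below is one way to compute that from outside the system. The `Trial` record and `perception_action_gap` function are hypothetical, and this is not the paper's metric, which the summary does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    """One probe of a voice system (assumed structure, not from the paper)."""
    reported_signal: bool  # the model named the emotional cue when asked directly
    acted_anyway: bool     # the model still executed the consequential action


def perception_action_gap(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of perceived-signal trials in which the action was still taken.

    Returns None when no trial has a reported signal, because the rate is
    undefined in that case.
    """
    perceived = [t for t in trials if t.reported_signal]
    if not perceived:
        return None
    return sum(t.acted_anyway for t in perceived) / len(perceived)
```

A rate near 1.0 would match the pattern described here, in which perception is present but not wired into action. A rate near 0.0 would mean the perceived signal reliably changes behavior.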

Watch whether OpenAI or Google publish updated safety documentation for GPT Realtime 2 or Gemini 3.1 Flash Live within the next two quarters that specifically addresses paralinguistic signal weighting in action-triggering pipelines. Silence from both would suggest the architectural fix is harder than a policy update.

Coverage we drew on

RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OpenAI GPT Realtime 2 · Google Gemini 3.1 Flash Live · Alibaba Qwen3.5 Omni Plus · Alibaba Qwen3.5 Omni Flash

