We’re introducing three audio models in the API
OpenAI has released three production audio models that materially expand real-time voice capabilities for developers. GPT-Realtime-2 brings GPT-5-class reasoning to conversational AI, enabling more complex dialogue handling. GPT-Realtime-Translate covers 70+ input languages with live output in 13 languages, addressing a long-standing localization gap. GPT-Realtime-Whisper provides streaming transcription that keeps pace with natural speech. Together, these models signal OpenAI's shift toward multimodal, low-latency inference as a core platform offering, and will likely force competitors to accelerate their own voice stacks.
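The announcement does not include integration details, but if the new models are exposed through OpenAI's existing Realtime WebSocket protocol, the developer surface would look roughly like the sketch below. The model identifier "gpt-realtime-translate" and the translation instruction are illustrative assumptions, not confirmed API values.

```python
# Minimal sketch, assuming the new audio models are served over the existing
# Realtime WebSocket protocol. The model id "gpt-realtime-translate" is a
# placeholder assumption, not a confirmed identifier.
import asyncio
import json
import os

import websockets  # pip install websockets (use extra_headers= on older releases)


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session: audio in, translated audio and text out.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "Translate incoming speech into Spanish.",
            },
        }))
        # Print server events (transcript deltas, audio chunks) as they stream in.
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))


if __name__ == "__main__":
    asyncio.run(main())
```

A real client would also stream microphone audio upward as input_audio_buffer.append events; the sketch only shows connection and session setup.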
Modelwire context
Analyst take
The more consequential detail isn't the feature count but the pricing and latency specs OpenAI hasn't fully disclosed yet. Bundling GPT-5-class reasoning into a real-time voice API is only a competitive weapon if the inference cost doesn't price out the mid-market developers OpenAI needs to lock in before rivals do.
This lands five days after xAI's Custom Voices launch (covered May 2), which lowered voice cloning to a 60-second audio input and positioned voice synthesis as a standard developer primitive. OpenAI is now responding at the inference layer rather than the synthesis layer, betting that reasoning quality in live conversation is the harder moat to replicate. The LASE paper from May 1 is also relevant here: GPT-Realtime-Translate's 70-plus input language claim will face exactly the cross-script speaker identity problems that research identified, and OpenAI hasn't addressed how it handles accent-conditional degradation at scale. Mistral's Medium 3.5 consolidation (May 1) shows the broader industry trend toward fewer, more capable unified models, and OpenAI is applying that same logic to audio.
Watch whether xAI responds by extending its Custom Voices API with real-time translation within the next 60 days. If it does, that would confirm voice is now a full-stack platform race rather than a feature differentiation play.
Coverage we drew on
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. For details on our process, see How we write it.
Mentions
OpenAI · GPT-Realtime-2 · GPT-Realtime-Translate · GPT-Realtime-Whisper · GPT-5
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on youtube.com. If you’re a publisher and want a different summarization policy for your work, see our takedown page.