GPT-Realtime-2 brings GPT-5-class reasoning to live voice. A separate translation model covers 70+ input languages. A streaming Whisper variant handles transcription. The pricing is aggressive enough to make comparisons with the incumbent voice stack unavoidable.
OpenAI released three new voice models in its API, broadening the range of surfaces where developers can plug GPT-class reasoning into live audio.
The three are GPT-Realtime-2, a successor to the company’s existing realtime voice model with what OpenAI describes as GPT-5-class reasoning; GPT-Realtime-Translate, a live translation model with more than 70 input and 13 output languages; and GPT-Realtime-Whisper, a streaming speech-to-text model built for low-latency transcription.
The release lands in the middle of a voice-AI build-out that the rest of the industry has spent the past year staffing for. Enterprises that have shipped voice agents have done so on a stack of stitched-together components: Whisper or Deepgram for transcription, ElevenLabs or Cartesia for text-to-speech, GPT-4 or Claude for the reasoning step, and bespoke turn-taking and barge-in logic in the middle.
What OpenAI is offering with GPT-Realtime-2 is a single model that handles audio in and audio out, with reasoning happening inside the audio loop rather than between transcription and synthesis steps.
What is actually new?
GPT-Realtime-2 picks up several capabilities that production voice teams have been simulating with prompt scaffolding. Preambles let an agent say “let me check that” while it calls tools, so users do not sit through silence.
Parallel tool calls let the model fire multiple back-end requests simultaneously and narrate which one is in flight. Recovery behaviour catches failures and surfaces them rather than freezing the conversation.
The model can adjust tone deliberately: calmer for support cases, more upbeat for confirmations.
Two underlying numbers carry most of the weight. The context window is now 128K, up from 32K, which makes longer sessions and complex agentic flows feasible without external state stitching.
Reasoning effort is exposed as a knob: minimal, low, medium, high, and xhigh, with low set as the default to keep latency tight.
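In practice that knob would be set per session. A minimal sketch of what selecting an effort level could look like follows; the `session.update` event shape, the `reasoning_effort` field name, and the model identifier are assumptions based on the description above, not confirmed API surface:

```python
import json

# Hypothetical session-configuration payload for a realtime WebSocket
# session. Field names and the model identifier are assumptions for
# illustration, not documented API surface.
EFFORT_LEVELS = {"minimal", "low", "medium", "high", "xhigh"}

def build_session_update(effort: str = "low") -> str:
    """Serialise a session update selecting a reasoning-effort level.

    "low" is the stated default, chosen to keep latency tight.
    """
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return json.dumps({
        "type": "session.update",          # assumed event name
        "session": {
            "model": "gpt-realtime-2",     # assumed identifier
            "reasoning_effort": effort,
        },
    })

payload = build_session_update("high")
```

The point of exposing effort as a per-session setting is the latency trade-off: a support agent might run at "low" for snappy turn-taking and escalate to "high" only for calls that need multi-step tool use.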
On OpenAI’s own benchmarks, GPT-Realtime-2 scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, the company’s audio-reasoning benchmark, at high effort, and 13.8% higher on Audio MultiChallenge, which measures instruction following, at xhigh effort. Customer benchmarks are sharper.
Zillow reports a 26-point lift in call-success rate on its hardest adversarial benchmark, from 69% on the prior model to 95% on GPT-Realtime-2. BolnaAI, a voice-AI company building for Indian languages, reports 12.5% lower word error rates on Hindi, Tamil and Telugu using the translation model.
GPT-Realtime-2 is priced at $32 per million audio-input tokens, $0.40 per million cached input tokens, and $64 per million audio-output tokens. GPT-Realtime-Translate is priced at $0.034 per minute. GPT-Realtime-Whisper is priced at $0.017 per minute.
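To make the token pricing concrete, a rough per-call estimate can be sketched. The tokens-per-minute figure below is an assumption for illustration; actual audio token rates depend on the codec and the model's tokenizer:

```python
# Rough cost estimate for a GPT-Realtime-2 call at the listed rates.
# TOKENS_PER_MIN is an assumed figure for illustration only; real
# audio-token rates vary with codec and tokenizer.
INPUT_PER_M = 32.00    # $ per 1M audio-input tokens (uncached)
OUTPUT_PER_M = 64.00   # $ per 1M audio-output tokens
TOKENS_PER_MIN = 800   # assumed audio tokens per minute of speech

def call_cost(input_minutes: float, output_minutes: float) -> float:
    """Dollar cost of one call, assuming no cached input."""
    inp = input_minutes * TOKENS_PER_MIN * INPUT_PER_M / 1_000_000
    out = output_minutes * TOKENS_PER_MIN * OUTPUT_PER_M / 1_000_000
    return inp + out

# A five-minute call where each side speaks half the time:
cost = call_cost(input_minutes=2.5, output_minutes=2.5)
```

Under these assumptions a five-minute realtime call lands around 19 cents, against a flat 17 cents for five minutes of GPT-Realtime-Translate at $0.034 per minute, which is why the per-minute models read as the aggressive end of the price card.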
Translation pricing is the line that puts the rest of the industry on alert. At just over three cents per minute, GPT-Realtime-Translate undercuts the per-minute pricing on most enterprise translation pipelines by a wide margin, and bundles latency and language coverage that cost-conscious deployments have historically had to compromise on. Whisper streaming at half that price is similarly aggressive.
ElevenLabs, the most-funded pure-play voice company in the market, prices its voice agents on a per-minute model that bundles synthesis with model inference.
The arithmetic for buyers gets harder when OpenAI’s bundled model is also doing the reasoning. Deepgram, which sells the streaming-transcription primitive directly, faces a similar squeeze on the Whisper-streaming side.
OpenAI’s launch list reads like a product-marketing version of the voice-agent customer landscape: Zillow, Glean, Genspark, Bluejay, Intercom, Priceline and Foundation Health for the realtime model; BolnaAI, Vimeo and Deutsche Telekom for translation.
None of the three models removes the build work around guardrails, evaluation, escalation and analytics that voice agents need before they go live.
OpenAI ships active classifiers and EU data residency, but the integration burden of compliance, brand voice and tool-call observability stays with the developer.
The competitive question is which platform reduces that burden fastest, and OpenAI’s bet is that doing the audio reasoning inside one model is more defensible than stitching three vendors together.
Whether ElevenLabs, Deepgram and the rest can hold their wedge depends on how quickly they push their own integrated stacks. ElevenLabs’ Series D in February at an $11bn valuation was raised explicitly on the agent thesis; Deepgram has been moving in the same direction. The next quarter is the first time the comparison will be made on production workloads rather than on demos.
For now, the immediate test is a Playground tab and an SDK call away. The price card and the benchmarks suggest OpenAI is not waiting.