OpenAI Expands into Next-Gen Audio AI With Three New Models

OpenAI API launches realtime audio for AI agents. Credit: OpenAI
OpenAI has launched three new audio AI models for developers to build real-time voice agents that can listen, reason, respond and act naturally.

OpenAI has released three audio models designed to handle real-time voice interactions. GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper could enable software systems to process spoken requests and respond whilst conversations are still taking place.

The models target developers building applications where users need to communicate by voice rather than text. This could include scenarios where typing is impractical or where live translation and transcription are required.

Voice models with reasoning

GPT-Realtime-2 is the first voice model from OpenAI to include reasoning capabilities from its GPT-5-class architecture. The system can reason through a request without breaking the flow of the conversation.

It handles interruptions and corrections during live interactions. The model calls tools and adjusts responses based on the context of the conversation as it unfolds.

The system could allow applications to move from simple voice commands to interactions where software reasons through multi-step requests. Developers are building voice-to-action patterns where systems complete tasks after interpreting spoken instructions.

A system-to-voice mode enables applications to convert content into spoken output. Users could search for travel options conversationally whilst the system manages connected tasks like rebooking hotels after flight changes.
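The voice-to-action pattern described above can be illustrated with a minimal sketch. It assumes the voice model has already turned a spoken instruction into a list of structured tool calls; the registry, the tool names (`rebook_flight`, `rebook_hotel`) and the call format are all hypothetical and do not reflect OpenAI's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: names and behaviour are illustrative only.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under a tool name the model could request."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("rebook_flight")
def rebook_flight(booking_id: str, new_date: str) -> str:
    return f"Flight {booking_id} moved to {new_date}"


@tool("rebook_hotel")
def rebook_hotel(booking_id: str, new_date: str) -> str:
    return f"Hotel {booking_id} moved to {new_date}"


def run_tool_calls(calls: List[dict]) -> List[str]:
    """Execute each requested tool call in order and collect the results."""
    results = []
    for call in calls:
        fn = TOOLS.get(call["name"])
        if fn is None:
            # Report unknown tools instead of failing the whole request.
            results.append(f"Unknown tool: {call['name']}")
            continue
        results.append(fn(**call["arguments"]))
    return results


# Stand-in for what a voice model might emit after hearing
# "My flight moved to Friday, shift the hotel too." (assumed format)
calls = [
    {"name": "rebook_flight",
     "arguments": {"booking_id": "F123", "new_date": "2025-06-06"}},
    {"name": "rebook_hotel",
     "arguments": {"booking_id": "H456", "new_date": "2025-06-06"}},
]

for line in run_tool_calls(calls):
    print(line)
```

In a real application the results would be passed back to the model so it could confirm the changes to the user by voice.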

Translation across 70 languages

GPT-Realtime-Translate translates speech from more than 70 input languages into 13 output languages, and does so whilst speakers continue talking.

Three ways to build with voice AI. Credit: OpenAI

Deutsche Telekom is building customer support systems where users speak in their preferred language and the model translates the conversation in real time. The approach targets support operations, education platforms and media services with international audiences.

Vimeo uses the model to translate product education videos as they play. Global customers can hear content in their chosen language without waiting for separately produced versions.

Cobus Kok, VP AI Experiences at Priceline, says: "GPT-Realtime-2 stood out for how well it handles complex requests, coordinates multiple tool calls at once, and keeps the interaction feeling natural."

"For Penny, Priceline’s AI travel agent, that translates into quicker, more practical support by voice – especially when travellers need to adjust plans in real time."

Live transcription and workflow

GPT-Realtime-Whisper converts speech to text as speakers talk. The streaming transcription model operates with low latency.

The system could generate captions for broadcasts or produce summaries while meetings are in progress. Teams can integrate live speech into business workflows without waiting for recordings to finish processing.
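As a rough illustration of the captioning use case, the sketch below groups words arriving from a streaming transcript into short caption lines. The word stream here is a plain Python list standing in for live transcription output; the function name `caption_lines` and the `max_chars` limit are assumptions for the example, not part of any OpenAI interface.

```python
from typing import Iterable, Iterator


def caption_lines(words: Iterable[str], max_chars: int = 32) -> Iterator[str]:
    """Group incoming words into caption lines no longer than max_chars.

    A single word longer than max_chars is emitted on its own line.
    """
    line = ""
    for word in words:
        candidate = f"{line} {word}".strip()
        if line and len(candidate) > max_chars:
            # The current line is full: emit it and start a new one.
            yield line
            line = word
        else:
            line = candidate
    if line:
        # Flush whatever is left when the stream ends.
        yield line


# Stand-in for words arriving one at a time from a live transcript.
stream = "the quarterly results will be published on friday morning".split()

for caption in caption_lines(stream, max_chars=24):
    print(caption)
```

Because the function is a generator, each caption line is emitted as soon as it fills up rather than after the speaker has finished.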

Prateek Sachan, Co-Founder and CTO at BolnaAI, says: "Building voice AI for India means handling diverse regional phonetics. GPT-Realtime-Translate delivered 12.5% lower word error rates than any other model we tested."

OpenAI has designed the models specifically for applications where voice is the primary interface. The company expects developers to build experiences where users can speak naturally whilst software handles tasks in real time across various use cases.

The Realtime API includes multiple layers of controls to prevent misuse. Active classifiers can halt sessions that violate content guidelines during live interactions.

Usage policies prohibit distributing outputs for spam or deceptive purposes, though it falls to developers to ensure end users know when they are interacting with AI systems.
