
ZDNET’s key takeaways
- OpenAI's Realtime API is now optimized and generally available.
- You can try its latest speech-to-speech model, gpt-realtime.
- The upgrades improve OpenAI's voice offerings for developers.
This year, AI agents that can carry out tasks on behalf of users have been a major focus, with companies constantly developing offerings that reduce the user’s workload. To make these interactions as seamless as possible, many companies are leaning on multimodal AI agents, and OpenAI is making developing these products even easier.
OpenAI updated its Realtime API on Thursday, moving it out of beta and into general availability with new features that let developers and enterprises build more reliable voice agents. The company first launched the Realtime API in public beta in October 2024. Alongside the update, OpenAI released its most advanced speech-to-speech model yet, called gpt-realtime.
"We view that voice is the next medium. People will prefer to talk, walk through exactly what they're doing, and sometimes it's just easier and more natural to convey in voice than it is to be able to do so in text," Miqdad Jaffer, who works on product at OpenAI, told ZDNET.
The releases:
Realtime API updates
What: The upgrades to the Realtime API include support for remote Model Context Protocol (MCP) servers, image inputs, and phone calling through Session Initiation Protocol (SIP), according to the release. During a livestream for the announcement, OpenAI noted that MCP is well-suited to voice commands, letting users seamlessly perform actions in connected apps.

Why it matters: These expanded capabilities should give voice agents access to more tools and more context for assisting users. AI tools are only as helpful as the information they can reach, so streamlining the process of connecting AI models to data sources is a big win for developers and users alike. Because MCP is an open standard, those connections can also be made in a way that prioritizes user data and privacy.
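To make the MCP support concrete, here is a minimal sketch of the kind of session-configuration event a developer might send over the Realtime API's WebSocket connection to attach a remote MCP server as a tool source. The field names are based on OpenAI's published Realtime API documentation at the time of writing; the server URL and label below are hypothetical placeholders, so check the official docs before relying on this shape.

```python
import json

def build_session_update(mcp_server_url: str, label: str) -> dict:
    """Build a session.update event that registers a remote MCP server.

    The model can then call tools exposed by that server mid-conversation,
    e.g. in response to a spoken request. URL and label are placeholders.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            "tools": [
                {
                    "type": "mcp",
                    "server_label": label,
                    "server_url": mcp_server_url,
                    # Auto-approve tool calls; real apps may want approval flows.
                    "require_approval": "never",
                }
            ],
        },
    }

# Inspect the event payload before sending it over the WebSocket.
event = build_session_update("https://example.com/mcp", "demo_tools")
print(json.dumps(event, indent=2))
```

In practice this JSON would be sent as a `session.update` event after opening the Realtime connection, so the voice agent gains the MCP server's tools without any custom integration code.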
A new speech-to-speech model
What: OpenAI touted its new gpt-realtime model as the company's "most advanced, production-ready voice model." Upgrades include improvements in intelligence, complex instruction following, and function calling, and the model can even switch languages mid-sentence. A demo showed how human-like the model sounds, complete with inflections conveying a wide range of emotions. It also appeared to follow instructions reliably: an OpenAI employee simulated a jailbreak attempt by contradicting the system prompt, but gpt-realtime calmly redirected and did not succumb. In the same demo, the model analyzed a photo and chatted about what it was seeing. Instruction following is one of Jaffer's favorite upgrades. "The thing that I think is most exciting is the instruction following. I think the key to being able to build with models is to be able to reliably give a set of instructions and have the model consistently follow those things out," said Jaffer. OpenAI also added two new voices, Cedar and Marin, which are exclusively available in the API.

Why it matters: A key tenet of helpful voice assistants is a model that sounds natural and can actually help with tasks. If the new model works as claimed, it will enable a better experience for users.
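As an illustration of how a developer might pick the new model and voices, here is a small sketch of a session configuration that selects gpt-realtime, chooses one of the two API-exclusive voices, and pins down a system-style instruction for the model to hold to. The exact schema lives in OpenAI's Realtime API reference; the instructions string here is purely illustrative, not from the release.

```python
import json

def voice_session(voice: str = "marin") -> dict:
    """Build a session config using gpt-realtime and a new API-only voice.

    Cedar and Marin are the two voices OpenAI says are exclusive to the API;
    the instruction string is a made-up example of the kind of guardrail the
    model is expected to follow even under jailbreak-style pushback.
    """
    if voice not in ("marin", "cedar"):
        raise ValueError("the new API-exclusive voices are 'marin' and 'cedar'")
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            "audio": {"output": {"voice": voice}},
            "instructions": "Answer billing questions only; never share account numbers.",
        },
    }

# Preview the config that would be sent at the start of a voice session.
print(json.dumps(voice_session("cedar"), indent=2))
```

The point of the `instructions` field is exactly the behavior Jaffer highlights: a production voice agent is only trustworthy if the model keeps following that string for the whole call.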
Getting started
Starting Thursday, the updated Realtime API and the new gpt-realtime model are available to all developers. Developers can test the model in the Playground and consult the Realtime API documentation before deciding whether to build with it.
When asked what developers should consider, Jaffer added, “Do what’s best for your user, and one of the things that’s best for your user is being able to interact in a modality that’s comfortable and that’s easy, and we believe voice is that future.”