Move Over, Alexa: Amazon Launches New Realtime Voice Model Nova Sonic For Third-party Enterprise Development

Amazon is best known as an e-commerce giant. Somewhat further down its list of notable offerings is the Alexa AI voice assistant, which got a big intelligence upgrade last month thanks in part to Amazon Nova and Amazon’s investment in Anthropic.

Now Alexa will have to make space for a new Amazon voice AI sibling: today the company is introducing Amazon Nova Sonic, a new foundation model designed to let third-party app developers build real-time, naturalistic, conversational voice interactions into their products using Amazon’s web platform Bedrock.

It’s available now via a bidirectional streaming application programming interface (API). Amazon has also already incorporated some portions of it, namely a speech encoder that provides representation and a speech synthesizer, into the new Alexa model, Alexa+.

“This approach allows us to bring the benefits of our speech technologies to different use cases simultaneously while continuing to evolve both systems based on customer feedback and technological advancements,” a spokesperson told us.

Obvious use cases include customer support and service, guidance, information retrieval, and entertainment.

A unified approach

Nova Sonic addresses a key challenge in voice AI: the fragmentation of technologies.

Traditionally, building voice interfaces required combining separate models for speech recognition, language processing, and speech synthesis, Rohit Prasad, SVP and Head Scientist for Artificial General Intelligence (AGI) at Amazon, told VentureBeat yesterday in a video call interview conducted over Amazon’s Chime video service.

This complexity often results in robotic, unnatural interactions and increased development overhead.

Now, Sonic seeks to improve on this state of affairs by combining all three distinct model types into one.

Prasad explained the model’s core innovation: “Nova Sonic brings together three traditionally separate models—speech-to-text, text understanding, and text-to-speech—into one unified system that can model not just the ‘what’ but also the ‘how’ of communication.”

By retaining the acoustic context—such as tone, cadence, and style—Nova Sonic helps maintain the nuances of human conversation.

Recognizing the intricacies and quirks of live, two-way audio conversations

One of Nova Sonic’s defining capabilities is its ability to handle live, two-way conversations. It recognizes when users pause, hesitate, or interrupt—common behaviors in human speech—and responds fluidly while maintaining context.

“The real breakthrough here is real-time, interactive, low-latency voice interaction, which means you can interrupt the AI mid-sentence, and it will still maintain context and respond coherently,” said Prasad. This feature is especially relevant in scenarios like customer service, where responsiveness and adaptability are critical.

Nova Sonic is also designed to integrate seamlessly with other systems. It automatically generates transcripts of spoken input, which can be used to trigger APIs or interact with proprietary tools. This allows companies to build AI agents that can perform tasks such as booking appointments, retrieving live information, or answering complex customer inquiries.

“You can use Nova Sonic through Amazon Bedrock and connect it with any tools or proprietary data sources, even visual ones, as long as they’re wrapped as callable APIs,” said Prasad. This flexibility makes the model suitable for a wide range of industries, from education and travel to enterprise operations and entertainment.
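To make the "transcript triggers a callable API" idea concrete, here is a minimal Python sketch. It is not the Bedrock or Nova Sonic API. The tool names, the `TOOLS` registry, and the keyword matching are all hypothetical stand-ins for whatever tool selection the model performs; the only point is to show a transcript being routed to a wrapped function.

```python
from typing import Callable, Dict


# Hypothetical tools: any proprietary system "wrapped as a callable API".
def book_appointment(transcript: str) -> str:
    return f"BOOKED: request forwarded ({transcript!r})"


def lookup_score(transcript: str) -> str:
    return f"SCORE: lookup started ({transcript!r})"


# Hypothetical registry mapping a trigger keyword to a tool.
TOOLS: Dict[str, Callable[[str], str]] = {
    "book": book_appointment,
    "score": lookup_score,
}


def route_transcript(transcript: str) -> str:
    """Send a transcript to the first tool whose keyword appears in it.

    Keyword matching is an illustrative assumption; a real deployment
    would rely on the model's own tool-selection output instead.
    """
    words = transcript.lower().split()
    for keyword, tool in TOOLS.items():
        if keyword in words:
            return tool(transcript)
    return "NO_TOOL: answer conversationally"


if __name__ == "__main__":
    print(route_transcript("Please book a table for two"))
    print(route_transcript("Tell me a joke"))
```

In this sketch the transcript text is the only interface between the voice layer and the business logic, which is why wrapping proprietary systems as plain callables is enough to make them reachable.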

Benchmark performance and industry comparisons

Nova Sonic has been benchmarked against other real-time voice models, including OpenAI’s GPT-4o and Google’s Gemini Flash 2.0. On the Common Eval data set, it achieved a 69.7% win-rate over Gemini Flash 2.0 and a 51.0% win-rate over GPT-4o for American English single-turn conversations using a masculine voice. Similar gains were seen with feminine and British English voices.

Prasad emphasized Nova Sonic’s strong performance in its primary language markets: “Nova Sonic is currently best-in-class in U.S. and British English, outperforming even GPT-4o real-time in both conversational naturalness and accuracy.” He added, “To the best of our knowledge, only two other models—GPT-4o real-time and a variant of GPT-4o mini—come close to what Nova Sonic does in combining speech understanding and generation in real time. This space is still very early and very hard.”

Multilingual capabilities and noisy environment handling

In speech recognition, Nova Sonic also excels in multilingual and real-world conditions. It recorded a word error rate (WER) of 4.2% on the Multilingual LibriSpeech benchmark, outperforming GPT-4o Transcribe by over 36% across English, French, German, Italian, and Spanish. In noisy, multi-speaker environments (measured using the AMI benchmark), Nova Sonic showed a 46.7% improvement in WER over GPT-4o Transcribe.
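For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The Python sketch below computes it and also shows how a relative improvement between two WER values would be calculated. Reading the percentages quoted above as relative reductions is an assumption; the article does not say how they were derived, and the function names here are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("book a table for two", "book a table for you"))
```

Under this reading, a system that halves a baseline's WER would be described as a 50% improvement, regardless of the absolute error rates involved.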

Expressive voices and language expansion

Currently, the model supports multiple expressive voices, both masculine and feminine, in American and British English. Amazon noted that additional accents and languages are in development and will be released in future updates.

Low latency and enterprise-friendly cost

Speed and cost are also part of the appeal. Third-party benchmarking shows Nova Sonic delivers a customer-perceived latency of 1.09 seconds, compared to 1.18 seconds for OpenAI’s GPT-4o and 1.41 seconds for Google’s Gemini Flash 2.0.
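The size of those gaps is easy to check from the reported figures. The short Python sketch below computes, for each competitor, the absolute difference in seconds and how much lower Nova Sonic's latency is as a fraction of the competitor's. Only the three latency values come from the article; the helper name and structure are illustrative.

```python
from typing import Dict, Tuple

# Customer-perceived latency in seconds, as reported in the article.
LATENCIES: Dict[str, float] = {
    "Nova Sonic": 1.09,
    "GPT-4o": 1.18,
    "Gemini Flash 2.0": 1.41,
}


def latency_gaps(
    latencies: Dict[str, float], baseline: str
) -> Dict[str, Tuple[float, float]]:
    """For each other model, return (seconds saved, fraction saved)
    when using the baseline model instead of that model."""
    base = latencies[baseline]
    return {
        name: (value - base, (value - base) / value)
        for name, value in latencies.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (seconds, fraction) in latency_gaps(LATENCIES, "Nova Sonic").items():
        print(f"{name}: {seconds:.2f} s slower, Nova Sonic is {fraction:.1%} faster")
```

On these numbers the lead over GPT-4o is under a tenth of a second, while the lead over Gemini Flash 2.0 is roughly a third of a second.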

From a pricing standpoint, Amazon positions Nova Sonic as an enterprise-ready solution. “We’re nearly 80% cheaper than GPT-4o real-time, and that superior price-performance is resonating with enterprises moving from experimentation to deployment,” said Prasad.

Early adoption across sectors

According to Amazon, companies across different sectors have already begun using or testing Nova Sonic.

ASAPP is applying the technology to optimize contact center workflows, praising its accuracy and natural dialog handling.

Education First (EF) uses the model to support language learners with real-time pronunciation feedback, especially for non-native speakers with varied accents.

Sports data provider Stats Perform is leveraging Nova Sonic’s low latency and simple setup to power rapid, data-rich interactions in its Opta AI Chat platform.

Responsible AI and safety commitment

Alongside performance and cost, Amazon is highlighting its commitment to responsible AI development. The Nova family of models includes built-in safeguards and is supported by AWS AI Service Cards that outline intended use cases, potential limitations, and ethical guidelines.

Prasad underscored Amazon’s focus on trust and safety: “Trust is paramount for us—developers can customize personality within limits, but we’ve put in strong guardrails to prevent voice cloning or unwanted mimicry.” He added, “We work extremely hard to eliminate hallucinations and voice drift. The bar we’ve set for release is high because speech generation must be trustworthy.”

Amazon Nova Sonic is now generally available through Amazon Bedrock. Developers and enterprises interested in exploring the model can get started by visiting https://aws.amazon.com/nova/.
