While it is still early days, voice AI is one area where artificial intelligence promises both cost savings and service improvements.
Deepgram is a decade-old company that quickly saw AI’s potential in voice. It is developing voice AI for enterprise use cases like call centers and interactive voice response (IVR) systems that millions access every day. To date, Deepgram has processed more than 50,000 years of audio and transcribed more than one trillion words.
Deepgram offers speech-to-text (STT), text-to-speech (TTS), and full speech-to-speech (STS) capabilities backed by an enterprise-grade runtime. More than 200,000 developers build on Deepgram’s voice-native foundational models, which are accessed through cloud APIs or as self-hosted, on-premises deployments.
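For developers, getting a transcript back is a single HTTPS request. The sketch below is a minimal pre-recorded transcription call against Deepgram’s hosted /v1/listen endpoint as publicly documented; the model name and parameters are illustrative, and the DEEPGRAM_API_KEY environment variable is an assumption of this sketch.

```python
# Minimal sketch: transcribe a local audio file with Deepgram's hosted
# speech-to-text API. Assumes DEEPGRAM_API_KEY is set in the environment;
# the model name and query parameters are illustrative.
import os
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(path: str) -> str:
    """Send a local WAV file for transcription and return the transcript."""
    with open(path, "rb") as audio:
        response = requests.post(
            DEEPGRAM_URL,
            params={"model": "nova-2", "smart_format": "true"},
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=audio,
        )
    response.raise_for_status()
    # The first channel's first alternative holds the best transcript.
    body = response.json()
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]

if __name__ == "__main__":
    print(transcribe("call_recording.wav"))
```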
Voice AI is a massive opportunity
VP of Product Natalie Rutgers said more than 700 million customer service calls happen daily. Add more than 300 billion business calls, 75 million-plus drive-through orders, and north of 35 million medical appointments, and you have ample opportunities for voice AI to improve processes and free employees for higher-level tasks. Drive-throughs alone are a billion-dollar market.
“Why are customers coming to us?” Rutgers asked. “They’re often coming to us for the things that are the biggest efficiency burns on their business. In the drive-through space, the CTO of Jack in the Box said that integrating voice agents is going to be one of the most impactful initiatives for their business operations over the next five years.”
Yes, AI will take jobs away, but in some cases, they’re jobs that are hard to fill. Call centers have high turnover rates, which increases training and recruiting costs while sapping productivity. Introduce speech-to-speech AI, and those costs come down.
Ten years ago, contact centers generated massive volumes of recorded calls every day that had to be transcribed and analyzed. As staff turnover increased, institutional memory suffered from a lack of customer familiarity. Companies struggled to understand those conversations and to interact in real time.
Real-time interactions bring challenges and opportunities
Rutgers said Deepgram focuses on real-time interactions. That’s a key difference from many competitors, who focus on narrow, almost pre-determined use cases.
“(With podcasts, for example), an audio designer can sit for hours and make sure the end voice has exactly the personality and the expressiveness they want in their content,” Rutgers explained. “When you’re generating a voice on the fly to have a conversation (in real-time), you don’t get that time.”
Real-time voice AI conversations must handle several things that come naturally in successful human conversations. One is contending with accents. Deepgram works with partners to access the calls and accents they deal with, along with industry-specific jargon such as financial or medical terms.
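One common lever for domain jargon is keyword boosting at request time. The sketch below extends the earlier request with boosted finance terms; the keywords parameter and weight syntax follow Deepgram’s documented pre-recorded API, but treat the exact terms and weights as placeholders for a real deployment’s vocabulary.

```python
# Illustrative sketch: bias recognition toward domain jargon by passing
# boosted keywords with the transcription request. Terms and weights are
# placeholders for your own vocabulary.
import os
import requests

FINANCE_TERMS = ["KYC", "AML", "escrow", "amortization"]

def transcribe_with_jargon(path: str) -> str:
    params = [("model", "nova-2"), ("smart_format", "true")]
    # Each keyword can carry an intensifier, e.g. "escrow:2".
    params += [("keywords", f"{term}:2") for term in FINANCE_TERMS]
    with open(path, "rb") as audio:
        response = requests.post(
            "https://api.deepgram.com/v1/listen",
            params=params,
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=audio,
        )
    response.raise_for_status()
    return response.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```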
Each company’s model is unique to it; no one else gets a license to it. Models are often deployed in a virtual private cloud or on-premises, so data doesn’t leave the environment and remains compliant. Deepgram also manages and scales customer deployments, a service that is becoming especially valuable in the United Kingdom as data privacy rules tighten.
Making voice AI conversations sound more natural
The voice AI industry is slowly chipping away at making AI-generated conversations sound more natural. Natural human conversations have 200-500 milliseconds of latency between turns; today’s industry-best solutions sit between 800 and 1,200 milliseconds. Once latency is addressed, conversation quality will receive more attention.
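A first-order way to see where a system falls in that range is to time how long the first chunk of a streamed response takes to arrive. The sketch below measures time-to-first-chunk against any streaming HTTP voice endpoint; the URL and request details are placeholders for your provider’s API.

```python
# Rough latency gauge: milliseconds from sending the request until the
# first body chunk of a streamed reply arrives. Endpoint-agnostic; pass
# whatever headers and payload your provider requires.
import time
import requests

def time_to_first_chunk(url: str, **request_kwargs) -> float:
    """Return time-to-first-chunk in milliseconds for a streaming POST."""
    start = time.perf_counter()
    with requests.post(url, stream=True, **request_kwargs) as response:
        response.raise_for_status()
        for _ in response.iter_content(chunk_size=1024):
            # First chunk received: stop the clock.
            return (time.perf_counter() - start) * 1000.0
    return float("inf")  # the stream closed without delivering anything
```

A number from this measurement is directly comparable to the 800-1,200 millisecond figure above.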
“We’re not only measuring the real-time latency, but also how often AI is tending to interrupt you and reducing the humanness of what you would take for granted in a conversation, because the AI is not giving you that right now,” Rutgers said.
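Interruption frequency can be measured from timestamps the same way latency can. Assuming a pipeline that emits diarized, timestamped turns (the Turn structure below is a stand-in for whatever your STT output provides), a simple metric counts how often the agent starts speaking before the user’s turn has ended.

```python
# Illustrative interruption metric over diarized, timestamped turns.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    start: float   # seconds
    end: float     # seconds

def interruption_rate(turns: list[Turn]) -> float:
    """Fraction of agent turns that begin before the prior user turn ends."""
    interruptions = 0
    agent_turns = 0
    for prev, cur in zip(turns, turns[1:]):
        if cur.speaker == "agent":
            agent_turns += 1
            if prev.speaker == "user" and cur.start < prev.end:
                interruptions += 1
    return interruptions / agent_turns if agent_turns else 0.0

# Example: the agent cuts in 0.3 s before the user stops talking.
turns = [Turn("user", 0.0, 2.5), Turn("agent", 2.2, 4.0),
         Turn("user", 4.5, 6.0), Turn("agent", 6.4, 8.0)]
print(f"{interruption_rate(turns):.0%} of agent turns interrupt the user")
```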
Any successful voice AI deployment in finance must address unique challenges. Systems can struggle with numbers, dollar amounts, and alphanumerics, reading “3:00 p.m.” as “300 p.m.” or “$5.7 million” as “five dollars and seven cents.”
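A lightweight guardrail against these misreads is a transcript post-processor that flags suspect numeric spans for re-verification. The patterns below are illustrative, covering only the two failure modes quoted above; they are not a substitute for model-side formatting.

```python
# Sanity checks for the failure modes described above: flag likely
# misrendered times and currency amounts in a transcript.
import re

# "300 p.m." style times that lost their colon (valid would be "3:00 p.m.").
BAD_TIME = re.compile(r"\b(\d{3,4})\s*(a\.m\.|p\.m\.)", re.IGNORECASE)
# Spelled-out dollar amounts that may have dropped a magnitude word.
SPELLED_DOLLARS = re.compile(r"\b(\w+) dollars and (\w+) cents\b", re.IGNORECASE)

def flag_suspect_numerics(transcript: str) -> list[str]:
    """Return human-readable warnings for spans worth re-verifying."""
    warnings = []
    for match in BAD_TIME.finditer(transcript):
        digits = match.group(1)
        warnings.append(f"possible colon-less time: {match.group(0)!r} "
                        f"(did the audio say {digits[:-2]}:{digits[-2:]}?)")
    for match in SPELLED_DOLLARS.finditer(transcript):
        warnings.append(f"verify currency magnitude: {match.group(0)!r}")
    return warnings

print(flag_suspect_numerics(
    "Your payment of five dollars and seven cents posts at 300 p.m."))
```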
“These are issues many voice AI companies don’t understand,” Rutgers said. “We deeply understand why a model might hallucinate in this way, what you need to overcome it, and how you can have a successful deployment.”
Preparing customers for voice AI
Companies can’t just thrust a voice AI system on their customers; they have to prepare them ahead of time. Rutgers said that begins with understanding who their customers are and what they expect. A pharmacy chain, for example, learned that many of its senior customers memorized its touchtone menu to expedite the process. It tweaked many of the questions and answers to offer a more natural flow.
“In the financial space, something similar can be done,” Rutgers said. “If someone’s used to calling in, what sorts of questions are they used to being asked, how might they answer them even a little bit more naturally, and have a couple more back-and-forth questions just to start. But as that gets adoption and the retention rates are good, then they can continue to evolve it.”
“It’s not just a technology shift; it’s a behavior shift with your end users as well. As the voices get more natural, and the conversations are much more fluid, the spaces where there is a need to be much more operationally efficient, and there’s a lot of scale and volume, that’s who’s being most successful.”
While OpenAI’s ChatGPT and Anthropic’s Claude have introduced many people to AI, and they have their place, Rutgers said they shouldn’t be the go-to for conversational solutions where context is important.
The voice AI industry is in its early stages, with industry chatter centering on perfecting the most obvious aspects of conversation: response times, natural reactions, and flexibility. Rutgers said getting those right will make the difference between customers asking for a human and sticking with an AI system.
“Over the last couple of years, we’ve also added the voice so that you can speak back, integrate those voices, but also have an end-to-end, speech-to-speech thinking system that allows you to listen, think and speak just as naturally as a human would,” Rutgers said.