ProsodyLM was pre-trained on 30,000 hours of audiobooks and ultimately demonstrated better prosody understanding than prior models across various categories. ProsodyLM could, for example, correctly recognize emotion and stress in speech utterances without being trained to perform those tasks. Because prosody information and content are explicitly tokenized, the resulting language model can generate very expressive speech, develop a preliminary understanding of emphasis and emotion, and successfully clone the style of reference speech.
“Now, instead of AI commentators that speak in a monotone level of excitement and sound unnatural to audiences, these tools can express a high excitement level just like human commentators, who get much more expressive during a very exciting rally,” Zhang said.
Looking forward, once the prototype has advanced to production and excitement-driven sports commentary is rolled out in official tennis tournaments, a next step could be letting fans personalize the commentary, Feris said. For example, fans could choose between high- and low-excitement commentary. In the meantime, Zhang said, the team is receiving a lot of interest from researchers and clients working on other sports, such as Formula 1 car racing.
In addition, this excitement-driven AI sports commentary was part of IBM's 2025 US Open “Behind the Scenes” demo of emerging tennis technologies. This means that in the not-so-distant future, you might want to tune in more closely to see if you can detect whether it’s a human or an AI announcer whipping up the crowd after an overhead smash or a tricky drop shot.