A comprehensive TTS benchmark, EmergentTTS-Eval, automates test-case generation and evaluation using LLMs and a Large Audio Language Model (LALM) to assess how nuanced and semantically complex text is rendered in speech.
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle
nuanced and semantically complex text. Building on EmergentTTS, we
introduce EmergentTTS-Eval, a comprehensive benchmark covering six
challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic
complexity, complex pronunciation (e.g. URLs, formulas), and questions.
Crucially, our framework automates both test-case generation and evaluation,
making the benchmark easily extensible. Starting from a small set of
human-written seed prompts, we iteratively extend them using LLMs to target
specific structural, phonetic, and prosodic challenges, resulting in 1,645
diverse test cases. Moreover, we employ a model-as-a-judge approach, using a
Large Audio Language Model (LALM) to assess the speech across multiple
dimensions such as expressed emotion, prosody, intonation, and pronunciation
accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems,
such as 11Labs, Deepgram, and OpenAI’s 4o-mini-TTS, on EmergentTTS-Eval,
demonstrating its ability to reveal fine-grained performance differences.
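As a rough illustration of the two automated stages above, the sketch below pairs an LLM-driven seed-evolution loop with an LALM judge call. It is a minimal sketch under stated assumptions: the completion callables, prompt wording, and rubric formatting are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of the two automated stages, assuming generic text- and
# audio-completion callables; all names and prompts here are hypothetical.

def evolve_test_cases(complete, seed_prompts, category, depth=3):
    """Iteratively extend human-written seeds into harder variants."""
    cases = list(seed_prompts)
    frontier = list(seed_prompts)
    for _ in range(depth):
        next_frontier = []
        for case in frontier:
            # Ask the LLM to intensify the target challenge (e.g., denser
            # paralinguistic cues, trickier pronunciations) while staying natural.
            harder = complete(
                f"Rewrite this TTS test sentence to be more challenging for "
                f"the '{category}' scenario, keeping it natural:\n{case}"
            )
            next_frontier.append(harder)
        cases.extend(next_frontier)
        frontier = next_frontier
    return cases

def judge_speech(lalm_complete, text, audio_bytes):
    """Model-as-a-judge: an LALM scores rendered speech on rubric dimensions."""
    rubric = "expressed emotion, prosody, intonation, pronunciation accuracy"
    prompt = (
        f"Given the target text:\n{text}\n"
        f"Score the attached speech from 0-10 on each of: {rubric}. "
        f"Reply as JSON."
    )
    return lalm_complete(prompt, audio_bytes)
```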
Results show that the model-as-a-judge approach offers robust TTS assessment
and a high correlation with human preferences. We open-source the evaluation
code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset
(https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
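For reference, a quick way to inspect the released test cases, assuming the dataset follows the standard Hugging Face `datasets` layout (split and column names are not specified here):

```python
# Load and inspect the released benchmark; requires `pip install datasets`.
from datasets import load_dataset

ds = load_dataset("bosonai/EmergentTTS-Eval")
print(ds)  # shows the available splits, columns, and number of test cases
```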