DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

DMOSpeech 2 optimizes duration prediction and introduces teacher-guided sampling to enhance speech synthesis performance and diversity.

Diffusion-based text-to-speech (TTS) systems have made remarkable progress in
zero-shot speech synthesis, yet optimizing all components for perceptual
metrics remains challenging. Prior work with DMOSpeech demonstrated direct
metric optimization for speech generation components, but duration prediction
remained unoptimized. This paper presents DMOSpeech 2, which extends metric
optimization to the duration predictor through a reinforcement learning
approach. The proposed system implements a novel duration policy framework
using group relative policy optimization (GRPO) with speaker similarity and
word error rate as reward signals. By optimizing this previously unoptimized
component, DMOSpeech 2 creates a more complete metric-optimized synthesis
pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid
approach leveraging a teacher model for initial denoising steps before
transitioning to the student model, significantly improving output diversity
while maintaining efficiency. Comprehensive evaluations demonstrate superior
performance across all metrics compared to previous systems, while reducing
sampling steps by half without quality degradation. These advances represent a
significant step toward speech synthesis systems with metric optimization
across multiple components. The audio samples, code and pre-trained models are
available at https://dmospeech2.github.io/.
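
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the way speaker similarity and word error rate are combined into one reward, the `wer_weight` parameter, and the step counts are all assumptions made for illustration. The sketch shows (1) how a group-relative advantage can be computed for several durations sampled for the same utterance, and (2) how a sampling schedule might hand the first denoising steps to a teacher model before switching to the student.

```python
from statistics import mean, pstdev


def combined_reward(speaker_sim, wer, wer_weight=1.0):
    """Hypothetical scalar reward: higher speaker similarity is better,
    lower word error rate is better. The weighting is an assumption."""
    return speaker_sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def teacher_guided_schedule(total_steps, teacher_steps):
    """Return which model handles each denoising step: the teacher for the
    first `teacher_steps` steps, the student for the remainder."""
    if not 0 <= teacher_steps <= total_steps:
        raise ValueError("teacher_steps must lie in [0, total_steps]")
    return ["teacher"] * teacher_steps + ["student"] * (total_steps - teacher_steps)


if __name__ == "__main__":
    # Four candidate durations sampled for one utterance (made-up scores).
    sims = [0.70, 0.60, 0.65, 0.50]
    wers = [0.10, 0.30, 0.15, 0.30]
    rewards = [combined_reward(s, w) for s, w in zip(sims, wers)]
    print(group_relative_advantages(rewards))

    # Illustrative step counts only; the abstract does not state them.
    print(teacher_guided_schedule(total_steps=4, teacher_steps=1))
```

Because each advantage is measured relative to the other samples in the same group, candidates that beat the group average are reinforced and the rest are discouraged, without needing a separately trained value model.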
