Paper Page - SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

SoloSpeech, a cascaded generative pipeline, improves target speech extraction and speech separation by addressing artifact introduction, naturalness reduction, and environment mismatches, achieving state-of-the-art intelligibility and quality.

Target Speech Extraction (TSE) aims to isolate a target speaker’s voice from
a mixture of multiple speakers by leveraging speaker-specific cues, typically
provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in
TSE have primarily employed discriminative models that offer high perceptual
quality, these models often introduce unwanted artifacts, reduce naturalness,
and are sensitive to discrepancies between training and testing environments.
On the other hand, generative models for TSE lag in perceptual quality and
intelligibility. To address these challenges, we present SoloSpeech, a novel
cascaded generative pipeline that integrates compression, extraction,
reconstruction, and correction processes. SoloSpeech features a
speaker-embedding-free target extractor that utilizes conditional information
from the cue audio’s latent space, aligning it with the mixture audio’s latent
space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset,
SoloSpeech achieves new state-of-the-art intelligibility and quality in
target speech extraction and speech separation tasks while demonstrating
exceptional generalization on out-of-domain data and real-world scenarios.
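
To make the four-stage cascade concrete, here is a minimal structural sketch in Python. It only illustrates how compression, extraction, reconstruction, and correction might be chained, with the cue audio encoded by the same compressor as the mixture so that both live in one latent space and no separate speaker-embedding model is needed. The class name `CascadedTSE` and all `toy_*` functions are hypothetical placeholders invented for this illustration; they are not SoloSpeech's actual components, which the abstract describes as generative models.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]   # raw waveform samples (toy stand-in)
Latent = List[float]  # compressed representation (toy stand-in)


@dataclass
class CascadedTSE:
    """Hypothetical wrapper that chains the four stages named in the abstract."""
    compress: Callable[[Audio], Latent]
    extract: Callable[[Latent, Latent], Latent]
    reconstruct: Callable[[Latent], Audio]
    correct: Callable[[Audio], Audio]

    def __call__(self, mixture: Audio, cue: Audio) -> Audio:
        # Assumption: one shared compressor encodes both signals, so the
        # cue conditions the extractor in the mixture's own latent space.
        z_mix = self.compress(mixture)
        z_cue = self.compress(cue)
        z_target = self.extract(z_mix, z_cue)   # no speaker embedding involved
        rough = self.reconstruct(z_target)      # back to the waveform domain
        return self.correct(rough)              # final clean-up pass


# Placeholder stages: trivial arithmetic, NOT the models from the paper.
def toy_compress(x: Audio, hop: int = 2) -> Latent:
    # Average every `hop` samples into one latent frame.
    return [sum(x[i:i + hop]) / len(x[i:i + hop]) for i in range(0, len(x), hop)]


def toy_extract(z_mix: Latent, z_cue: Latent) -> Latent:
    # Scale the mixture latent by the cue's mean absolute level.
    gain = sum(abs(v) for v in z_cue) / max(len(z_cue), 1)
    return [gain * v for v in z_mix]


def toy_reconstruct(z: Latent, hop: int = 2) -> Audio:
    # Repeat each latent frame `hop` times to restore the original length.
    return [v for v in z for _ in range(hop)]


def toy_correct(x: Audio) -> Audio:
    # Clip samples to the valid range [-1, 1].
    return [max(-1.0, min(1.0, v)) for v in x]


pipeline = CascadedTSE(toy_compress, toy_extract, toy_reconstruct, toy_correct)
estimate = pipeline([0.2, 0.4, -0.6, -0.2], [1.0, 1.0])
```

In a real system each callable would be a trained model; keeping the stages as interchangeable callables simply mirrors the cascaded design, where each stage can be developed and evaluated separately.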
