Recent advances in text-to-audio (TTA) generation excel at synthesizing short
audio clips but struggle with long-form narrative audio, which requires
temporal coherence and compositional reasoning. To address this gap, we propose
AudioStory, a unified framework that integrates large language models (LLMs)
with TTA systems to generate structured, long-form audio narratives. AudioStory
possesses strong instruction-following and reasoning capabilities for generation. It
employs LLMs to decompose complex narrative queries into temporally ordered
sub-tasks with contextual cues, enabling coherent scene transitions and
emotional tone consistency. AudioStory has two appealing features: (1)
Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser
collaboration into two specialized components, i.e., a bridging query for
intra-event semantic alignment and a residual query for cross-event coherence
preservation. (2) End-to-end training: By unifying instruction comprehension
and audio generation within a single end-to-end framework, AudioStory
eliminates the need for modular training pipelines while enhancing synergy
between components. Furthermore, we establish AudioStory-10K, a benchmark
encompassing diverse domains such as animated soundscapes and natural sound
narratives. Extensive experiments show the superiority of AudioStory on both
single-audio generation and narrative audio generation, surpassing prior TTA
baselines in both instruction-following ability and audio fidelity. Our code is
available at https://github.com/TencentARC/AudioStory.
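
As a point of reference, the sketch below shows one way a decoupled LLM-diffuser bridge of this kind could be wired in PyTorch: learnable bridging and residual query tokens attend to the LLM's hidden states for the current sub-event, the bridging queries condition the diffusion-based TTA generator, and the residual queries are carried forward to the next event. All module names, dimensions, and the conditioning interface here are illustrative assumptions, not the released AudioStory implementation.

```python
# Minimal sketch of a decoupled LLM-diffuser bridge (illustrative only).
# Dimensions, token counts, and the attention interface are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class DecoupledBridge(nn.Module):
    """Maps LLM hidden states for one narrative sub-event into two query sets:
    bridging queries (intra-event semantics) and residual queries that carry
    context across events (cross-event coherence)."""

    def __init__(self, llm_dim=4096, cond_dim=768, n_bridge=32, n_residual=8):
        super().__init__()
        self.bridge_tokens = nn.Parameter(torch.randn(n_bridge, llm_dim))
        self.residual_tokens = nn.Parameter(torch.randn(n_residual, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.to_cond = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_hidden, prev_residual=None):
        # llm_hidden: (B, T, llm_dim) hidden states for the current sub-event.
        B = llm_hidden.size(0)
        queries = torch.cat([
            self.bridge_tokens.expand(B, -1, -1),
            self.residual_tokens.expand(B, -1, -1),
        ], dim=1)
        if prev_residual is not None:
            # Let queries also attend to the residual state of earlier events.
            llm_hidden = torch.cat([llm_hidden, prev_residual], dim=1)
        out, _ = self.attn(queries, llm_hidden, llm_hidden)
        bridge_q, residual_q = out.split(
            [self.bridge_tokens.size(0), self.residual_tokens.size(0)], dim=1)
        # bridge_q conditions the TTA diffusion model for this event;
        # residual_q is passed forward to keep the next event coherent.
        return self.to_cond(bridge_q), residual_q
```

In such a setup the generation loop would call the bridge once per decomposed sub-event, feeding each event's residual queries into the next call, so intra-event conditioning and cross-event memory remain separate pathways.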