ThinkSound is a novel framework that uses Chain-of-Thought reasoning with a multimodal large language model to generate and edit high-quality audio from videos, achieving state-of-the-art results across multiple benchmarks.
While end-to-end video-to-audio generation has greatly improved, producing
high-fidelity audio that authentically captures the nuances of visual content
remains challenging. Such generation, much like the work of professionals in
the creative industries, requires sophisticated reasoning about visual
dynamics, acoustic environments, and temporal relationships. We present
ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning
to enable stepwise, interactive audio generation and editing for videos. Our
approach decomposes the process into three complementary stages: foundational
foley generation that creates semantically coherent soundscapes, interactive
object-centric refinement through precise user interactions, and targeted
editing guided by natural language instructions. At each stage, a multimodal
large language model generates contextually aligned CoT reasoning that guides a
unified audio foundation model. Furthermore, we introduce AudioCoT, a
comprehensive dataset with structured reasoning annotations that establishes
connections between visual content, textual descriptions, and sound synthesis.
Experiments demonstrate that ThinkSound achieves state-of-the-art performance
in video-to-audio generation across both audio metrics and CoT metrics, and
excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is
available at https://ThinkSound-Project.github.io.
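To make the staged control flow concrete, below is a minimal sketch of the three-stage, CoT-guided pipeline as the abstract describes it. Every class, method, and argument name here (`MultimodalLLM`, `AudioFoundationModel`, `generate_cot`, etc.) is a hypothetical placeholder for illustration, not the actual ThinkSound API.

```python
# Minimal sketch of the three-stage, CoT-guided pipeline described above.
# All names are hypothetical placeholders, not the real ThinkSound interface.

from dataclasses import dataclass


@dataclass
class CoTStep:
    """One chain-of-thought string produced by the multimodal LLM."""
    stage: str
    reasoning: str


class MultimodalLLM:
    """Placeholder for the multimodal LLM that writes CoT reasoning."""

    def generate_cot(self, stage: str, video: str, context: str = "") -> CoTStep:
        # A real system would condition on video frames, audio generated so
        # far, and any user prompt; here we return a stub string.
        return CoTStep(stage=stage,
                       reasoning=f"[{stage}] reasoning for {video} {context}".strip())


class AudioFoundationModel:
    """Placeholder for the unified audio generation/editing model."""

    def generate(self, video: str, cot: CoTStep,
                 prior_audio: str | None = None) -> str:
        # A real model would return a waveform; we return a descriptive tag.
        base = prior_audio or "silence"
        return f"audio({base} + {cot.stage})"


def thinksound_pipeline(video: str, object_query: str, edit_instruction: str) -> str:
    mllm = MultimodalLLM()
    audio_model = AudioFoundationModel()

    # Stage 1: foundational foley generation for the whole scene.
    cot1 = mllm.generate_cot("foley", video)
    audio = audio_model.generate(video, cot1)

    # Stage 2: object-centric refinement driven by a user-selected object.
    cot2 = mllm.generate_cot("object_refinement", video, context=object_query)
    audio = audio_model.generate(video, cot2, prior_audio=audio)

    # Stage 3: targeted editing from a natural-language instruction.
    cot3 = mllm.generate_cot("editing", video, context=edit_instruction)
    audio = audio_model.generate(video, cot3, prior_audio=audio)

    return audio


if __name__ == "__main__":
    print(thinksound_pipeline("clip.mp4", "the barking dog", "make the rain softer"))
```

The sketch only illustrates the control flow the abstract outlines: each stage asks the multimodal LLM for contextually aligned CoT reasoning, which then conditions a single audio foundation model that progressively refines the soundtrack.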