Alibaba Introduces ThinkSound: An AI Model Generating Realistic Audio for Videos

Published on July 16, 2025

Creating high-quality audio for video content presents numerous technical and creative challenges, impacting both novices and experienced audio professionals. Producers often grapple with issues like noise management, balancing dialogue with sound effects, meeting budgetary and time constraints, and maintaining creative consistency. Translating artistic vision into a cohesive final product that accurately reflects visual dynamics, acoustic environments, and timing also remains challenging.

To address these challenges, Alibaba’s Tongyi Speech Lab has introduced ThinkSound, an open-source multimodal LLM that uses Chain-of-Thought (CoT) reasoning for audio generation and editing. ThinkSound offers a structured, interactive approach to audio production tailored to video content. The model comes in three compact sizes (1.3B, 724M, and 533M parameters) and supports video-to-audio generation, text-based audio editing, and interactive audio creation, even on edge devices.

ThinkSound mimics the multi-stage workflow of human sound designers so that generated audio stays contextually accurate, cohesive, and high quality throughout production. The model first analyzes a video’s visual dynamics, then infers the corresponding acoustic attributes, and finally synthesizes contextually appropriate audio.
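The three stages described above can be pictured with a toy sketch. This is not ThinkSound's code or API: every name here (`VisualEvent`, `SoundCue`, `ACOUSTIC_HINTS`, `plan_audio`) is hypothetical, the keyword lookup stands in for the model's learned reasoning, and the final stage emits a text plan rather than real audio.

```python
from dataclasses import dataclass


@dataclass
class VisualEvent:
    """A hypothetical, hand-labelled event seen in a video."""
    start_s: float    # when the event begins, in seconds
    end_s: float      # when the event ends, in seconds
    description: str  # e.g. "Door slams shut"


@dataclass
class SoundCue:
    """A planned sound aligned to one visual event."""
    start_s: float
    end_s: float
    reasoning: str  # the explicit intermediate reasoning step
    sound: str      # what a synthesizer would be asked to render


# Assumption: a tiny keyword table standing in for learned acoustic reasoning.
ACOUSTIC_HINTS = {
    "door": "short wooden thud with room reverb",
    "rain": "steady broadband rain hiss",
    "footsteps": "rhythmic footsteps on a hard floor",
}


def analyze_video(events):
    """Stage 1: order the visual events by start time."""
    return sorted(events, key=lambda e: e.start_s)


def reason_about_sound(event):
    """Stage 2: return a reasoning step and an acoustic description."""
    text = event.description.lower()
    for keyword, sound in ACOUSTIC_HINTS.items():
        if keyword in text:
            reasoning = f"'{event.description}' involves '{keyword}', so expect: {sound}"
            return reasoning, sound
    return f"No known source in '{event.description}'; use ambience", "quiet room tone"


def plan_audio(events):
    """Stage 3 (stand-in): emit time-aligned cues instead of real audio."""
    cues = []
    for event in analyze_video(events):
        reasoning, sound = reason_about_sound(event)
        cues.append(SoundCue(event.start_s, event.end_s, reasoning, sound))
    return cues


if __name__ == "__main__":
    demo = [
        VisualEvent(2.0, 2.5, "Door slams shut"),
        VisualEvent(0.0, 5.0, "Rain falls on the street"),
    ]
    for cue in plan_audio(demo):
        print(f"{cue.start_s:.1f}-{cue.end_s:.1f}s: {cue.sound}")
```

Keeping the reasoning step as an explicit field on each cue mirrors the idea that the intermediate interpretation is inspectable before any sound is produced; a real system would replace the lookup table and the text plan with model inference and audio synthesis.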

With this approach, users can create detailed and coherent soundscapes, refine generated audio interactively, and edit specific audio segments using natural language instructions, bridging the gap between creative intention and automated audio production.

Additionally, Alibaba’s research team introduced AudioCoT, a large-scale multimodal dataset featuring audio-specific CoT annotations, enhancing the alignment between visual content, textual descriptions, and sound synthesis.

In the team’s evaluations, ThinkSound achieves state-of-the-art performance in video-to-audio generation, delivering contextually accurate and precisely timed soundscapes. The model leads on both traditional audio quality metrics and CoT-based evaluations. On the MovieGen Audio Bench, a benchmark assessing video audio-generation capabilities, ThinkSound also significantly outperforms other leading models.

ThinkSound 1: Comparison of our ThinkSound foundation model with existing video-to-audio baselines on the VGGSound test set. ↓ indicates lower is better, ↑ indicates higher is better.

ThinkSound can seamlessly integrate with various video-generation models to provide realistic voiceovers and soundtracks for synthesized videos. Its sophisticated audio-generation capabilities offer significant potential applications in film and television sound design, audio post-production, and immersive sound experiences for gaming and virtual reality.

ThinkSound is now available as open source on Hugging Face, GitHub, and Alibaba’s Model Studio.

ThinkSound 2: Out-of-distribution evaluation on MovieGen Audio Bench.
