Alibaba Introduces ThinkSound: An AI Model Generating Realistic Audio for Videos

Published on July 16, 2025

Creating high-quality audio for video content presents numerous technical and creative challenges, impacting both novices and experienced audio professionals. Producers often grapple with issues like noise management, balancing dialogue with sound effects, meeting budgetary and time constraints, and maintaining creative consistency. Translating artistic vision into a cohesive final product that accurately reflects visual dynamics, acoustic environments, and timing also remains challenging.

To address these challenges, Alibaba’s Tongyi Speech Lab has introduced ThinkSound, a novel open-source multimodal LLM utilizing Chain-of-Thought (CoT) reasoning for advanced audio generation and editing. ThinkSound offers a structured, interactive approach to audio production, specifically tailored for video content. The model, available in three compact sizes – 1.3B, 724M, and 533M parameters – supports video-to-audio generation, text-based audio editing, and interactive audio creation, even on edge devices.

ThinkSound mimics the multi-stage workflow of human sound designers, ensuring generated audio remains contextually accurate, cohesive, and high quality throughout production. The model first analyzes a video’s visual dynamics, logically interprets corresponding acoustic attributes, and then synthesizes contextually appropriate audio.
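The three stages described above (analyze the video, reason about the matching sounds, then synthesize) can be pictured with a small toy sketch. This is not ThinkSound's code or API. Every name here (VisualEvent, AcousticPlan, analyze_video, reason_about_sound, synthesize, run_pipeline) is hypothetical, the "video analysis" is just hand-written event descriptions, and the "synthesis" step returns a text cue sheet rather than audio. It only illustrates how a staged, reasoning-first workflow passes information from one step to the next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisualEvent:
    """Hypothetical record of something visible in the video."""
    start_s: float
    end_s: float
    description: str


@dataclass
class AcousticPlan:
    """Hypothetical reasoning step: what a visual event should sound like."""
    event: VisualEvent
    reasoning: str
    sound_prompt: str


def analyze_video(events: List[VisualEvent]) -> List[VisualEvent]:
    # Stage 1 (placeholder): a real system would derive events from frames.
    # Here we only order hand-written events by their start time.
    return sorted(events, key=lambda e: e.start_s)


def reason_about_sound(events: List[VisualEvent]) -> List[AcousticPlan]:
    # Stage 2 (placeholder): attach a chain-of-thought style note to each event.
    plans = []
    for e in events:
        reasoning = (
            f"From {e.start_s:.1f}s to {e.end_s:.1f}s the video shows "
            f"{e.description}, so matching audio is needed in that window."
        )
        plans.append(AcousticPlan(e, reasoning, f"sound of {e.description}"))
    return plans


def synthesize(plans: List[AcousticPlan]) -> List[str]:
    # Stage 3 (placeholder): emit a timed text cue sheet instead of audio.
    return [
        f"[{p.event.start_s:.1f}-{p.event.end_s:.1f}s] {p.sound_prompt}"
        for p in plans
    ]


def run_pipeline(events: List[VisualEvent]) -> List[str]:
    # Chain the three stages in the order the article describes.
    return synthesize(reason_about_sound(analyze_video(events)))


if __name__ == "__main__":
    demo = [
        VisualEvent(2.0, 3.5, "a door closing"),
        VisualEvent(0.0, 2.0, "footsteps on gravel"),
    ]
    for cue in run_pipeline(demo):
        print(cue)
```

Keeping the reasoning step as its own stage means the intermediate plan can be inspected or edited before any audio is produced, which is the kind of interactive refinement the article attributes to the model.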

Through its innovative approach, ThinkSound enables users to create detailed and coherent soundscapes, refine generated audio through intuitive user interactions, and edit specific audio segments using natural language instructions, effectively bridging the gap between creative intention and automated audio production.

Additionally, Alibaba’s research team introduced AudioCoT, a large-scale multimodal dataset featuring audio-specific CoT annotations, enhancing the alignment between visual content, textual descriptions, and sound synthesis.

Extensive evaluations have demonstrated that ThinkSound achieves state-of-the-art performance in video-to-audio generation, delivering contextually accurate and precisely timed soundscapes. The model excels in traditional audio quality metrics and CoT-based evaluations. Furthermore, on the MovieGen Audio Bench – a benchmark assessing video audio-generation capabilities – ThinkSound significantly outperforms other leading models.

Figure 1: Comparison of the ThinkSound foundation model with existing video-to-audio baselines on the VGGSound test set. ↓ indicates lower is better, ↑ indicates higher is better.

ThinkSound can seamlessly integrate with various video-generation models to provide realistic voiceovers and soundtracks for synthesized videos. Its sophisticated audio-generation capabilities offer significant potential applications in film and television sound design, audio post-production, and immersive sound experiences for gaming and virtual reality.

ThinkSound is now available as open source on Hugging Face, GitHub, and Alibaba’s Model Studio.

Figure 2: Out-of-distribution evaluation on MovieGen Audio Bench.
