The Strongest Open-Source Video Sound-Effect Generation Model Released – Tencent Hunyuan

By Advanced AI Editor · September 2, 2025 · 8 Mins Read


The Tencent Hunyuan Laboratory team has just open-sourced a new-generation AI audio-generation system called “Hunyuan Video-Foley,” which provides realistic sound effects for AI-generated videos. The system can reportedly “listen” to video content and generate high-quality soundtracks perfectly synchronized with the on-screen action. Let’s take a look.

Have you ever felt something was missing while watching an AI-generated video? The visuals might be stunning, but the eerie silence can break the immersion. In the film industry, the sounds that fill this silence—rustling leaves, booming thunder, the crisp sound of glass clinking—are known as Foley art, a meticulous craft performed by professionals.

Achieving such detail restoration with AI is a significant challenge. For years, automated systems have struggled to generate convincing audio for videos.

Part 1: A Detailed Introduction to the Model’s Capabilities

Hunyuan Video-Foley is an end-to-end multimodal audio-generation framework capable of producing high-fidelity audio that is semantically consistent with the input video and text and tightly aligned with them in time. To balance its response to the different modal conditions, the framework uses a multimodal audio-generation architecture combining a dual-stream MMDiT with a single-stream audio DiT. The MMDiT module strengthens the temporal dependencies between video and audio through a joint self-attention mechanism with interleaved RoPE, and injects text semantics via cross-attention.
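The interleaved-RoPE idea can be illustrated with a small sketch: video and audio tokens that co-occur in time are given adjacent rotary positions before joint self-attention, so attention scores carry fine-grained temporal order across both modalities. This is a minimal numpy illustration under our own assumptions; `rope` and `interleave_positions` are hypothetical names, not Hunyuan’s actual API, and real models apply RoPE per attention head on query/key tensors.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, dim),
    dim even, rotating feature pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * freqs[None, :]       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def interleave_positions(n_video, n_audio):
    """Assign alternating positions so temporally co-occurring video
    and audio tokens sit next to each other in rotary space."""
    pos = np.arange(n_video + n_audio, dtype=np.float64)
    return pos[0::2][:n_video], pos[1::2][:n_audio]
```

Because RoPE is a pure rotation of feature pairs, it encodes relative position without changing token norms, which is why it composes cleanly with joint self-attention over the concatenated video and audio streams.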

By aligning the hidden-layer embeddings of the single-stream audio DiT with audio features extracted from a pre-trained self-supervised model through a representation-alignment loss, the quality of the generated audio is significantly improved. In addition, an improved DAC-VAE was designed on top of the DAC structure, replacing its discrete tokens with 128-dimensional continuous representations, which greatly strengthens audio reconstruction.
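The representation-alignment loss described here can be sketched as a cosine distance between projected DiT hidden states and frozen self-supervised audio features. This is a hedged sketch, not the paper’s actual formulation; `repa_loss` and the single linear projection are our own illustrative choices.

```python
import numpy as np

def repa_loss(hidden, ssl_feats, proj, eps=1e-8):
    """Representation alignment: project the DiT hidden states into the
    SSL feature space and penalize cosine distance per time step.
    hidden: (T, d_model), ssl_feats: (T, d_ssl), proj: (d_model, d_ssl)."""
    z = hidden @ proj
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    y = ssl_feats / (np.linalg.norm(ssl_feats, axis=-1, keepdims=True) + eps)
    # Mean cosine distance over time; 0 when perfectly aligned.
    return float(np.mean(1.0 - np.sum(z * y, axis=-1)))
```

In training, the SSL encoder stays frozen and the gradient flows only into the DiT and the projection, nudging intermediate representations toward features known to correlate with clean, stable audio.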

Hunyuan Video-Foley has been trained on a high-quality multimodal dataset of 100,000 hours, covering the vast majority of video scenarios, and has demonstrated exceptional performance on multiple authoritative evaluation benchmarks, achieving the most powerful video sound effect generation results to date.

Part 2: How Did Tencent Solve the Challenge of AI Sound-Effect Generation for Video?

Researchers attribute the poor performance of past video-to-audio (V2A) models to what they term “modal imbalance”: the AI focuses more on parsing textual instructions than on analyzing the actual content of the video.

For example, if a model is given a video showing a bustling beach (with pedestrians strolling and seagulls soaring), but the text prompt only requests “sound of waves,” the system often only generates wave sounds. The AI completely ignores the sounds of footsteps on the sand and birds chirping, resulting in a lifeless scene.

Moreover, the generated audio quality is often unsatisfactory, and there is a severe lack of high-quality audio-video training data.

The Tencent Hunyuan team tackled these challenges on three fronts:

1. Building a 100,000-hour database of video-audio-text description triplets, using an automated pipeline to filter out low-quality internet content and remove segments with long silences or compressed, muddy audio, so the model learns from the best possible material.

2. Designing an intelligent multi-task architecture. The system first focuses on audiovisual temporal correlation, such as precisely matching the sound of a footstep with the moment the sole contacts the ground. Once the timing is locked in, it integrates the text prompt to understand the overall atmosphere of the scene. This dual mechanism ensures that no video detail is overlooked.

3. Employing a Representation Alignment (REPA) training strategy to ensure audio quality. By comparing the model’s outputs with features from pre-trained professional audio models, it guides generation toward clearer, fuller, and more stable sound effects.
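The silence-filtering step in the data pipeline could, for instance, be approximated with a simple frame-energy heuristic. The 50 ms frame length and the −40 dBFS threshold below are illustrative assumptions, not values from Tencent’s pipeline.

```python
import numpy as np

def silence_ratio(wave, sr, frame_ms=50, thresh_db=-40.0):
    """Fraction of fixed-length frames whose RMS level (dBFS) falls
    below thresh_db; wave is a mono float signal in [-1, 1]."""
    n = max(1, int(sr * frame_ms / 1000))
    frames = wave[: len(wave) // n * n].reshape(-1, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=-1))
    db = 20.0 * np.log10(rms + 1e-12)
    return float(np.mean(db < thresh_db))

def keep_clip(wave, sr, max_silence=0.5):
    """Drop clips that are mostly silent, mimicking the kind of
    automated filter the data pipeline describes."""
    return silence_ratio(wave, sr) <= max_silence
```

A real pipeline would add further checks (codec artifacts, bandwidth, clipping), but even a filter this crude removes the long-silence segments the article mentions.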

Part 3: Measured Results Showcase the Advantages

In comparative tests against mainstream AI models, Hunyuan Video-Foley’s advantages in audio generation are evident. Not only are its computational evaluation metrics superior, but human listeners also unanimously rate its output as better sounding, more closely matched to the video, and more precisely timed.

Official Examples:

In terms of synchronizing audio with visual content and aligning it in time, the system achieves significant improvements across all evaluation dimensions, a result confirmed on multiple authoritative datasets.

This research effectively bridges the gap between silent AI videos and immersive viewing experiences. It brings the magic of professional Foley sound effects into the realm of automated content creation, providing powerful support for filmmakers, animators, and creators worldwide.

Below, we evaluate several of the officially presented demo videos from a professional perspective:

01: Coffee Pouring

Video prompt: Soft sound of pouring coffee

First, the video itself has problems: the coffee cup is too large, and the proportions between the hand and the cup being poured are very strange. When the video is flawed, any generated audio will feel off no matter what, so it would be best to replace this demo video.

Returning to the audio, the sound itself also has issues: it does not resemble liquid being poured into a white porcelain coffee cup. The water sound also has no variation; there is no change as the cup fills from empty to full. It is commendable that background ambience was generated even without a prompt for it, but overall the sound does not reach a professional level.

02: Stream Surface Flyover

Video prompt: The gentle sound of the stream flowing, with a background of a melodious piano solo carrying a serene classical melody that brings peace and tranquility

Although its name and positioning emphasize Foley sound effects, this is really a comprehensive model for generating audio for video, music generation included. Judged by stricter professional standards, the stream sound is a bit thin: the water is much quieter and subtler than what the video shows, and it does not change with the camera’s movement, so as background sound it is merely average and somewhat amateurish. That said, the music generation and the mix ratio between the music and the stream sounds are still noteworthy.

03: A Girl Walking Through a Forest

Video prompt: The sound of vines and leaves rustling, curtains being lifted, accompanied by the gentle footsteps of a person entering a clearing in the woods

Judging the audio against the prompt, the sound of the vines and leaves is too dry and does not match hands brushing against vines; without reading the prompt, one might miss it entirely. The oddest part is the footsteps: the outgoing and returning steps clearly land on the same ground yet sound very different, possibly because the phrase about “gentle footsteps” constrained the output. Finally, the crunch of dry leaves is very harsh and does not fit the season and overall mood of the video.

04: Light Rain, Car Departing

Video prompt: Wheels rolling over a wet surface

This video’s sound is the best of the four: soft rain and the ambient noise of city streets as the car pulls out onto the wet road. Overall it is quite good, especially the car approaching from a distance, which the model recognized and rendered as a vehicle driving on a wet road far away.

By professional standards, however, the foreground sound of the car pulling out fluctuates noticeably in level, and the detail of the rear tires rolling through small puddles is missing; such details are what define professional work.

In summary, this video-generated audio model shows improvements over other similar models in certain details but does not represent a qualitative leap, nor does it provide us with a markedly different experience. Its approach of generating audio based on prompts is also quite common at present.

From the audio perspective, these so-called professional-level generated sound effects are still somewhat distant from true professionalism, especially when paired with already problematic AI-generated visuals. Considering that these more vertical large models are still in their infancy, we remain very curious about the future.

As AI vendors increasingly attach labels like “professional” and “cinema-level” to AIGC, the actual results often fall far short of the expectations those labels set; we hope these teams will engage in more genuine dialogue with working professionals.

What do you think? Feel free to leave a comment!
