UniVerse-1: Unified Audio-Video Generation via Stitching of Experts - Takara TLDR

We introduce UniVerse-1, a unified, Veo-3-like model capable of
simultaneously generating coordinated audio and video. To enhance training
efficiency, we bypass training from scratch and instead employ a stitching of
experts (SoE) technique. This approach deeply fuses the corresponding blocks of
pre-trained video and music generation expert models, thereby fully leveraging
their foundational capabilities. To ensure accurate annotations and temporal
alignment for both ambient sounds and speech with video content, we developed
an online annotation pipeline that processes the required training data and
generates labels during the training process. This strategy circumvents the
performance degradation often caused by misaligned text-based annotations.
Through the synergy of these techniques, our model, after being finetuned on
approximately 7,600 hours of audio-video data, produces results with
well-coordinated audio-visuals for ambient sound generation and strong
alignment for speech generation. To systematically evaluate our proposed
method, we introduce Verse-Bench, a new benchmark dataset. In an effort to
advance research in audio-video generation and to close the performance gap
with state-of-the-art models such as Veo-3, we make our model and code publicly
available. We hope this contribution will benefit the broader research
community. Project page: https://dorniwang.github.io/UniVerse-1/.
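
The abstract does not spell out how the corresponding blocks of the two experts are fused, so the following is only a toy sketch of the general idea of stitching: pair block i of a video expert with block i of an audio expert and let each stream receive a small signal from the other after every pair. The names StitchedBlock, stitch, run, and the mix weight are hypothetical, and the mean-based exchange is a stand-in for whatever learned fusion the actual model uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]  # one pre-trained expert block (toy stand-in)


def _mean(values: Vector) -> float:
    return sum(values) / len(values)


@dataclass
class StitchedBlock:
    """Pairs one video block with one audio block (hypothetical structure)."""
    video_block: Block
    audio_block: Block
    mix: float = 0.1  # assumed cross-modal mixing weight, not from the paper

    def __call__(self, video: Vector, audio: Vector) -> Tuple[Vector, Vector]:
        # Each expert first processes its own stream unchanged.
        v = self.video_block(video)
        a = self.audio_block(audio)
        # Toy fusion: add a scaled summary of the other stream to each output.
        v_out = [x + self.mix * _mean(a) for x in v]
        a_out = [x + self.mix * _mean(v) for x in a]
        return v_out, a_out


def stitch(video_blocks: Sequence[Block], audio_blocks: Sequence[Block],
           mix: float = 0.1) -> List[StitchedBlock]:
    """Stitch two experts block by block; assumes equal depth for simplicity."""
    if len(video_blocks) != len(audio_blocks):
        raise ValueError("this sketch assumes both experts have the same depth")
    return [StitchedBlock(v, a, mix) for v, a in zip(video_blocks, audio_blocks)]


def run(stitched: Sequence[StitchedBlock], video: Vector,
        audio: Vector) -> Tuple[Vector, Vector]:
    """Pass both streams through every stitched block in order."""
    for block in stitched:
        video, audio = block(video, audio)
    return video, audio


# Toy experts: one block each, defined only for illustration.
toy_video_blocks = [lambda x: [2.0 * t for t in x]]
toy_audio_blocks = [lambda x: [t + 1.0 for t in x]]
```

With mix set to 0.0 the stitched model reduces to the two experts running independently, which mirrors the motivation of starting from pre-trained capabilities rather than training from scratch; a nonzero mix lets each stream influence the other.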
