Ovi: Twin Backbone Cross-Modal Fusion For Audio-Video Generation - Takara TLDR

Audio-video generation has often relied on complex multi-stage architectures
or sequential synthesis of sound and visuals. We introduce Ovi, a unified
paradigm for audio-video generation that models the two modalities as a single
generative process. By using blockwise cross-modal fusion of twin-DiT modules,
Ovi achieves natural synchronization and removes the need for separate
pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion
modeling, we initialize an audio tower with an architecture identical to that
of a strong pretrained video model. Trained from scratch on hundreds of
thousands of hours of raw audio, the audio tower learns to generate realistic
sound effects, as well as speech that conveys rich speaker identity and
emotion. Fusion is obtained by jointly training the identical video and audio
towers via blockwise exchange of timing (via scaled-RoPE embeddings) and
semantics (through bidirectional cross-attention) on a vast video corpus. Our
model enables cinematic storytelling with natural speech and accurate,
context-matched sound effects, producing movie-grade video clips. All the
demos, code and model weights are published at https://aaxwaz.github.io/Ovi
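
The blockwise exchange described above can be illustrated with a toy sketch. It is not the Ovi implementation: it uses a single attention head, no learned projections, and plain Python lists, and every function name (`apply_rope`, `cross_attend`, `fused_block`) is hypothetical. It also assumes one plausible reading of "scaled-RoPE", namely that audio token positions are rescaled onto the video timeline so that tokens covering the same moment receive similar rotary phases; the abstract does not spell out the scaling rule.

```python
import math

def apply_rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)  # assumed even
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, q_pos, keys, k_pos):
    """Each query token attends over all tokens of the other modality."""
    dim = len(queries[0])
    rot_keys = [apply_rope(k, p) for k, p in zip(keys, k_pos)]
    out = []
    for q, p in zip(queries, q_pos):
        rq = apply_rope(q, p)
        scores = [sum(a * b for a, b in zip(rq, rk)) / math.sqrt(dim)
                  for rk in rot_keys]
        weights = softmax(scores)
        # Values are the unrotated tokens (no learned value projection here).
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(dim)])
    return out

def fused_block(video, audio):
    """One fusion step: both towers read from each other, then add a residual."""
    n_v, n_a = len(video), len(audio)
    video_pos = [float(t) for t in range(n_v)]
    # Assumption: audio indices are rescaled onto the video timeline.
    audio_pos = [t * n_v / n_a for t in range(n_a)]
    v_ctx = cross_attend(video, video_pos, audio, audio_pos)  # video <- audio
    a_ctx = cross_attend(audio, audio_pos, video, video_pos)  # audio <- video
    video_out = [[x + c for x, c in zip(tok, ctx)]
                 for tok, ctx in zip(video, v_ctx)]
    audio_out = [[x + c for x, c in zip(tok, ctx)]
                 for tok, ctx in zip(audio, a_ctx)]
    return video_out, audio_out
```

Calling `fused_block(video, audio)` with, say, 2 video tokens and 6 audio tokens of equal width returns two token lists of the same shapes as the inputs, so a step like this could in principle sit inside every block pair of the two towers.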
