UniVideo: Unified Understanding, Generation, and Editing for Videos - Takara TLDR

Unified multimodal models have shown promising results in multimodal content
generation and editing but remain largely limited to the image domain. In this
work, we present UniVideo, a versatile framework that extends unified modeling
to the video domain. UniVideo adopts a dual-stream design, combining a
Multimodal Large Language Model (MLLM) for instruction understanding with a
Multimodal DiT (MMDiT) for video generation. This design enables accurate
interpretation of complex multimodal instructions while preserving visual
consistency. Built on this architecture, UniVideo unifies diverse video
generation and editing tasks under a single multimodal instruction paradigm and
is jointly trained across them. Extensive experiments demonstrate that UniVideo
matches or surpasses state-of-the-art task-specific baselines in
text/image-to-video generation, in-context video generation, and in-context
video editing. Notably, the unified design of UniVideo enables two forms of
generalization. First, UniVideo supports task composition, such as combining
editing with style transfer, by integrating multiple capabilities within a
single instruction. Second, even without explicit training on free-form video
editing, UniVideo transfers its editing capability from large-scale image
editing data to this setting, handling unseen instructions such as
green-screening characters or changing materials within a video. Beyond these
core capabilities, UniVideo also supports visual-prompt-based video generation,
where the MLLM interprets visual prompts and guides the MMDiT during synthesis.
To foster future research, we will release our model and code.
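
The dual-stream data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the class names, method names, and string-based "conditioning" are all hypothetical stand-ins, and passing the visual inputs directly to the generation stream is an assumption made for illustration. It only shows the shape of the pipeline, in which one stream interprets the full multimodal instruction and a second stream generates frames under that guidance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    """A single multimodal request: text plus optional visual inputs."""
    text: str
    images: List[str] = field(default_factory=list)  # placeholder file names
    videos: List[str] = field(default_factory=list)  # placeholder file names


class ToyUnderstandingStream:
    """Hypothetical stand-in for the MLLM: encodes the whole instruction."""

    def encode(self, inst: Instruction) -> List[str]:
        # A real MLLM would return hidden states; this toy emits tagged tokens.
        tokens = [f"text:{word}" for word in inst.text.lower().split()]
        tokens += [f"image:{path}" for path in inst.images]
        tokens += [f"video:{path}" for path in inst.videos]
        return tokens


class ToyGenerationStream:
    """Hypothetical stand-in for the MMDiT: produces frames from conditioning."""

    def generate(self, condition: List[str], inst: Instruction,
                 num_frames: int) -> List[str]:
        # Assumption: the generator also receives the raw visual inputs,
        # in addition to the conditioning from the understanding stream.
        visual = inst.images + inst.videos
        return [
            f"frame{i}|cond={len(condition)}|visual={len(visual)}"
            for i in range(num_frames)
        ]


def run(inst: Instruction, num_frames: int = 4) -> List[str]:
    """Run both streams in sequence: understand first, then generate."""
    condition = ToyUnderstandingStream().encode(inst)
    return ToyGenerationStream().generate(condition, inst, num_frames)


if __name__ == "__main__":
    edit_request = Instruction("Make the car red", videos=["clip.mp4"])
    for frame in run(edit_request, num_frames=3):
        print(frame)
```

Because every task is expressed as the same `Instruction` object, text-to-video, image-to-video, and editing requests all pass through one code path in this sketch, which is the property the unified instruction paradigm relies on.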
