Paper page - Vidi: Large Multimodal Models for Video Understanding and Editing

Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model can process hour-long videos with strong
temporal understanding, e.g., retrieving the time ranges that match given
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than existing temporal retrieval datasets; 2)
Audio support: includes audio-based queries; 3) Query format: diverse query
lengths/formats; 4) Annotation quality: ground-truth time ranges are manually
annotated; 5) Evaluation metric: a refined IoU metric to support evaluation
over multiple time ranges. Remarkably, Vidi significantly outperforms leading
proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task,
indicating its superiority in video editing scenarios.
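
To make the idea of an IoU computed over multiple time ranges concrete, here is a minimal Python sketch. It is an illustration only, not the refined metric defined for VUE-TR, whose exact formulation is not given in this abstract. It assumes that predictions and ground truth are lists of (start, end) pairs in seconds, and that the score is the total overlapping duration divided by the total duration covered by either set. The function names (`merge_ranges`, `intersection_length`, `multi_range_iou`) are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start_sec, end_sec); assumed representation


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge any that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if end <= start:
            continue  # skip empty or inverted ranges
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges: List[Range]) -> float:
    """Total duration covered by a list of non-overlapping ranges."""
    return sum(end - start for start, end in ranges)


def intersection_length(a: List[Range], b: List[Range]) -> float:
    """Overlapping duration of two merged, sorted range lists (two-pointer sweep)."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def multi_range_iou(pred: List[Range], truth: List[Range]) -> float:
    """Illustrative IoU over sets of time ranges (not the official VUE-TR metric)."""
    p, t = merge_ranges(pred), merge_ranges(truth)
    inter = intersection_length(p, t)
    union = total_length(p) + total_length(t) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (20.0, 30.0)]
    annotated = [(5.0, 15.0), (20.0, 25.0)]
    print(multi_range_iou(predicted, annotated))
```

In this example the two sets overlap for 10 seconds and jointly cover 25 seconds, so the score is 0.4. Merging each set first keeps duplicated or overlapping predictions from being counted twice, which is one reasonable design choice when a query can match several segments of a long video.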
