Paper Page - DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-Efficient Video-Text Retrieval

The paper proposes DiscoVLA to improve video-text retrieval using CLIP by addressing vision, language, and alignment discrepancies, achieving superior performance.

The parameter-efficient adaptation of the image-text pretraining model CLIP
for video-text retrieval is a prominent area of research. While CLIP is focused
on image-level vision-language matching, video-text retrieval demands
comprehensive understanding at the video level. Three key discrepancies emerge
in the transfer from image-level to video-level: vision, language, and
alignment. However, existing methods mainly focus on vision while neglecting
language and alignment. In this paper, we propose Discrepancy Reduction in
Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all
three discrepancies. Specifically, we introduce Image-Video Features Fusion to
integrate image-level and video-level features, effectively tackling both
vision and language discrepancies. Additionally, we generate pseudo image
captions to learn fine-grained image-level alignment. To mitigate alignment
discrepancies, we propose Image-to-Video Alignment Distillation, which
leverages image-level alignment knowledge to enhance video-level alignment.
Extensive experiments demonstrate the superiority of our DiscoVLA. In
particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous
methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is
available at https://github.com/LunarShen/DsicoVLA.

Source link

What's Hot

Creating uniquely human digital banking experiences at TD

C3 AI Stock Plunges After ‘Completely Unacceptable’ Q1 Sales – C3.ai (NYSE:AI)

Meta says its Llama AI models being used by banks, tech companies

Paper page – DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Paper page – Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Paper page – Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

Paper page – Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Midjourney Slams Lawsuit Filed by Disney to Prevent AI Training

Smithsonian Updates Museum Display on Impeachment To Include Trump

Funder Tried to Hijack Kandinsky Art Theft Suits, Says Collector

How to Stylize Your Images with Flux Kontext in ComfyUI

Creating uniquely human digital banking experiences at TD

C3 AI Stock Plunges After ‘Completely Unacceptable’ Q1 Sales – C3.ai (NYSE:AI)

Meta says its Llama AI models being used by banks, tech companies

What's Hot

Paper page – DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Related Posts

Subscribe to Updates