Paper Page - RAVENEA: A Benchmark For Multimodal Retrieval-Augmented Visual Culture Understanding

RAVENEA is a retrieval-augmented benchmark for visual culture understanding in VLMs. On its two culture-focused tasks, lightweight VLMs augmented with culture-aware retrieval outperform their non-augmented counterparts.

As vision-language models (VLMs) become increasingly integrated into daily
life, the need for accurate visual culture understanding is becoming critical.
Yet, these models frequently fall short in interpreting cultural nuances
effectively. Prior work has demonstrated the effectiveness of
retrieval-augmented generation (RAG) in enhancing cultural understanding in
text-only settings, while its application in multimodal scenarios remains
underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented
Visual culturE uNdErstAnding), a new benchmark designed to advance visual
culture understanding through retrieval, focusing on two tasks: culture-focused
visual question answering (cVQA) and culture-informed image captioning (cIC).
RAVENEA extends existing datasets by integrating over 10,000 Wikipedia
documents curated and ranked by human annotators. With RAVENEA, we train and
evaluate seven multimodal retrievers for each image query, and measure the
downstream impact of retrieval-augmented inputs across fourteen
state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented
with culture-aware retrieval, outperform their non-augmented counterparts (by
at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the
value of retrieval-augmented methods and culturally inclusive benchmarks for
multimodal understanding.
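
To make the retrieve-then-answer setup described above concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the embedding vectors, the document entries, and the helper names (`cosine`, `retrieve`, `build_prompt`) are all hypothetical and chosen only for illustration. A real system would obtain the vectors from a trained multimodal retriever and pass the prompt, together with the image, to a VLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, docs, k=2):
    """Return the k documents most similar to the image query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Prepend the retrieved documents to the question as numbered context."""
    context = "\n".join(
        f"[{i + 1}] {d['title']}: {d['text']}" for i, d in enumerate(retrieved)
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy corpus with made-up 3-dimensional embeddings (assumption, not real data).
docs = [
    {"title": "Hanbok", "text": "Traditional Korean clothing.", "vec": [0.9, 0.1, 0.0]},
    {"title": "Kimono", "text": "Traditional Japanese garment.", "vec": [0.1, 0.9, 0.0]},
    {"title": "Sari", "text": "Garment from the Indian subcontinent.", "vec": [0.0, 0.1, 0.9]},
]

# Hypothetical embedding of the query image.
image_vec = [0.8, 0.2, 0.1]

top = retrieve(image_vec, docs, k=2)
prompt = build_prompt("Which culture is this outfit associated with?", top)
print(prompt)
```

The design choice illustrated here is that retrieval and generation stay decoupled: the retriever only ranks documents, so different retrievers and different VLMs can be swapped in and compared independently.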
