RAVENEA is a retrieval-augmented benchmark for visual culture understanding in VLMs; across its culture-focused tasks, models augmented with culture-aware retrieval outperform their non-augmented counterparts on a range of metrics.
As vision-language models (VLMs) become increasingly integrated into daily
life, the need for accurate visual culture understanding grows critical.
Yet these models frequently fall short in interpreting cultural nuances.
Prior work has demonstrated the effectiveness of retrieval-augmented
generation (RAG) in enhancing cultural understanding in text-only settings,
but its application to multimodal scenarios remains underexplored. To bridge
this gap, we introduce RAVENEA (Retrieval-Augmented
Visual culturE uNdErstAnding), a new benchmark designed to advance visual
culture understanding through retrieval, focusing on two tasks: culture-focused
visual question answering (cVQA) and culture-informed image captioning (cIC).
RAVENEA extends existing datasets by integrating over 10,000 Wikipedia
documents curated and ranked by human annotators. With RAVENEA, we train and
evaluate seven multimodal retrievers that rank these documents for each image
query, and measure the downstream impact of retrieval-augmented inputs across fourteen
state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented
with culture-aware retrieval, outperform their non-augmented counterparts (by
at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the
value of retrieval-augmented methods and culturally inclusive benchmarks for
multimodal understanding.
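As an illustrative sketch only (not the paper's actual pipeline or code), the retrieval-augmented setup described above can be pictured as: rank candidate Wikipedia documents against an image query, then prepend the top-ranked document to the VLM prompt for culture-focused VQA. All names, embeddings, and documents below are hypothetical placeholders, assuming a generic embedding-similarity retriever.

```python
# Hypothetical sketch of retrieval-augmented cVQA input construction.
# Dummy embeddings stand in for a multimodal retriever; the paper's
# retrievers and VLMs are not reproduced here.

from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_documents(image_embedding, documents):
    """Rank candidate Wikipedia documents by similarity to the image query."""
    return sorted(
        documents,
        key=lambda d: cosine(image_embedding, d["embedding"]),
        reverse=True,
    )


def build_cvqa_prompt(question, top_doc):
    """Prepend the top-ranked document as cultural context for the VLM prompt."""
    return (
        f"Context (retrieved Wikipedia excerpt): {top_doc['text']}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy example: two candidate documents and a dummy image embedding.
docs = [
    {"title": "Songkran", "text": "Songkran is the Thai New Year festival...",
     "embedding": [0.9, 0.1, 0.0]},
    {"title": "Holi", "text": "Holi is a Hindu festival of colours...",
     "embedding": [0.2, 0.8, 0.1]},
]
image_emb = [0.85, 0.15, 0.05]  # would come from a multimodal encoder in practice

best = rank_documents(image_emb, docs)[0]
print(build_cvqa_prompt("Which festival is shown in the image?", best))
```

The retrieved context is what a lightweight VLM would consume alongside the image; the abstract's reported gains compare such augmented inputs against the same models prompted without retrieval.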