
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

By Advanced AI Editor | April 10, 2025

[Submitted on 29 Mar 2024 (v1), last revised 9 Apr 2025 (this version, v2)]

Paper: Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models, by Atsuyuki Miyai and 9 other authors

Abstract: This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but a correct choice does not guarantee that the LMM truly comprehends the answer. UPD assesses the LMM’s ability to withhold answers when encountering unsolvable MCQA problems, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases such as a missing correct answer, incompatible choices, and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improve performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs.
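To make the idea of withholding an answer concrete, here is a minimal Python sketch of an AAD-style check. It is an illustration only, not the paper's code or the MM-UPD Bench format: the `MCQ` class, `make_aad_variant`, `is_correct`, and `score_pair` are hypothetical names, and representing a withheld answer as `None` is an assumption made for this example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MCQ:
    """Hypothetical multiple-choice item; not the benchmark's real schema."""
    question: str
    options: tuple[str, ...]
    answer_index: Optional[int]  # None means no listed option is correct


def make_aad_variant(item: MCQ) -> MCQ:
    """Drop the correct option so that no listed choice is right (AAD-style)."""
    if item.answer_index is None:
        raise ValueError("item is already unsolvable")
    remaining = tuple(
        opt for i, opt in enumerate(item.options) if i != item.answer_index
    )
    return replace(item, options=remaining, answer_index=None)


def is_correct(item: MCQ, predicted_index: Optional[int]) -> bool:
    """A prediction of None means the model withheld an answer (assumption)."""
    return predicted_index == item.answer_index


def score_pair(
    standard: MCQ,
    standard_pred: Optional[int],
    unsolvable: MCQ,
    unsolvable_pred: Optional[int],
) -> dict[str, bool]:
    """Score one solvable item together with its unsolvable counterpart."""
    std_ok = is_correct(standard, standard_pred)
    upd_ok = is_correct(unsolvable, unsolvable_pred)
    return {"standard": std_ok, "withheld": upd_ok, "both": std_ok and upd_ok}


item = MCQ("What animal is shown?", ("cat", "dog", "bird"), answer_index=1)
aad_item = make_aad_variant(item)

# The model answers the original correctly but still picks an option
# on the unsolvable variant instead of withholding.
result = score_pair(item, 1, aad_item, 0)
```

In this sketch a model gets credit on the pair only if it answers the solvable question correctly and also declines to answer the variant with the correct option removed, which is the behavior the abstract describes as robust understanding.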

Submission history

From: Atsuyuki Miyai
[v1] Fri, 29 Mar 2024 17:59:53 UTC (5,256 KB)
[v2] Wed, 9 Apr 2025 17:13:27 UTC (10,193 KB)
