A benchmark evaluating multimodal models’ ability to interpret scientific schematic diagrams and answer related questions, revealing performance gaps and yielding insights for improvement.
This paper introduces MISS-QA, the first benchmark specifically designed to
evaluate the ability of models to interpret schematic diagrams within
scientific literature. MISS-QA comprises 1,500 expert-annotated examples across
465 scientific papers. In this benchmark, models are tasked with interpreting
schematic diagrams that illustrate research overviews and answering
corresponding information-seeking questions based on the broader context of the
paper. We assess the performance of 18 frontier multimodal foundation models,
including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. Our evaluation reveals a
significant performance gap between these models and human experts on MISS-QA. Our analysis
of model performance on unanswerable questions and our detailed error analysis
further highlight the strengths and limitations of current models, offering key
insights for improving models’ comprehension of multimodal scientific literature.