GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems In Visual Contexts - Takara TLDR

Vision language models (VLMs) achieve unified modeling of images and text,
enabling them to accomplish complex real-world tasks through perception,
planning, and reasoning. Among these tasks, reasoning is particularly
representative, and mathematical reasoning is a prominent example: it tests
the ability of VLMs to comprehend mathematical information in images and to
perform sophisticated reasoning over it. Recently,
numerous visual mathematical reasoning benchmarks have been proposed, but they
are often restricted to geometry, lack coverage of math word problems, and
rarely assess reasoning across multiple images. To address these gaps, we
introduce GSM8K-V, a purely visual multi-image mathematical reasoning
benchmark. GSM8K-V is built by systematically mapping each sample from the
widely used text-based GSM8K into visual form. Through a carefully designed
automated image-generation pipeline combined with meticulous human annotation,
we curate 1,319 high-quality samples. We evaluate a wide range of open-source
and closed-source models on GSM8K-V. Results show that although existing VLMs
have nearly saturated performance on text-based GSM8K, there remains
substantial room for improvement on GSM8K-V. For example, the best-performing
model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on
GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the
limitations of current models as well as potential directions for improvement.
GSM8K-V offers a new perspective on visual mathematical reasoning and
establishes a benchmark to guide the development of more robust and
generalizable VLMs.
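
To make the size of the reported gap concrete, the short sketch below works through the arithmetic on the figures quoted above (95.22% on GSM8K, 46.93% on GSM8K-V, 1,319 samples). The function name `accuracy_gap` is hypothetical, and the per-benchmark correct-answer counts it prints are back-of-envelope estimates derived from the percentages, assuming both accuracies are computed over 1,319 samples; they are not figures reported in the abstract.

```python
def accuracy_gap(text_acc, visual_acc, n_samples):
    """Compare two accuracies given in percent.

    Returns the absolute gap in percentage points, the relative drop as a
    fraction of the text accuracy, and the approximate number of correct
    answers each accuracy implies for n_samples (an estimate, assuming both
    accuracies are measured over the same n_samples).
    """
    gap = text_acc - visual_acc                 # percentage points
    relative_drop = gap / text_acc              # fraction of text accuracy lost
    approx_text = round(text_acc / 100 * n_samples)
    approx_visual = round(visual_acc / 100 * n_samples)
    return gap, relative_drop, approx_text, approx_visual


# Figures quoted in the abstract for Gemini-2.5-Pro.
gap, rel, n_text, n_visual = accuracy_gap(95.22, 46.93, 1319)
print(f"absolute gap: {gap:.2f} percentage points")
print(f"relative drop: {rel:.1%}")
print(f"implied correct answers (estimate): {n_text} vs {n_visual}")
```

Under these assumptions the gap is about 48 percentage points, meaning the best model loses roughly half of its text-only accuracy when the same problems are presented visually.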
