GeoPQA: Bridging The Visual Perception Gap In MLLMs For Geometric Reasoning - Takara TLDR

Recent advancements in reinforcement learning (RL) have enhanced the
reasoning abilities of large language models (LLMs), yet the impact on
multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like
geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate
reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps
the benefits of reasoning training. To quantify this, we design a
Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric
concepts and spatial relationships. Experiments on GeoPQA reveal significant
shortcomings of MLLMs in visual perception, which constrain RL reward signals
for effective training. To address this bottleneck, we propose a two-stage RL
training framework that first enhances the visual perception of geometric
structures and then fosters reasoning capabilities. Applied to
Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by
9.7% and geometric problem solving by 9.1%, compared to the direct reasoning
training approach. Our method also generalizes to other vision-intensive
domains like figure understanding, highlighting the importance of perceptual
grounding in effective MLLM reasoning.
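
The two-stage idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exact-match reward rule, and the epoch-based ordering are all assumptions made for illustration, since the abstract specifies neither the reward design nor the RL algorithm. The sketch only shows the ordering constraint (perception items are trained on before reasoning items) and a simple verifiable reward of the kind RL setups commonly use.

```python
from typing import Iterator, Sequence, Tuple


def exact_match_reward(prediction: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the normalized answers match, else 0.0."""
    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences do not change the reward.
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def two_stage_schedule(
    perception_items: Sequence[str],
    reasoning_items: Sequence[str],
    perception_epochs: int = 1,
    reasoning_epochs: int = 1,
) -> Iterator[Tuple[str, str]]:
    """Yield (stage, item) pairs: all perception passes first, then reasoning.

    The epoch counts are illustrative parameters, not values from the paper.
    """
    for _ in range(perception_epochs):
        for item in perception_items:
            yield ("perception", item)
    for _ in range(reasoning_epochs):
        for item in reasoning_items:
            yield ("reasoning", item)
```

In a real pipeline each yielded item would be fed to an RL update step that scores the model's answer with a reward such as the one above; that step is omitted here because the abstract does not describe it.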
