Reinforcement learning with verifiable rewards (RLVR) improves reasoning in
large language models (LLMs) but struggles with exploration, an issue that
persists for multimodal LLMs (MLLMs). Current methods treat the visual
input as a fixed, deterministic condition, overlooking a critical source of
ambiguity and failing to build policies that are robust to plausible visual
variations. We introduce $\textbf{VOGUE (Visual Uncertainty Guided
Exploration)}$, a novel method that shifts exploration from the output (text) space
to the input (visual) space. By treating the image as a stochastic context,
VOGUE quantifies the policy’s sensitivity to visual perturbations using the
symmetric KL divergence between a “raw” and “noisy” branch, creating a direct
signal for uncertainty-aware exploration. This signal shapes the learning
objective via an uncertainty-proportional bonus, which, combined with a
token-entropy bonus and an annealed sampling schedule, effectively balances
exploration and exploitation. Implemented within GRPO on two model scales
(Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three
visual math benchmarks and 3.7% on three general-domain reasoning benchmarks,
while also increasing pass@4 performance and mitigating the
exploration decay commonly observed in RL fine-tuning. Our work shows that
grounding exploration in the inherent uncertainty of visual inputs is an
effective strategy for improving multimodal reasoning.
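
For concreteness, a minimal sketch of the uncertainty signal and the shaped objective is given below; the notation, the exact form of the perturbed image $\tilde{v}$, and the bonus coefficients are illustrative assumptions, not the paper's stated formulation.

$$
U(x, v) \;=\; \tfrac{1}{2}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x, v)\,\big\|\,\pi_\theta(\cdot \mid x, \tilde{v})\big) \;+\; \mathrm{KL}\big(\pi_\theta(\cdot \mid x, \tilde{v})\,\big\|\,\pi_\theta(\cdot \mid x, v)\big)\Big],
\qquad
\mathcal{J}(\theta) \;=\; \mathcal{J}_{\mathrm{GRPO}}(\theta) \;+\; \alpha\, U(x, v) \;+\; \beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid x, v)\big),
$$

where $v$ is the raw image, $\tilde{v}$ a perturbed ("noisy") copy of it, $x$ the text prompt, $\mathcal{H}$ the token-level entropy, and $\alpha, \beta \ge 0$ hypothetical bonus weights; the annealed sampling schedule mentioned above is not captured in this sketch.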