Paper page - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

VisionThink dynamically adjusts image resolution and visual token processing for efficient and effective vision-language tasks, improving performance on OCR tasks while reducing token usage in simpler tasks.

Recent advancements in vision-language models (VLMs) have improved
performance by increasing the number of visual tokens, whose sequences are
often significantly longer than text sequences. However, we observe that most real-world
scenarios do not require such an extensive number of visual tokens. While the
performance drops significantly in a small subset of OCR-related tasks, models
still perform accurately in most other general VQA tasks with only 1/4
resolution. Therefore, we propose to dynamically process distinct samples with
different resolutions, and present a new paradigm for visual token compression,
namely, VisionThink. It starts with a downsampled image and decides
whether that image is sufficient for problem solving. If it is not, the model outputs
a special token to request the higher-resolution image. Compared to existing
efficient VLM methods that compress tokens using fixed pruning ratios or
thresholds, VisionThink autonomously decides whether to compress tokens case by
case. As a result, it demonstrates strong fine-grained visual understanding
capability on OCR-related tasks while saving substantial visual tokens
on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge
strategy to successfully apply RL to general VQA tasks. Moreover, we carefully
design a reward function and penalty mechanism to achieve a stable and
reasonable image resize call ratio. Extensive experiments demonstrate the
superiority, efficiency, and effectiveness of our method. Our code is available
at https://github.com/dvlab-research/VisionThink.
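
The two-stage decision described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the marker string `RESIZE_REQUEST`, the `Image` type, the `downsample` helper, its factor of 2 per side, and the `toy_model` stand-in are all hypothetical names and values chosen for the example; the abstract says only that the model emits "a special token" to request the higher-resolution image.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical marker; the abstract does not name the actual special token.
RESIZE_REQUEST = "<request_high_res>"


@dataclass
class Image:
    """Minimal stand-in for an image: only its size matters here."""
    width: int
    height: int


def downsample(image: Image, factor: int = 2) -> Image:
    """Shrink each side by `factor` (the factor is an assumption)."""
    return Image(max(1, image.width // factor), max(1, image.height // factor))


def answer_with_adaptive_resolution(
    image: Image,
    question: str,
    model: Callable[[Image, str], str],
) -> Tuple[str, bool]:
    """Try the downsampled image first; fall back to full resolution
    only if the model's output contains the resize request marker.

    Returns (answer, used_high_res).
    """
    first_pass = model(downsample(image), question)
    if RESIZE_REQUEST not in first_pass:
        return first_pass, False
    # The model judged the low-resolution input insufficient.
    return model(image, question), True


def toy_model(image: Image, question: str) -> str:
    """Toy stand-in for a VLM: asks for more pixels on 'read' questions
    when the image is small, otherwise answers directly."""
    if "read" in question and image.width < 500:
        return RESIZE_REQUEST
    return f"answer@{image.width}x{image.height}"
```

With this toy model, a question such as "what color is the car" is answered from the downsampled image alone, while "read the sign" triggers a second pass on the original image. How often the real model makes that request is what the paper's reward and penalty design is meant to control; those details are not given in the abstract and are not modeled here.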
