VRAG-RL, a reinforcement learning framework, enhances reasoning and visual information handling in RAG methods by integrating visual perception tokens and employing specialized action spaces and rewards.
Effectively retrieving, reasoning over, and understanding visually rich information
remains a challenge for RAG methods. Traditional text-based methods cannot handle
visual information, while current vision-based RAG approaches are often constrained
by fixed pipelines and frequently struggle to reason effectively because they do not
fully activate the fundamental capabilities of the underlying models. As RL has
proven beneficial for model
reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex
reasoning across visually rich information. With this framework, VLMs interact
with search engines, autonomously sampling single-turn or multi-turn reasoning
trajectories with the help of visual perception tokens and undergoing continual
optimization based on these samples. Our approach addresses key limitations of
applying RL in RAG domains: (i) prior multi-modal RAG approaches tend to merely
incorporate images into the context, leading to insufficient reasoning-token
allocation and neglecting visual-specific perception; and (ii) when models
interact with search engines, their queries often fail to retrieve relevant
information because the models cannot articulate their information needs, leading
to suboptimal performance. To address these challenges, we define an action space
tailored for visually rich inputs, with actions including cropping and scaling,
allowing the model to gather information from a coarse-to-fine perspective.
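
To make the coarse-to-fine perception actions concrete, the following minimal Python
sketch shows how crop and scale actions on a retrieved page image might be applied.
It is an illustration under our own assumptions; the CropAction/ScaleAction names,
the normalized box convention, and the apply_action helper are not from the paper.

# Illustrative sketch (not the paper's implementation) of visual perception
# actions: the policy selects a crop or a rescale to inspect a retrieved page
# from coarse to fine. All names and fields here are assumptions.
from dataclasses import dataclass
from PIL import Image


@dataclass
class CropAction:
    box: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1) chosen by the VLM


@dataclass
class ScaleAction:
    factor: float  # factor > 1 enlarges the current view for finer inspection


def apply_action(image: Image.Image, action) -> Image.Image:
    """Return a new view of the page image after one perception action."""
    w, h = image.size
    if isinstance(action, CropAction):
        x0, y0, x1, y1 = action.box
        return image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    if isinstance(action, ScaleAction):
        return image.resize((int(w * action.factor), int(h * action.factor)))
    raise ValueError(f"unsupported action: {action!r}")


# Example rollout step: a coarse crop of the upper half of a page, then a 2x zoom.
page = Image.new("RGB", (1024, 1448), "white")  # stand-in for a retrieved document page
view = apply_action(page, CropAction(box=(0.0, 0.0, 1.0, 0.5)))
view = apply_action(view, ScaleAction(factor=2.0))
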
Furthermore, to bridge the gap between users’ original inquiries and the
retriever, we employ a simple yet effective reward that integrates query
rewriting and retrieval performance with a model-based reward (a sketch appears
below). Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL
strategies, aligning the model with real-world applications. The code is available at
https://github.com/Alibaba-NLP/VRAG.
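
As a rough illustration of the reward described above, the sketch below blends an
NDCG-style retrieval score for the (rewritten) query's results with a model-based
score of the final answer. The specific metric, the weighting alpha, and all
function names are our assumptions rather than the paper's exact formulation.

# Illustrative composite reward (assumed form, not the paper's exact definition):
# a retrieval term rewarding gold evidence pages ranked early, plus a
# model-based score of the generated answer in [0, 1].
import math


def retrieval_reward(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """NDCG-style score: relevant pages retrieved earlier earn more credit."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved_ids) if doc_id in gold_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(gold_ids), len(retrieved_ids))))
    return dcg / ideal if ideal > 0 else 0.0


def composite_reward(retrieved_ids: list[str], gold_ids: set[str],
                     answer_score: float, alpha: float = 0.5) -> float:
    """Blend retrieval quality with a model-based answer score."""
    return alpha * retrieval_reward(retrieved_ids, gold_ids) + (1.0 - alpha) * answer_score


# Example: two of three gold pages retrieved near the top, answer judged 0.8 by a reward model.
reward = composite_reward(["p3", "p7", "p1"], {"p3", "p1", "p9"}, answer_score=0.8)
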