Paper page - SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

An enhanced multimodal language model incorporates thinking process rewards to improve reasoning and generalization, achieving superior performance on benchmarks compared to larger models.

Recent advances have shown success in eliciting strong reasoning abilities in
multimodal large language models (MLLMs) through rule-based reinforcement
learning (RL) with outcome rewards. However, this paradigm typically lacks
supervision over the thinking process leading to the final outcome. As a result,
the model may learn sub-optimal reasoning strategies, which can hinder its
generalization ability. In light of this, we propose SophiaVL-R1, as an attempt
to add reward signals for the thinking process in this paradigm. To achieve
this, we first train a thinking reward model that evaluates the quality of the
entire thinking process. Given that the thinking reward may be unreliable for
certain samples due to reward hacking, we propose the Trust-GRPO method, which
assigns a trustworthiness weight to the thinking reward during training. This
weight is computed based on the thinking reward comparison of responses leading
to correct answers versus incorrect answers, helping to mitigate the impact of
potentially unreliable thinking rewards. Moreover, we design an annealing
training strategy that gradually reduces the thinking reward over time,
allowing the model to rely more on the accurate rule-based outcome reward in
later training stages. Experiments show that our SophiaVL-R1 surpasses a series
of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU),
demonstrating strong reasoning and generalization capabilities. Notably, our
SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite
the latter having 10 times more parameters. All code, models, and datasets are
made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
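
The abstract describes the trustworthiness weight and the annealing schedule only in words, so the sketch below is one plausible reading rather than the authors' implementation. It assumes the weight is 1 when responses with correct answers receive an average thinking reward at least as high as responses with incorrect answers, and decays exponentially with the gap otherwise. It also assumes an exponential annealing schedule. The names `trust_weight`, `anneal`, `combined_rewards`, `alpha`, and `decay_steps` are hypothetical and do not come from the paper or its repository.

```python
import math
from statistics import mean


def trust_weight(thinking_rewards, is_correct):
    """Assumed form of the trustworthiness weight for one group of responses."""
    correct = [r for r, c in zip(thinking_rewards, is_correct) if c]
    wrong = [r for r, c in zip(thinking_rewards, is_correct) if not c]
    # With only one kind of response there is nothing to compare against,
    # so this sketch falls back to full trust (an assumption).
    if not correct or not wrong:
        return 1.0
    gap = mean(correct) - mean(wrong)
    # If wrong-answer responses score higher on thinking, the thinking
    # reward looks unreliable for this group and is down-weighted.
    return 1.0 if gap >= 0 else math.exp(gap)


def anneal(step, decay_steps):
    """Assumed exponential schedule that shrinks the thinking reward over time."""
    return math.exp(-step / decay_steps)


def combined_rewards(outcome_rewards, thinking_rewards, step,
                     alpha=0.3, decay_steps=100.0):
    """Outcome reward plus a trust-weighted, annealed thinking reward."""
    # Assumption: a positive rule-based outcome reward marks a correct answer.
    is_correct = [r > 0 for r in outcome_rewards]
    scale = alpha * trust_weight(thinking_rewards, is_correct) * anneal(step, decay_steps)
    return [o + scale * t for o, t in zip(outcome_rewards, thinking_rewards)]
```

Under these assumptions, a group in which incorrect responses receive higher thinking rewards than correct ones has its thinking reward scaled down, and the thinking term shrinks for every group as `step` grows, leaving the rule-based outcome reward to dominate late in training.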
