MMR1: Enhancing Multimodal Reasoning With Variance-Aware Sampling And Open Resources - Takara TLDR

Large multimodal reasoning models have achieved rapid progress, but their
advancement is constrained by two major limitations: the absence of open,
large-scale, high-quality long chain-of-thought (CoT) data, and the instability
of reinforcement learning (RL) algorithms in post-training. Group Relative
Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone
to gradient vanishing when reward variance is low, which weakens optimization
signals and impairs convergence. This work makes three contributions: (1) We
propose Variance-Aware Sampling (VAS), a data selection strategy guided by
a Variance Promotion Score (VPS) that combines outcome variance and trajectory
diversity to promote reward variance and stabilize policy optimization. (2) We
release large-scale, carefully curated resources comprising ~1.6M long CoT
cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty,
and diversity, along with a fully reproducible end-to-end training codebase.
(3) We open-source a family of multimodal reasoning models at multiple scales,
establishing standardized baselines for the community. Experiments across
mathematical reasoning benchmarks demonstrate the effectiveness of both the
curated data and the proposed VAS. Comprehensive ablation studies and analyses
provide further insight into the contributions of each component. In addition,
we theoretically establish that reward variance lower-bounds the expected
policy gradient magnitude, with VAS serving as a practical mechanism to realize
this guarantee. Our code, data, and checkpoints are available at
https://github.com/LengSicong/MMR1.
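
The low-variance failure mode described above can be illustrated with a small sketch. It assumes the commonly used GRPO formulation in which each rollout's reward is standardized within its prompt's group; the function names, the `eps` constant, and the toy reward lists are hypothetical and are not taken from the MMR1 codebase. The abstract does not give the exact VPS formula, so the sketch covers only the outcome-variance idea, not trajectory diversity.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def outcome_variance(rewards):
    """Population variance of a group's rewards.

    For binary rewards with pass rate p this equals p * (1 - p).
    """
    return statistics.pvariance(rewards)


# Hypothetical groups of 8 rollouts with binary correctness rewards.
all_correct = [1.0] * 8
mixed = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

# Identical rewards: every advantage is zero, so the prompt contributes
# no learning signal to the policy update.
print(group_relative_advantages(all_correct))

# A 50% pass rate gives the largest variance a binary reward can have.
print(outcome_variance(mixed))  # 0.25
```

Under these assumptions, a prompt that the policy always solves or always fails yields zero advantages, while a prompt with a mixed pass rate yields non-zero ones, which is the motivation for preferring prompts with higher reward variance during sampling.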
