RStar2-Agent: Agentic Reasoning Technical Report - Takara TLDR

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic
reinforcement learning to achieve frontier-level performance. Beyond current
long CoT, the model demonstrates advanced cognitive behaviors, such as thinking
carefully before using Python coding tools and reflecting on code execution
feedback to autonomously explore, verify, and refine intermediate steps in
complex problem-solving. This capability is enabled through three key
innovations that makes agentic RL effective at scale: (i) an efficient RL
infrastructure with a reliable Python code environment that supports
high-throughput execution and mitigates the high rollout costs, enabling
training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic
RL algorithm with a Resample-on-Correct rollout strategy that addresses the
inherent environment noises from coding tools, allowing the model to reason
more effectively in a code environment; (iii) An efficient agent training
recipe that starts with non-reasoning SFT and progresses through multi-RL
stages, yielding advanced cognitive abilities with minimal compute cost. To
this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in
only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on
AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly
shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates
strong generalization to alignment, scientific reasoning, and agentic tool-use
tasks. Code and training recipes are available at
https://github.com/microsoft/rStar.

Source link

What's Hot

How DeepSeek AI will reshape Nigeria’s business landscape

OpenAI and Anthropic evaluated each others’ models – which ones came out on top

Data Reveals AI Search Dominance Is False Narrative, So Far 08/28/2025

rStar2-Agent: Agentic Reasoning Technical Report – Takara TLDR

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning – Takara TLDR

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection – Takara TLDR

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning – Takara TLDR

Woodmere Art Museum Sues Trump Administration Over Canceled IMLS Grant

Barbara Gladstone’s Chelsea Townhouse in NYC Sells for $13.1 M.

Trump Meets with Smithsonian Leader Amid Threats of Content Review

Australian School Faces Pushback over AI Art Course—and More Art News

How DeepSeek AI will reshape Nigeria’s business landscape

OpenAI and Anthropic evaluated each others’ models – which ones came out on top

Data Reveals AI Search Dominance Is False Narrative, So Far 08/28/2025

What's Hot

rStar2-Agent: Agentic Reasoning Technical Report – Takara TLDR

Related Posts

Subscribe to Updates