Paper Page - Revisiting Reinforcement Learning For LLM Reasoning From A Cross-Domain Perspective

Guru, a diverse RL reasoning corpus, highlights domain-specific training needs and demonstrates improved performance in complex tasks for RL-enhanced LLMs.

Reinforcement learning (RL) has emerged as a promising approach to improve
large language model (LLM) reasoning, yet most open efforts focus narrowly on
math and code, limiting our understanding of its broader applicability to
general reasoning. A key challenge lies in the lack of reliable, scalable RL
reward signals across diverse reasoning domains. We introduce Guru, a curated
RL reasoning corpus of 92K verifiable examples spanning six reasoning
domains (Math, Code, Science, Logic, Simulation, and Tabular), each built
through domain-specific reward design, deduplication, and filtering to ensure
reliability and effectiveness for RL training. Based on Guru, we systematically
revisit established findings in RL for LLM reasoning and observe significant
variation across domains. For example, while prior work suggests that RL
primarily elicits existing knowledge from pretrained models, our results reveal
a more nuanced pattern: domains frequently seen during pretraining (Math, Code,
Science) easily benefit from cross-domain RL training, while domains with
limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain
training to achieve meaningful performance gains, suggesting that RL is likely
to facilitate genuine skill acquisition. Finally, we present Guru-7B and
Guru-32B, two models that achieve state-of-the-art performance among open
models RL-trained with publicly available data, outperforming the best baselines
by 7.9% and 6.7%, respectively, on our 17-task evaluation suite across six
reasoning domains. We
also show that our models effectively improve the Pass@k performance of their
base models, particularly on complex tasks less likely to appear in pretraining
data. We release data, models, and training and evaluation code to facilitate
general-purpose reasoning at: https://github.com/LLM360/Reasoning360
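
The abstract reports Pass@k gains but does not say how Pass@k is computed. As a rough illustration only, the sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass verification. Whether the Guru evaluation code uses this exact estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of which are correct.

    Assumption: this is the common unbiased estimator
    1 - C(n - c, k) / C(n, k); it is not confirmed to be the method
    used in the Guru evaluation suite.
    """
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset of the
    # n samples must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs for a benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `pass_at_k(4, 1, 2)` considers a problem with four samples and one correct answer; three of the six possible pairs of samples miss the correct one, so the estimate is 0.5.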
