ReDit (Reward Dithering) adds small random noise to discrete rule-based rewards, yielding smoother gradient updates and faster convergence than standard methods.
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While it is a "perfect"
reward system that effectively mitigates reward hacking, such reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomalies, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% of the training steps and, furthermore, still
achieves a 4% performance improvement over vanilla GRPO when trained for a
similar duration. Visualizations confirm significant mitigation of gradient issues with
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
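To make the core idea concrete, below is a minimal sketch of reward dithering in a GRPO-style setting. It assumes zero-mean Gaussian noise and the hypothetical helper names `rule_based_reward`, `dither_reward`, and the scale `sigma`; the abstract does not specify the exact noise distribution or magnitude used in the paper.

```python
import numpy as np

def rule_based_reward(is_correct: bool) -> float:
    """Discrete rule-based reward: 1.0 for a verified-correct answer, 0.0 otherwise."""
    return 1.0 if is_correct else 0.0

def dither_reward(reward: float, sigma: float = 0.05, rng=None) -> float:
    """Perturb a discrete reward with zero-mean Gaussian noise (reward dithering).

    The noise leaves the expected reward unchanged but makes the signal
    continuous, so group-normalized advantages rarely collapse to zero.
    """
    rng = rng or np.random.default_rng()
    return reward + rng.normal(0.0, sigma)

# Example: a GRPO-style group in which every sampled completion is wrong.
# With purely discrete rewards, all group advantages are identically zero
# (a flat reward region); with dithered rewards they stay non-degenerate
# and still provide an exploratory gradient signal.
rng = np.random.default_rng(0)
group_rewards = np.array(
    [dither_reward(rule_based_reward(False), sigma=0.05, rng=rng) for _ in range(8)]
)
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)
```

In this sketch, the design choice is simply to inject noise at the reward level rather than modify the policy-gradient estimator itself, which is consistent with the abstract's description of dithering the discrete reward signal before optimization.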