Paper page - Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unifying paradigm has emerged: Learning from Rewards, in which reward signals act as guiding stars that steer LLM behavior. It underpins a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback, which endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across the training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
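
As a rough illustration of what reward guidance at the inference stage can look like, the sketch below shows best-of-N selection: several candidate responses are scored by a reward function and the highest-scoring one is returned. This is only one simple instance of the idea and is not taken from the survey. The names `best_of_n` and `toy_reward` are hypothetical, and the toy scoring rule stands in for a learned reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (ties go to the earliest)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=reward_fn)


def toy_reward(text: str) -> float:
    # Hypothetical stand-in for a reward model (an assumption for illustration):
    # it prefers answers containing "4" and mildly penalizes length.
    correctness = 1.0 if "4" in text else 0.0
    return correctness - 0.01 * len(text)


# In practice these would be N samples drawn from an LLM for the same prompt.
candidates = [
    "2 + 2 = 5",
    "2 + 2 = 4",
    "The answer to 2 + 2 is probably 4, I think.",
]
best = best_of_n(candidates, toy_reward)
print(best)
```

Here the reward function only ranks finished outputs. Training-time methods such as RLHF or GRPO would instead use reward signals to update the model's parameters.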
