Pref-GRPO: Pairwise Preference Reward-based GRPO For Stable Text-to-Image Reinforcement Learning - Takara TLDR

Recent advancements highlight the importance of GRPO-based reinforcement
learning methods and benchmarking in enhancing text-to-image (T2I) generation.
However, current methods that score generated images with pointwise reward
models (RMs) are susceptible to reward hacking. We show that this happens
when minimal score differences between images are amplified after
normalization, creating illusory advantages that drive the model to
over-optimize for trivial gains, ultimately destabilizing the image generation
process. To address this, we propose Pref-GRPO, a pairwise preference
reward-based GRPO method that shifts the optimization objective from score
maximization to preference fitting, ensuring more stable training. In
Pref-GRPO, images are compared pairwise within each group using a preference
RM, and the win rate is used as the reward signal. Extensive experiments
demonstrate that Pref-GRPO distinguishes subtle differences in image quality,
providing more stable advantages and mitigating reward hacking. Additionally,
existing T2I benchmarks are limited by coarse evaluation criteria, hindering
comprehensive model assessment. To solve this, we introduce UniGenBench, a
unified T2I benchmark comprising 600 prompts across 5 main themes and 20
subthemes. It evaluates semantic consistency through 10 primary and 27
sub-criteria, leveraging a multimodal large language model (MLLM) for
benchmark construction and evaluation. Our benchmarks uncover the strengths
and weaknesses of both open-source and closed-source T2I models and validate
the effectiveness of Pref-GRPO.
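
The two mechanisms described above can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes the usual GRPO-style group normalization (subtract the group mean, divide by the group standard deviation) and treats the win rate of an image as the fraction of the other images in its group that it beats. The function names, the `eps` constant, and the `prefer` callback (a stand-in for a preference RM) are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; eps only guards against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def win_rate_rewards(items, prefer):
    """Reward each item with the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a hypothetical stand-in for a preference RM and
    returns True when `a` is preferred over `b`.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items per group")
    rewards = []
    for i in range(n):
        wins = sum(
            1 for j in range(n) if j != i and prefer(items[i], items[j])
        )
        rewards.append(wins / (n - 1))
    return rewards


# Pointwise scores that differ only in the third decimal place...
pointwise_scores = [0.850, 0.851, 0.852, 0.853]
# ...still spread to roughly -1.34 .. +1.34 after normalization.
print(group_advantages(pointwise_scores))

# Toy group: integers stand in for images, and "larger is preferred"
# stands in for the preference RM's verdict.
toy_group = [3, 1, 2, 0]
print(win_rate_rewards(toy_group, lambda a, b: a > b))
```

The first print shows the amplification effect: a total score spread of 0.003 becomes advantages larger than 1 in magnitude, because the normalization rescales by the equally small standard deviation. The second print shows only how a win-rate reward is computed. The toy comparator is deterministic, so it does not reproduce the stability results reported in the abstract.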
