One Token To Fool LLM-as-a-Judge

Generative reward models using LLMs are vulnerable to superficial manipulations but can be improved with data augmentation strategies. AI-generated summary Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Source link

What's Hot

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection – Takara TLDR

Doubts on AI and Nvidia in an uncertain economy : NPR

One Token to Fool LLM-as-a-Judge

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection – Takara TLDR

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning – Takara TLDR

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD – Takara TLDR

Woodmere Art Museum Sues Trump Administration Over Canceled IMLS Grant

Barbara Gladstone’s Chelsea Townhouse in NYC Sells for $13.1 M.

Trump Meets with Smithsonian Leader Amid Threats of Content Review

Australian School Faces Pushback over AI Art Course—and More Art News

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection – Takara TLDR

Doubts on AI and Nvidia in an uncertain economy : NPR

What's Hot

One Token to Fool LLM-as-a-Judge

Related Posts

Subscribe to Updates