One Token to Fool LLM-as-a-Judge

Generative reward models using LLMs are vulnerable to superficial manipulations but can be improved with data augmentation strategies. AI-generated summary Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Source link

What's Hot

How Startups Can Win Talent War

Autonomy, Governance, and the New Risk Equation

Paper page – Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

One Token to Fool LLM-as-a-Judge

Paper page – Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Paper page – BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Paper page – From One to More: Contextual Part Latents for 3D Generation

Murujuga Rock Art in Australia Receives UNESCO World Heritage Status

‘Earth Room’ Caretaker Dies at 70

Homeland Security Targets Chicago’s National Museum of Puerto Rican Arts & Culture

1,600-Year-Old Tomb of Mayan City’s Founding King Discovered in Belize

How Startups Can Win Talent War

Autonomy, Governance, and the New Risk Equation

Paper page – Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

What's Hot

One Token to Fool LLM-as-a-Judge

Related Posts

Subscribe to Updates