When Judgment Becomes Noise: How Design Failures In LLM Judge Benchmarks Silently Undermine Validity - Takara TLDR

LLM-judged benchmarks are increasingly used to evaluate complex model
behaviors, yet their design introduces failure modes absent in conventional
ground-truth-based benchmarks. We argue that without tight objectives and
verifiable constructions, such benchmarks can produce high-confidence
rankings that are in fact largely noise. We introduce two mechanisms to
diagnose these issues. Schematic adherence quantifies how much of a judge’s
overall verdict is explained by the explicit evaluation schema, revealing
unexplained variance when judges deviate from their own rubric. Psychometric
validity aggregates internal consistency and discriminant validity signals to
quantify irreducible uncertainty in any benchmarking run. Applying these tools
to Arena-Hard Auto, we find severe schema incoherence and factor collapse
across popular judges: for example, unexplained variance exceeding 90 percent
for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We
also show that the Elo-style aggregation used by Arena-Hard Auto collapses and
masks genuine ranking uncertainty. Our results highlight design failures that
undermine validity and offer actionable principles for building better-scoped,
reliability-aware LLM-judged benchmarks. We release our code at
https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
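
To make the two diagnostics concrete, here is a minimal sketch in plain Python. It rests on assumptions, since the abstract does not specify the estimators: it reads "unexplained variance" as 1 - R^2 from an ordinary least-squares fit of the judge's overall verdict on its per-criterion scores, and "factor correlations" as pairwise Pearson correlations between criteria. The function names and the toy scores are hypothetical and are not taken from the released code.

```python
from math import sqrt
from statistics import fmean


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is non-singular (criteria are not perfectly collinear).
    """
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def unexplained_variance(criteria, overall):
    """Return 1 - R^2 for a linear fit of overall verdicts on criterion scores.

    criteria: one row per judged response, one column per rubric criterion.
    overall:  the judge's overall score for each response.
    """
    rows = [[1.0] + [float(v) for v in r] for r in criteria]  # add intercept
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, overall)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = fmean(overall)
    ss_res = sum((y - f) ** 2 for y, f in zip(overall, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in overall)
    return ss_res / ss_tot


def pearson(a, b):
    """Pearson correlation between two equal-length score lists."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def criterion_correlations(criteria):
    """Pairwise Pearson correlations between rubric criteria (columns)."""
    cols = list(zip(*criteria))
    return [[pearson(a, b) for b in cols] for a in cols]


# Toy data (hypothetical): five responses scored on two criteria.
toy_criteria = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
rubric_following = [2 * c1 + c2 for c1, c2 in toy_criteria]  # verdict follows rubric
off_rubric = [3, 1, 2, 5, 4]                                 # verdict drifts from rubric

adherent_gap = unexplained_variance(toy_criteria, rubric_following)
drifting_gap = unexplained_variance(toy_criteria, off_rubric)
corr = criterion_correlations(toy_criteria)
```

Under this reading, a judge whose overall verdict is a fixed function of its rubric scores yields an unexplained variance near zero, while values near one indicate that the verdict is driven by something other than the stated criteria. Off-diagonal correlations close to one suggest that nominally distinct criteria are measuring a single factor.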
