Benchmarks shape progress in AI research. A useful benchmark should be both
difficult and realistic: questions should challenge frontier models while also
reflecting real-world usage. Yet, current paradigms face a difficulty-realism
tension: exam-style benchmarks are often artificially difficult yet of
limited real-world value, while benchmarks based on real user interactions often
skew toward easy, high-frequency problems. In this work, we explore a radically
different paradigm: assessing models on unsolved questions. Rather than a
static benchmark scored once, we curate unsolved questions and evaluate models
asynchronously over time with validator-assisted screening and community
verification. We introduce UQ, a testbed of 500 challenging, diverse questions
sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi
and history, probing capabilities including reasoning, factuality, and
browsing. UQ is difficult and realistic by construction: unsolved questions are
often hard and naturally arise when humans seek answers, so solving them
yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset
and its collection pipeline combining rule-based filters, LLM judges, and human
review to ensure question quality (e.g., well-defined and difficult); (2)
UQ-Validators, compound validation strategies that leverage the
generator-validator gap to provide evaluation signals and pre-screen candidate
solutions for human review; and (3) UQ-Platform, an open platform where experts
collectively verify questions and solutions. The top model passes UQ-validation
on only 15% of questions, and preliminary human verification has already
identified correct answers among those that passed. UQ charts a path for
evaluating frontier models on real-world, open-ended challenges, where success
pushes the frontier of human knowledge. We release UQ at
https://uq.stanford.edu.
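As a rough illustration of how validator-assisted screening might be composed (a minimal sketch, not the paper's implementation), the snippet below chains rule-based filters with several LLM-judge votes and escalates only candidates that clear both stages to human review. All names here (`Candidate`, `rule_filters`, `compound_validate`, `min_votes`, the stub judges) are hypothetical placeholders; real judges would query frontier models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A candidate solution to an unsolved question (hypothetical structure)."""
    question: str
    answer: str


def rule_filters(c: Candidate) -> bool:
    """Cheap screening rules; real filters would check length, format, citations, etc."""
    return len(c.answer.strip()) > 50 and c.answer.strip() != c.question.strip()


def compound_validate(
    c: Candidate,
    judge_fns: List[Callable[[str, str], bool]],  # each judge returns True if the answer looks correct
    min_votes: int = 2,
) -> str:
    """Return 'reject', or 'human_review' when the candidate passes compound validation."""
    if not rule_filters(c):
        return "reject"
    votes = sum(judge(c.question, c.answer) for judge in judge_fns)
    return "human_review" if votes >= min_votes else "reject"


if __name__ == "__main__":
    # Stub judges standing in for LLM calls, purely for demonstration.
    judges = [
        lambda q, a: "because" in a.lower(),  # answer gives a justification
        lambda q, a: len(a) > 100,            # answer is substantive
        lambda q, a: True,                    # permissive placeholder judge
    ]
    cand = Candidate(
        question="Is there a closed form for this recurrence?",
        answer="Yes: expanding the recurrence and telescoping gives a closed form "
               "because each term cancels with the next, yielding a geometric sum.",
    )
    print(compound_validate(cand, judges))  # prints 'human_review' with these stubs
```

The design point this sketch conveys is the division of labor described above: automated validators provide a cheap, scalable pre-screening signal, while final verification of anything that passes remains with human experts on the platform.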