StatEval: A Comprehensive Benchmark For Large Language Models In Statistics - Takara TLDR

Large language models (LLMs) have demonstrated remarkable advances in
mathematical and logical reasoning, yet statistics, as a distinct and
integrative discipline, remains underexplored in benchmarking efforts. To
address this gap, we introduce StatEval, the first comprehensive
benchmark dedicated to statistics, spanning both breadth and depth across
difficulty levels. StatEval consists of 13,817 foundational problems covering
undergraduate and graduate curricula, together with 2,374 research-level proof
tasks extracted from leading journals. To construct the benchmark, we design a
scalable multi-agent pipeline with human-in-the-loop validation that automates
large-scale problem extraction, rewriting, and quality control, while ensuring
academic rigor. We further propose a robust evaluation framework tailored to
both computational and proof-based tasks, enabling fine-grained assessment of
reasoning ability. Experimental results reveal that closed-source models
such as GPT5-mini achieve below 57% on research-level problems, while
open-source models perform significantly lower. These findings highlight the
unique challenges of statistical reasoning and the limitations of current LLMs.
We expect StatEval to serve as a rigorous benchmark for advancing statistical
intelligence in large language models. All data and code are available on our
web platform: https://stateval.github.io/.
