Large language models (LLMs) have demonstrated remarkable advances in
mathematical and logical reasoning, yet statistics, as a distinct and
integrative discipline, remains underexplored in benchmarking efforts. To
address this gap, we introduce \textbf{StatEval}, the first comprehensive
benchmark dedicated to statistics, spanning both breadth and depth across
difficulty levels. StatEval consists of 13,817 foundational problems covering
undergraduate and graduate curricula, together with 2374 research-level proof
tasks extracted from leading journals. To construct the benchmark, we design a
scalable multi-agent pipeline with human-in-the-loop validation that automates
large-scale problem extraction, rewriting, and quality control, while ensuring
academic rigor. We further propose a robust evaluation framework tailored to
both computational and proof-based tasks, enabling fine-grained assessment of
reasoning ability. Experimental results reveal that while closed-source models
such as GPT5-mini achieve below 57\% on research-level problems, with
open-source models performing significantly lower. These findings highlight the
unique challenges of statistical reasoning and the limitations of current LLMs.
We expect StatEval to serve as a rigorous benchmark for advancing statistical
intelligence in large language models. All data and code are available on our
web platform: https://stateval.github.io/.