A Survey On Large Language Model Benchmarks - Takara TLDR

In recent years, with the rapid development of the depth and breadth of large
language models’ capabilities, various corresponding evaluation benchmarks have
been emerging in increasing numbers. As a quantitative assessment tool for
model performance, benchmarks are not only a core means to measure model
capabilities but also a key element in guiding the direction of model
development and promoting technological innovation. We systematically review
the current status and development of large language model benchmarks for the
first time, categorizing 283 representative benchmarks into three categories:
general capabilities, domain-specific, and target-specific. General capability
benchmarks cover aspects such as core linguistics, knowledge, and reasoning;
domain-specific benchmarks focus on fields like natural sciences, humanities
and social sciences, and engineering technology; target-specific benchmarks pay
attention to risks, reliability, agents, etc. We point out that current
benchmarks have problems such as inflated scores caused by data contamination,
unfair evaluation due to cultural and linguistic biases, and lack of evaluation
on process credibility and dynamic environments, and provide a referable design
paradigm for future benchmark innovation.

Source link

What's Hot

Jensen Huang says China is ‘nanoseconds behind’ the US in chipmaking, calls for reducing US export restrictions on Nvidia’s AI chips

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization – Takara TLDR

Startup founded by former DeepMind researchers Reflection AI raises $2 billion

A Survey on Large Language Model Benchmarks – Takara TLDR

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization – Takara TLDR

R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation – Takara TLDR

ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation – Takara TLDR

Frieze to Launch Abu Dhabi Fair in November 2026

Jeff Koons Returns to Gagosian with First New York Show in Seven Years

Ancient Egyptian Iconography Found in Roman-Era Bathhouse in Turkey

London Gallery Harlesden High Street Goes to Mayfair For a Pop-up

Jensen Huang says China is ‘nanoseconds behind’ the US in chipmaking, calls for reducing US export restrictions on Nvidia’s AI chips

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization – Takara TLDR

Startup founded by former DeepMind researchers Reflection AI raises $2 billion

What's Hot

A Survey on Large Language Model Benchmarks – Takara TLDR

Related Posts

Subscribe to Updates