AutoCodeBench: Large Language Models Are Automatic Code Benchmark Generators - Takara TLDR

Large Language Models (LLMs) have demonstrated remarkable capabilities across
various domains, with code generation emerging as a key area of focus. While
numerous benchmarks have been proposed to evaluate their code generation
abilities, these benchmarks face several critical limitations. First, they
often rely on manual annotations, which are time-consuming and difficult to
scale across different programming languages and problem complexities. Second,
most existing benchmarks focus primarily on Python, while the few multilingual
benchmarks suffer from limited difficulty and uneven language distribution. To
address these challenges, we propose AutoCodeGen, an automated method for
generating high-difficulty multilingual code generation datasets without manual
annotations. AutoCodeGen ensures the correctness and completeness of test cases
by generating test inputs with LLMs and obtaining test outputs through a
multilingual sandbox, while achieving high data quality through reverse-order
problem generation and multiple filtering steps. Using this novel method, we
introduce AutoCodeBench, a large-scale code generation benchmark comprising
3,920 problems evenly distributed across 20 programming languages. It is
specifically designed to evaluate LLMs on challenging, diverse, and practical
multilingual tasks. We evaluate over 30 leading open-source and proprietary
LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The
results show that even the most advanced LLMs struggle with the complexity,
diversity, and multilingual nature of these tasks. In addition, we introduce
AutoCodeBench-Complete, specifically designed for base models to assess their
few-shot code generation capabilities. We hope the AutoCodeBench series will
serve as a valuable resource and inspire the community to focus on more
challenging and practical multilingual code generation scenarios.
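
The test-case step described above (inputs proposed by an LLM, outputs obtained by executing a reference solution) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `run_in_sandbox`, `build_test_cases`, and `REFERENCE_SOLUTION` are hypothetical, a plain Python subprocess stands in for the multilingual sandbox (it offers no real isolation), and a hard-coded list stands in for the LLM-generated inputs.

```python
import subprocess
import sys

# Hypothetical reference solution: reads two integers, prints their sum.
REFERENCE_SOLUTION = "a, b = map(int, input().split())\nprint(a + b)\n"


def run_in_sandbox(solution_code, stdin_text, timeout=5.0):
    """Run the solution in a separate Python process.

    Returns the captured stdout, or None if the run crashes or times out.
    Assumption: a subprocess is only a stand-in for a real sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", solution_code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


def build_test_cases(solution_code, candidate_inputs):
    """Pair each candidate input with the output the solution produces.

    Inputs on which the solution fails are dropped, so every kept
    (input, expected_output) pair is backed by an actual execution.
    """
    cases = []
    for candidate in candidate_inputs:
        output = run_in_sandbox(solution_code, candidate)
        if output is not None:
            cases.append((candidate, output))
    return cases


if __name__ == "__main__":
    # Stand-in for LLM-proposed inputs; the last one is deliberately invalid.
    proposed_inputs = ["1 2\n", "10 -3\n", "x y\n"]
    for test_input, expected in build_test_cases(REFERENCE_SOLUTION, proposed_inputs):
        print(repr(test_input), "->", repr(expected.strip()))
```

The point of the design is that expected outputs are never written by the model; they come from running code, so a kept test case is consistent with the reference solution by construction. Whether the reference solution itself is correct is a separate question that this sketch does not address.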
