Paper Page - WebGen-Bench: Evaluating LLMs On Generating Interactive And Functional Websites From Scratch

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent’s ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks (Bolt.diy, OpenHands, and Aider) using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. A Qwen2.5-Coder-32B-Instruct model trained on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model. We release our data-generation, training, and testing code, along with both the datasets and the model weights, at https://github.com/mnluzimu/WebGen-Bench.
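The test-case structure described in the abstract (an operation plus its expected outcome, judged by a web-navigation agent) can be illustrated with a minimal Python sketch. This is not the benchmark's actual code: the names `WebTestCase` and `accuracy`, the boolean pass/fail verdict, and the plain pass-rate formula are all assumptions made for illustration, and the released repository may define and score test cases differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class WebTestCase:
    """Hypothetical container for one test case: what to do and what should happen."""
    operation: str          # action to perform on the generated website
    expected_outcome: str   # result the website should show afterwards


def accuracy(cases: Iterable[WebTestCase],
             judge: Callable[[WebTestCase], bool]) -> float:
    """Return the fraction of test cases that the judge marks as passed.

    `judge` stands in for the web-navigation agent; here it is assumed to
    return a plain True/False verdict per test case.
    """
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    passed = sum(1 for case in case_list if judge(case))
    return passed / len(case_list)
```

In use, `judge` would wrap whatever agent executes the operation on the live site and compares the observed response with `expected_outcome`; a stub function is enough to exercise the scoring logic on its own.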
