Paper page - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Tencent Hunyuan Releases ArtifactsBench: A Next-Generation “What-You-See-Is-What-You-Get” Evaluation Standard for Code Generation

ArtifactsBench is designed to comprehensively measure large language models (LLMs) on their ability to generate visually rich, interactive, and dynamic code artifacts. As AI code generation enters a new phase, ArtifactsBench provides the industry with a precise yardstick for evaluating and advancing models from “able to write code” to “able to write high-quality, user-friendly code.”

Facing the Challenge: Built for Visual and Interactive Code
Traditional programming benchmarks focus mainly on algorithmic correctness and overlook the crucial aspects of visual presentation and user experience in modern applications. ArtifactsBench is specifically created to fill this gap. It consists of 1,825 meticulously crafted tasks of unprecedented breadth and depth, covering nine real-world scenarios—from static web components and SVG data visualizations to mini-games and management systems with complex interaction logic. All tasks are stratified by difficulty, enabling systematic assessment of a model’s visual code-generation capabilities across varying complexity levels.

Core Innovation: A Fully Automated, Multimodal Evaluation Pipeline
The standout feature of ArtifactsBench is its novel multimodal, automated evaluation paradigm. The pipeline first runs scripted interactions against the model-generated visual artifacts (e.g., web pages, applications) while recording screenshots and GIFs. These visual materials, which capture the artifact's dynamic behavior, are then submitted together with the task requirements to a Multimodal Large Language Model as Judge (MLLM-as-Judge) for evaluation. Guided by fine-grained, task-specific checklists, the judge delivers comprehensive, objective, and reproducible scores.
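To make the checklist-guided judging step concrete, here is a minimal sketch in Python. It is not the ArtifactsBench implementation: the `Evidence`, `Task`, `evaluate`, and `stub_judge` names, the 0 to 10 score scale, and the use of a plain mean are all assumptions made for illustration, and the judge is a stand-in function rather than a real multimodal model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Evidence:
    """Visual material captured while a script interacts with the artifact."""
    screenshots: List[str]  # hypothetical paths to captured screenshots
    gif: str                # hypothetical path to a GIF of the interaction


@dataclass
class Task:
    requirement: str        # the task description given to the model
    checklist: List[str]    # fine-grained, task-specific criteria


# A judge maps (requirement, checklist item, evidence) to a score.
# The 0-10 scale is an assumption of this sketch, not taken from the source.
Judge = Callable[[str, str, Evidence], float]


def evaluate(task: Task, evidence: Evidence, judge: Judge) -> Tuple[Dict[str, float], float]:
    """Score every checklist item, then return the per-item scores and their mean."""
    if not task.checklist:
        raise ValueError("task has an empty checklist")
    per_item = {
        item: judge(task.requirement, item, evidence)
        for item in task.checklist
    }
    overall = sum(per_item.values()) / len(per_item)
    return per_item, overall


def stub_judge(requirement: str, item: str, evidence: Evidence) -> float:
    """Stand-in for a multimodal model call; rewards having dynamic evidence."""
    return 8.0 if evidence.gif else 5.0


task = Task(
    requirement="Build a to-do list page with add and delete buttons",
    checklist=["Items can be added", "Items can be deleted", "Layout is readable"],
)
evidence = Evidence(screenshots=["before.png", "after.png"], gif="session.gif")
per_item, overall = evaluate(task, evidence, stub_judge)
```

Keeping the judge behind a plain callable means the same scoring loop could be exercised with a stub in tests and with a real multimodal model in production, which is one way to keep such a pipeline reproducible.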

Value Validation: Highly Consistent with Human Experts
The authority of any benchmark hinges on the credibility of its conclusions. Therefore, we conducted a large-scale alignment study comparing ArtifactsBench’s automated evaluation results with the fully human-voted WebDev Arena. The findings reveal that ArtifactsBench’s model rankings achieve an impressive 94.4% consistency with human expert preferences. This remarkable figure demonstrates that ArtifactsBench’s automated evaluation workflow can reliably replace traditional manual assessments and become the gold standard for measuring the visual and interactive quality of code artifacts.
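The text above does not say how ranking consistency is computed. One common way to compare two rankings is pairwise agreement: the fraction of model pairs that both score tables put in the same order. The sketch below illustrates that measure under this assumption; the function name, the model names, and all numbers are hypothetical and do not reproduce any ArtifactsBench or WebDev Arena result.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(auto: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of model pairs that both score tables order the same way.

    Only models present in both tables are compared. A tie in either
    table counts as a disagreement.
    """
    models = sorted(set(auto) & set(human))
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models present in both tables")
    agree = sum(
        1
        for a, b in pairs
        if (auto[a] - auto[b]) * (human[a] - human[b]) > 0
    )
    return agree / len(pairs)


# Hypothetical scores: automated benchmark scores and human-vote ratings.
auto_scores = {"model_a": 60.0, "model_b": 55.0, "model_c": 40.0, "model_d": 42.0}
human_scores = {"model_a": 1300.0, "model_b": 1250.0, "model_c": 1100.0, "model_d": 1080.0}

consistency = pairwise_agreement(auto_scores, human_scores)
```

In this toy data the two tables disagree only on the order of `model_c` and `model_d`, so five of the six pairs agree.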
