Advanced AI News
Hugging Face

Paper page – ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

By Advanced AI Editor | July 8, 2025 | 2 Mins Read
Tencent Hunyuan Releases ArtifactsBench: A Next-Generation “What-You-See-Is-What-You-Get” Evaluation Standard for Code Generation

ArtifactsBench is designed to comprehensively measure large language models (LLMs) on their ability to generate visually rich, interactive, and dynamic code artifacts. As AI code generation enters a new phase, ArtifactsBench provides the industry with a precise yardstick for evaluating and advancing models from “able to write code” to “able to write high-quality, user-friendly code.”

Facing the Challenge: Built for Visual and Interactive Code
Traditional programming benchmarks focus mainly on algorithmic correctness and overlook visual presentation and user experience, both of which are crucial in modern applications. ArtifactsBench was created to fill this gap. It consists of 1,825 curated tasks covering nine real-world scenarios, ranging from static web components and SVG data visualizations to mini-games and management systems with complex interaction logic. All tasks are stratified by difficulty, which allows a model's visual code-generation capabilities to be assessed systematically across complexity levels.
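
Difficulty stratification makes it possible to report one score per tier rather than a single average. The following is a minimal sketch of that kind of reporting, not ArtifactsBench's own tooling: the tier labels ("easy", "medium", "hard"), the score scale, and the function name `scores_by_difficulty` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def scores_by_difficulty(results):
    """Group per-task scores by difficulty label and average each group.

    `results` is an iterable of (difficulty, score) pairs. The labels and
    the score scale are hypothetical, not taken from ArtifactsBench.
    """
    buckets = defaultdict(list)
    for difficulty, score in results:
        buckets[difficulty].append(score)
    return {level: mean(scores) for level, scores in buckets.items()}


# Made-up scores, for illustration only.
example = [("easy", 80), ("easy", 60), ("medium", 50), ("hard", 30), ("hard", 10)]
report = scores_by_difficulty(example)
```

Reporting per tier shows whether a model's average is carried by easy tasks or holds up as complexity increases.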

Core Innovation: A Fully Automated, Multimodal Evaluation Pipeline
The standout feature of ArtifactsBench is its multimodal, automated evaluation paradigm. The pipeline first runs scripted interactions against the model-generated visual artifacts (e.g., web pages, applications) while recording screenshots and GIFs. These recordings, together with the task requirements, are then submitted to a Multimodal Large Language Model acting as judge (MLLM-as-Judge). Guided by fine-grained, task-specific checklists, the judge produces comprehensive, objective, and reproducible scores.
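
The capture, judge, and score steps described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the ArtifactsBench implementation: the `Task` class, the `Capture` and `Judge` signatures, and the unweighted mean over checklist items are all hypothetical, and the two stub functions stand in for a real browser driver and a real multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str            # task requirements given to the model
    checklist: List[str]   # fine-grained, task-specific criteria


# Hypothetical signatures: a capture step returns paths to recorded
# screenshots/GIFs; a judge returns one score per checklist item.
Capture = Callable[[str], List[str]]
Judge = Callable[[str, List[str], List[str]], Dict[str, float]]


def evaluate(task: Task, artifact_code: str, capture: Capture, judge: Judge) -> float:
    """Run one task through the capture -> judge -> aggregate steps."""
    frames = capture(artifact_code)  # scripted interaction plus recording
    item_scores = judge(task.prompt, task.checklist, frames)
    missing = [c for c in task.checklist if c not in item_scores]
    if missing:
        raise ValueError(f"judge skipped checklist items: {missing}")
    # Assumption: unweighted mean over checklist items.
    return sum(item_scores[c] for c in task.checklist) / len(task.checklist)


# Stubs standing in for a real browser driver and a real multimodal model.
def fake_capture(code: str) -> List[str]:
    return ["frame_0.png", "frame_1.png", "interaction.gif"]


def fake_judge(prompt: str, checklist: List[str], frames: List[str]) -> Dict[str, float]:
    return {item: 10.0 if i == 0 else 5.0 for i, item in enumerate(checklist)}


task = Task("Build a clickable counter", ["renders a button", "count updates on click"])
score = evaluate(task, "<button>0</button>", fake_capture, fake_judge)
```

Passing the capture and judge steps in as callables keeps the aggregation logic testable without a browser or a model endpoint.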

Value Validation: Highly Consistent with Human Experts
The authority of any benchmark hinges on the credibility of its conclusions. We therefore conducted a large-scale alignment study comparing ArtifactsBench's automated evaluation results with WebDev Arena, whose rankings are decided entirely by human votes. ArtifactsBench's model rankings reach 94.4% consistency with human expert preferences. This figure indicates that the automated evaluation workflow can reliably stand in for traditional manual assessment, and it positions ArtifactsBench as a gold standard for measuring the visual and interactive quality of code artifacts.
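
The text above does not say how ranking consistency was computed. One common way to compare two rankings is pairwise agreement: the fraction of model pairs that both rankings place in the same order. The sketch below shows that metric as an assumption, with placeholder model names and made-up rankings; it does not reproduce the 94.4% figure.

```python
from itertools import combinations


def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of model pairs ordered the same way in both rankings."""
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    # Only models that appear in both rankings can be compared.
    shared = [m for m in ranking_a if m in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two models present in both rankings")
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)


# Placeholder rankings, best model first.
automated = ["model_a", "model_b", "model_c", "model_d"]
human = ["model_a", "model_c", "model_b", "model_d"]
consistency = pairwise_agreement(automated, human)
```

In this example the two rankings disagree only on the order of `model_b` and `model_c`, so five of the six pairs agree.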
