Advanced AI News

Study accuses LM Arena of helping top AI labs game its benchmark

By Advanced AI Bot | July 1, 2007 | 6 Mins Read

A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform’s leaderboard, though the opportunity was not afforded to every firm, the authors say.

“Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others,” said Cohere’s VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. “This is gamification.”

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side by side in a “battle” and asking users to choose the better one. It’s not uncommon to see unreleased models competing in the arena under a pseudonym.

Votes over time contribute to a model’s score — and, consequently, its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.
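As a rough illustration of how pairwise votes can turn into a leaderboard, the sketch below applies a standard Elo-style update to a list of battles. The article does not say which rating method LM Arena uses, so the Elo formula, the constants K and BASE, and the model names here are all assumptions made for illustration.

```python
from collections import defaultdict

K = 32         # update step size (illustrative assumption)
BASE = 1000.0  # starting rating for every model (illustrative assumption)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(battles):
    """battles: iterable of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in battles:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = K * (s_a - e_a)
        ratings[a] += delta  # the winner gains what the loser gives up
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple is one user-judged battle.
votes = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-z", "a"),
    ("model-y", "model-z", "tie"),
]
leaderboard = sorted(rate(votes).items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

In a scheme like this, every vote moves two ratings, so a model that appears in more battles also has its rating estimated from more data.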

However, that’s not what the paper’s authors say they uncovered.

One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March leading up to the tech giant’s Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model — a model that happened to rank near the top of the Chatbot Arena leaderboard.
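To see why publishing only the best of many private tests can inflate a score, consider a small simulation. It is a sketch under strong assumptions, not the study's method: every variant is taken to have the same true skill, each leaderboard measurement carries Gaussian noise, and the values of TRUE_SKILL and NOISE are hypothetical. Only the figure of 27 variants comes from the article.

```python
import random
import statistics


def published_score(true_skill: float, n_variants: int, noise: float,
                    rng: random.Random) -> float:
    """Score a lab would publish if it tests n_variants equally good variants
    and reports only the highest noisy measurement."""
    return max(rng.gauss(true_skill, noise) for _ in range(n_variants))


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_SKILL, NOISE, TRIALS = 1200.0, 15.0, 2000  # hypothetical values

single = statistics.mean(
    published_score(TRUE_SKILL, 1, NOISE, rng) for _ in range(TRIALS)
)
best_of_27 = statistics.mean(
    published_score(TRUE_SKILL, 27, NOISE, rng) for _ in range(TRIALS)
)
print(f"one submission: {single:.1f}, best of 27: {best_of_27:.1f}")
```

Under these assumptions the best-of-27 average sits well above the true skill even though no variant is actually better, which is the selection effect the authors describe.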

A chart pulled from the study. (Credit: Singh et al.)

In an email to TechCrunch, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica said that the study was full of “inaccuracies” and “questionable analysis.”

“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” said LM Arena in a statement provided to TechCrunch. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”

Armand Joulin, a principal researcher at Google DeepMind, also noted in a post on X that some of the study’s numbers were inaccurate, claiming Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would make a correction.

The paper’s authors started conducting their research in November 2024 after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they analyzed more than 2.8 million Chatbot Arena battles over a five-month stretch.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model “battles.” This increased sampling rate gave these companies an unfair advantage, the authors allege.

Using additional data from LM Arena could improve a model’s performance on Arena Hard, another benchmark LM Arena maintains, by 112%. However, LM Arena said in a post on X that Arena Hard performance does not directly correlate to Chatbot Arena performance.

Hooker said it’s unclear how certain AI companies might’ve received priority access, but that it’s incumbent on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that several of the claims in the paper don’t reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin, and relied on the models’ answers to classify them — a method that isn’t foolproof.
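The self-identification approach can be pictured roughly as follows. This is a minimal sketch, not the authors' code: the prompt wording, the provider list, the agreement threshold, and the `ask` callable standing in for a model endpoint are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

KNOWN_PROVIDERS = ["Meta", "OpenAI", "Google", "Amazon", "Cohere"]  # illustrative list


def classify_provider(ask: Callable[[str], str], n_prompts: int = 5,
                      min_agreement: float = 0.6) -> Optional[str]:
    """Ask an anonymous model who built it several times and return the
    majority answer, or None if no provider is named consistently enough."""
    votes = Counter()
    for _ in range(n_prompts):
        reply = ask("Which company created you?").lower()
        named = [p for p in KNOWN_PROVIDERS if p.lower() in reply]
        if len(named) == 1:  # ignore replies naming zero or several providers
            votes[named[0]] += 1
    if not votes:
        return None
    provider, count = votes.most_common(1)[0]
    return provider if count / n_prompts >= min_agreement else None


# Stand-in for a real model endpoint: a canned reply, used only for illustration.
def fake_model(prompt: str) -> str:
    return "I was trained by Meta."


print(classify_provider(fake_model))
```

The weakness is visible in the sketch itself: the classification is only as reliable as the model's own answer, and a model that misreports or declines to name its maker is mislabeled or left unclassified.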

However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn’t dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon — all of which were mentioned in the study — for comment. None immediately responded.

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more “fair.” For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose scores from these tests.

In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it “makes no sense to show scores for pre-release models which are not publicly available,” because the AI community cannot test the models for themselves.

The researchers also say LM Arena could adjust Chatbot Arena’s sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, and indicated that it’ll create a new sampling algorithm.
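One simple way to give every model the same number of battles is to always pair the models that have fought least so far. The sketch below shows that idea only; the article does not describe the sampling algorithm LM Arena plans to build, and the model names and function name are hypothetical.

```python
import random
from collections import Counter


def next_battle(models, battle_counts, rng):
    """Pick the two models with the fewest battles so far, breaking ties at random."""
    shuffled = list(models)
    rng.shuffle(shuffled)
    # Python's sort is stable, so models with equal counts keep their shuffled order.
    shuffled.sort(key=lambda m: battle_counts[m])
    return shuffled[0], shuffled[1]


rng = random.Random(0)
models = ["model-a", "model-b", "model-c", "model-d", "model-e"]  # hypothetical names
counts = Counter()
for _ in range(1000):
    a, b = next_battle(models, counts, rng)
    counts[a] += 1
    counts[b] += 1

print(dict(counts))
```

With this rule the battle counts of any two models never differ by more than one. A production system would likely trade some of that balance for other goals, such as sampling new models more often while their ratings are still uncertain.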

The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of the Llama 4 models for “conversationality,” which helped it achieve an impressive score on Chatbot Arena’s leaderboard. But the company never released the optimized model — and the vanilla version ended up performing much worse on Chatbot Arena.

At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study increases scrutiny of private benchmark organizations and of whether they can be trusted to assess AI models without corporate influence clouding the process.

This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/30/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark/


