Advanced AI News
Cohere

Cohere Labs head calls “unreliable” AI leaderboard rankings a “crisis” in the field

By Advanced AI Editor | May 2, 2025


Cohere-led study claims popular crowd-sourced leaderboard LM Arena tipped the scales for Google and OpenAI.

The head of Cohere’s research division is concerned that alleged unreliability in the rankings of a popular chatbot leaderboard amounts to a “crisis” in artificial intelligence (AI) development. 

A new study co-authored by Sara Hooker, head of Cohere Labs, along with researchers at Cohere and leading universities, claims that large AI companies have been “gaming” the crowd-sourced chatbot ranking platform LM Arena to boost the ranking of their large language models (LLMs).

“I hope that there is an active discussion and a sense of integrity that maybe this paper has spurred.”

Sara Hooker
Cohere

“One of the benchmarks that is most widely used, most highly visible, has shown a clear pattern of unreliability in rankings,” Hooker, who is also vice-president of research at Cohere, said in an interview with BetaKit. 

Hooker and her co-authors are trying to highlight what she says is a lack of transparency and trustworthiness that is eroding the value of an AI model leaderboard widely used by academia, enterprises, and the public.

The research paper, titled “The Leaderboard Illusion,” was written by researchers from Cohere, Cohere Labs, Stanford University, Princeton University, the University of Waterloo, the University of Washington, MIT, and Ai2. It was published on the open-access platform arXiv and has not yet been peer reviewed.

LM Arena’s “Chatbot Arena” has become a leading public metric for ranking LLMs. It was spun out from a research project at the University of California, Berkeley. The “arena” gimmick comes from users comparing the performance of two chatbots side-by-side in a “battle” and voting for the winner. 

The paper’s authors accuse LM Arena of allowing leading AI developers, such as Meta, Google, and OpenAI, to conduct extensive private pre-release testing and to retract scores for models that did not perform well. They also claim these developers get more testing opportunities, or more “battles,” giving them access to more data than open-source providers. The authors say this results in “preferential treatment” at the expense of competitors. 

RELATED: Did Cohere give Canada its DeepSeek moment?

The paper claims that LM Arena only made it clear to some model providers that they could run multiple pre-release tests at once. According to the analysis, Meta privately tested 27 versions of Llama-4 before releasing its final model, which ranked high on the leaderboard when it debuted. 

Meta had already been caught uploading a different version of Llama-4 to Chatbot Arena that was optimized for human preference. In response, LM Arena updated its policy and stated that Meta’s conduct “did not match what we expect from model providers.” 
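The statistical effect the authors describe, testing many private variants and publishing only the best, can be illustrated with a toy simulation (the numbers below are illustrative, not from the paper): even when every variant has the same underlying quality, reporting the maximum of many noisy scores inflates the published result.

```python
import random

def best_of_n(true_skill, n, noise=50.0, seed=0):
    """Return the max of n noisy benchmark scores for the same true skill."""
    rng = random.Random(seed)
    return max(true_skill + rng.gauss(0, noise) for _ in range(n))

# Same underlying model quality in both cases; only the number of
# privately tested variants differs.
single_variant = best_of_n(1000.0, 1)
many_variants = best_of_n(1000.0, 27)
```

Because the best of 27 draws can only be at least as high as a single draw, a provider that tests many variants and retracts the rest gets a systematically flattering headline score.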

In a post on X, LM Arena denied the notion that some model providers are treated unfairly and listed a number of “factual errors” in the paper. The organization said its policy, which allows model providers to run pre-release testing, had been public for a while. 

Cohere’s Hooker directly responded on X to some of the critiques LM Arena raised about the paper, and thanked LM Arena for its engagement.

“I’m hoping we have a more substantial conversation. We want changes to the arena,” Hooker told BetaKit. “This was an uncomfortable paper to write.”

The paper’s authors called on LM Arena to cap the number of private variant models that a model provider can submit. They also called for a ban on retracting submitted scores, improvements to sampling fairness, and greater transparency surrounding which models are removed and when. 

“I hope that there is an active discussion and a sense of integrity that maybe this paper has spurred,” Hooker said. “It’s so critical that we acknowledge that this is just bad science.”

RELATED: At World Summit AI, cautious tone of researchers drowned out by cutthroat adoption race

Cohere Labs is the non-profit research lab run out of Toronto-based LLM developer Cohere. As Canada’s largest entrant in the uber-competitive, capital-intensive AI race, Cohere’s models have not soared to the top of the Chatbot Arena leaderboard. Its Command A model, which the company claims outperformed OpenAI’s GPT-4o from last November as well as DeepSeek’s v3, is ranked 19th. 

Cohere caters exclusively to enterprise clients, putting it in a narrower lane than some competitors, but most tech giants are making enterprise plays as companies feel pressure to integrate AI into their workflows. 

Alternative benchmarking

Deval Pandya, VP of engineering at the Toronto-based not-for-profit Vector Institute, told BetaKit that this discourse highlights a need to continue improving AI model evaluations. 

The Vector Institute is the brainchild of Waabi CEO Raquel Urtasun, Deep Genomics CEO Brendan Frey, and Canadian Nobel Prize winner Geoffrey Hinton. The non-profit research institute recently released its own comprehensive evaluation of 11 leading AI models. 

Vector’s AI model leaderboard is not dynamic or crowd-sourced. Instead, it displays scores for nearly a dozen scientific benchmarks, from mathematical reasoning tasks to code generation.

To Pandya, an evaluation like Vector’s serves a different but still important purpose than Chatbot Arena. He argued that consumers can benefit from crowd-sourced data based on human preference, while enterprises might want something more granular if they are looking to mix and match different AI models for a business use case. 

AI companies have an incentive to self-report only the best progress they have made, especially when they are public companies, Pandya said. The challenge is to pursue objective model evaluations to make sure the claims companies make are true. And he said there’s a need for more independent projects like Vector’s to help evaluate all available models, not just those in the limelight. 

“The goal is to democratize how we think about evaluations,” Pandya said. 

Feature image courtesy World Economic Forum, CC BY-NC-SA 2.0, via Flickr. 


