
Cohere Labs head calls “unreliable” AI leaderboard rankings a “crisis” in the field

By Advanced AI Editor | May 2, 2025


Cohere-led study claims popular crowd-sourced leaderboard LM Arena tipped scales for Google, OpenAI.

The head of Cohere’s research division is concerned that alleged unreliability in the rankings of a popular chatbot leaderboard amounts to a “crisis” in artificial intelligence (AI) development. 

A new study co-authored by Sara Hooker, head of Cohere Labs, along with researchers at Cohere and leading universities, claims that large AI companies have been “gaming” the crowd-sourced chatbot ranking platform LM Arena to boost the ranking of their large language models (LLMs).

“One of the benchmarks that is most widely used, most highly visible, has shown a clear pattern of unreliability in rankings,” Hooker, who is also vice-president of research at Cohere, said in an interview with BetaKit. 

Hooker and her co-authors are trying to highlight what she says is a lack of transparency and trustworthiness that is eroding the value of an AI model leaderboard widely used by academia, enterprises, and the public.

The research paper, titled “The Leaderboard Illusion,” was written by researchers from Cohere, Cohere Labs, Stanford University, Princeton University, the University of Waterloo, the University of Washington, MIT, and Ai2. It was published on the open-access platform arXiv and has not yet been peer reviewed.

LM Arena’s “Chatbot Arena” has become a leading public metric for ranking LLMs. It was spun out of a research project at the University of California, Berkeley. The “arena” conceit comes from users comparing the performance of two chatbots side by side in a “battle” and voting for the winner.
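
Leaderboards built on such pairwise votes typically convert them into a ranking with a rating system from the Elo or Bradley-Terry family. As a rough illustration only (LM Arena’s actual model and parameters are not reproduced here; the K factor and base rating below are assumptions), an Elo-style update from one human vote looks like this:

```python
# Minimal Elo-style rating update from pairwise "battle" votes.
# Illustrative sketch only: the K factor and base rating are assumptions,
# not LM Arena's production parameters.

K = 32            # update step size (assumed)
BASE = 1000.0     # starting rating for every model (assumed)

ratings: dict[str, float] = {}

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that model A beats model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_battle(model_a: str, model_b: str, winner: str) -> None:
    """Update both models' ratings after one human vote."""
    r_a = ratings.setdefault(model_a, BASE)
    r_b = ratings.setdefault(model_b, BASE)
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if winner == model_a else 0.0
    # Zero-sum update: the winner gains what the loser gives up.
    ratings[model_a] = r_a + K * (s_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - s_a) - (1.0 - e_a))

record_battle("model-x", "model-y", winner="model-x")
```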

The paper’s authors accuse LM Arena of allowing leading AI developers—such as Meta, Google, and OpenAI—to conduct extensive private pre-release testing and to retract scores for models that did not perform as well. The paper also claims these developers get more testing opportunities, or more “battles,” giving them access to more data than open-source providers receive. The authors say this results in “preferential treatment” at the expense of competitors.

The paper claims that LM Arena made it clear only to some model providers that they could run multiple pre-release tests at once. According to the analysis, Meta privately tested 27 versions of Llama-4 before releasing its final model, which ranked high on the leaderboard when it debuted.
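
The concern with this pattern is statistical: if a provider privately scores many variants and publishes only the best one, the published number reflects the maximum of several noisy measurements rather than typical model quality. A hypothetical simulation (the skill and noise figures below are illustrative assumptions, not values from the paper) shows how much best-of-N selection alone can inflate a score:

```python
# Hypothetical simulation of selection bias from private variant testing.
# All numbers are illustrative assumptions, not figures from the paper.
import random

random.seed(0)
TRUE_SKILL = 1200.0   # every variant's underlying rating (assumed)
NOISE = 40.0          # arena measurement noise, std dev (assumed)
TRIALS = 10_000

def best_of(n: int) -> float:
    """Average published rating when a provider tests n variants
    and submits only the highest-scoring one."""
    total = 0.0
    for _ in range(TRIALS):
        total += max(random.gauss(TRUE_SKILL, NOISE) for _ in range(n))
    return total / TRIALS

for n in (1, 5, 27):  # 27 matches the Llama-4 variant count cited above
    print(f"variants tested: {n:2d} -> mean published rating: {best_of(n):.0f}")
```

Even with identical underlying quality, the best of 27 noisy runs lands well above the true rating, which is the “leaderboard illusion” the title refers to.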

Meta had already been caught uploading a different version of Llama-4 to Chatbot Arena that was optimized for human preference. In response, LM Arena updated its policy and stated that Meta’s conduct “did not match what we expect from model providers.” 

In a post on X, LM Arena denied the notion that some model providers are treated unfairly and listed a number of “factual errors” in the paper. The organization said its policy, which allows model providers to run pre-release testing, had been public for a while. 

Cohere’s Hooker directly responded on X to some of the critiques LM Arena raised about the paper, and thanked LM Arena for its engagement.

“I’m hoping we have a more substantial conversation. We want changes to the arena,” Hooker told BetaKit. “This was an uncomfortable paper to write.”

The paper’s authors called on LM Arena to cap the number of private variant models that a model provider can submit. They also called for a ban on retracting submitted scores, improvements to sampling fairness, and greater transparency surrounding which models are removed and when. 
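
On the sampling-fairness point, the simplest remedy is to draw battle pairs uniformly at random, so no provider’s models accumulate disproportionately more vote data. A minimal sketch of that policy (hypothetical; LM Arena’s actual sampler is more involved and is not reproduced here):

```python
# Hypothetical uniform battle sampler, sketching the fairness
# improvement the authors call for.
import random

def sample_battle(models: list[str]) -> tuple[str, str]:
    """Choose two distinct models with equal probability for every pair,
    so expected battle counts are identical across providers."""
    a, b = random.sample(models, 2)
    return a, b
```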

“I hope that there is an active discussion and a sense of integrity that maybe this paper has spurred,” Hooker said. “It’s so critical that we acknowledge that this is just bad science.”

Cohere Labs is the non-profit research lab run out of Toronto-based LLM developer Cohere. Cohere is Canada’s largest entrant in the uber-competitive, capital-intensive AI race, but its models have not soared to the top of the Chatbot Arena leaderboard. Its Command A model, which the company claims outperforms both OpenAI’s GPT-4o release from last November and DeepSeek’s V3, is ranked 19th.

Cohere caters exclusively to enterprise clients, a narrower market than many competitors serve, though most tech giants are also making enterprise plays as businesses feel pressure to integrate AI into their workflows.

Alternative benchmarking

Deval Pandya, VP of engineering at the Toronto-based not-for-profit Vector Institute, told BetaKit that this discourse highlights a need to continue improving AI model evaluations. 

The Vector Institute is the brainchild of Waabi CEO Raquel Urtasun, Deep Genomics CEO Brendan Frey, and Canadian Nobel Prize winner Geoffrey Hinton. The non-profit research institute recently released its own comprehensive evaluation of 11 leading AI models. 

Vector’s AI model leaderboard is not dynamic or crowd-sourced. Instead, it displays scores for nearly a dozen scientific benchmarks, from mathematical reasoning tasks to code generation.

To Pandya, an evaluation like Vector’s serves a different, but still important, purpose than Chatbot Arena. He argued that consumers can benefit from crowd-sourced data based on human preference, while enterprises might want something more granular if they are looking to mix and match different AI models for a business use case.

AI companies have an incentive to self-report only the best progress they have made, especially when they are public companies, Pandya said. The challenge is to pursue objective model evaluations to make sure the claims companies make are true. And he said there’s a need for more independent projects like Vector’s to help evaluate all available models, not just those in the limelight. 

“The goal is to democratize how we think about evaluations,” Pandya said. 

Feature image courtesy World Economic Forum, CC BY-NC-SA 2.0, via Flickr. 


