Lessons from AI Translation to Improve Multilingual LLM Evaluation

By Advanced AI Bot | April 29, 2025 | 5 min read


As large language models (LLMs) continue to scale across languages, their evaluation frameworks are struggling to keep pace. Two recent studies — one from Alibaba and academic partners, the other from a collaboration between Cohere and Google — highlight critical challenges in multilingual LLM evaluation.

“As large language models continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress,” the Alibaba researchers said, adding that such evaluation is “not merely academic but essential.”

Both studies identify similar issues: evaluation practices are inconsistent, underpowered, and frequently biased toward English or high-resource languages. Moreover, current benchmarks often fail to reflect real-world use cases or align with human judgments.

“Evaluation practices […] are still lacking comprehensiveness, scientific rigor, and consistent adoption,” the Google and Cohere researchers said, explaining that these gaps undermine the potential of evaluation frameworks to meaningfully guide multilingual LLM development.

Alibaba also observed “fragmented efforts, limited language coverage, and a mismatch between academic benchmarks and real-world applications.” 

Dominance of High-Resource Languages

The Alibaba study offers an overview of the multilingual evaluation ecosystem, analyzing 2,024 (non-English) benchmark datasets published between 2021 and 2024 across 148 countries.

The researchers observed growth in the size of multilingual benchmarks, which they said “reflects the growing emphasis on large-scale evaluation resources,” and estimated that benchmark development cost over USD 11 million between 2021 and 2024.

They explained that multilingual evaluation is crucial to understanding how models perform, “especially given the linguistic diversity and varying resource availability across languages.” 

Despite the focus on non-English benchmarks, English still emerged as the most represented language. High-resource languages like Chinese, Spanish, and French dominate, while many low-resource languages remain underrepresented.

“This distribution underscores the dominance of high-resource languages within our benchmark collection, while highlighting the challenges in achieving broader linguistic representation,” the researchers noted.

They also pointed out that most benchmark content is sourced from general domains like news and social media, while high-stakes domains such as healthcare and law remain underrepresented.

Translating Benchmarks Is “Insufficient”

The Alibaba researchers identified two primary approaches to multilingual evaluation: (i) translating existing English evaluation suites into other languages, and (ii) curating new evaluation resources directly in the target language.

They found that more than 60% of benchmarks were created originally in the target language rather than translated from English (whether by humans or by machine). Benchmarks created natively in the target language correlated more strongly with human judgments than translated ones, and human-translated benchmarks correlated better than machine-translated ones.
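The correlation behind this finding is straightforward to picture: score a set of models on the benchmark, collect human quality ratings for the same models, and measure how well the two agree. Below is a minimal sketch with invented model scores; neither paper publishes this exact code.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores on one benchmark (invented for illustration).
benchmark_scores = [71.2, 65.8, 80.1, 58.4, 74.9]  # automatic benchmark accuracy, %
human_ratings    = [3.9, 3.1, 4.4, 2.8, 4.0]       # mean human quality rating, 1-5

# Pearson: linear agreement between the two score scales.
r, p_value = pearsonr(benchmark_scores, human_ratings)

# Spearman: agreement on the *ranking* of models, which is often
# what benchmark users actually care about.
rho, _ = spearmanr(benchmark_scores, human_ratings)

print(f"Pearson r = {r:.3f} (p = {p_value:.3f}), Spearman rho = {rho:.3f}")
```

A benchmark whose scores rank models the same way humans do earns a high correlation; the Alibaba finding is that natively created benchmarks land higher on this measure than translated ones.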

The Alibaba researchers said “translated benchmarks often fail to capture language-specific nuances, cultural contexts, and linguistic features,” noting that “simply translating English benchmarks into other languages is insufficient for robust multilingual evaluation.”

“It underscores the importance of localized benchmarks specifically designed to capture these nuances and contexts,” they added, emphasizing “the critical need for culturally and linguistically authentic evaluation resources.”

This finding echoes a core concern in the Cohere-Google study, which demonstrated that translation artifacts in prompts can distort evaluation outcomes. Their recommendation is clear: prioritize original, target-language prompts wherever possible, and if translation is necessary, carefully document translation quality and methodology.

Challenges in Reporting and Interpreting Results

Beyond the quality of benchmarks themselves, Cohere and Google raised concerns about how evaluation results are reported and interpreted. They highlighted that many multilingual evaluations rely on small test sets — often fewer than 500 prompts per language — and rarely include statistical significance testing. 

Without reported confidence intervals or effect sizes, it is difficult to determine whether observed differences between models are meaningful or statistically reliable. The researchers warned that this is especially problematic when evaluations rely on LLMs themselves as judges.
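The remedy the researchers point toward is standard in machine translation evaluation: resampling-based significance tests and confidence intervals. Here is a minimal sketch of a paired bootstrap test over per-prompt scores; the data is simulated and the code is illustrative, not drawn from either paper.

```python
import random
from statistics import mean

def paired_bootstrap(scores_a, scores_b, n_resamples=1_000, seed=0):
    """Resample prompts with replacement; report how often model A's
    mean beats model B's, plus a 95% CI on the score gap."""
    assert len(scores_a) == len(scores_b), "scores must be paired per prompt"
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(mean(scores_a[i] for i in idx) -
                      mean(scores_b[i] for i in idx))
    deltas.sort()
    win_rate = sum(d > 0 for d in deltas) / n_resamples
    ci_95 = (deltas[int(0.025 * n_resamples)], deltas[int(0.975 * n_resamples)])
    return win_rate, ci_95

# Simulated per-prompt scores for two models on a 500-prompt test set.
random.seed(1)
model_a = [random.gauss(0.62, 0.20) for _ in range(500)]
model_b = [random.gauss(0.60, 0.20) for _ in range(500)]
win_rate, (lo, hi) = paired_bootstrap(model_a, model_b)
print(f"A > B in {win_rate:.1%} of resamples; 95% CI on gap: [{lo:.3f}, {hi:.3f}]")
```

On a 500-prompt test set, a small average gap can easily come with a confidence interval that includes zero, which is precisely why unreported uncertainty makes model comparisons hard to trust.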

Cohere and Google advocate complementing automatic metric-based evaluations with qualitative error analysis and reporting task- and language-specific scores, rather than relying solely on aggregate averages.
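A per-language breakdown is equally simple to produce. This sketch, with invented task names and scores, shows how an aggregate average can mask a large gap on a low-resource language:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records: (task, language, per-prompt score).
records = [
    ("qa", "fr", 0.74), ("qa", "fr", 0.69), ("qa", "sw", 0.41), ("qa", "sw", 0.38),
    ("summarization", "fr", 0.66), ("summarization", "sw", 0.35),
]

by_group = defaultdict(list)
for task, lang, score in records:
    by_group[(task, lang)].append(score)

# Task- and language-specific scores make the low-resource gap visible;
# the single aggregate printed last would hide it.
for (task, lang), scores in sorted(by_group.items()):
    print(f"{task:>13}/{lang}: {mean(scores):.2f} (n={len(scores)})")
print(f"{'aggregate':>13}/--: {mean(s for _, _, s in records):.2f}")
```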


Lack of Transparency and Need for Standardization

The Alibaba researchers emphasized the need for “accurate, contamination-free, challenging, practically relevant, linguistically diverse, and culturally authentic evaluations,” stating that “following these principles is essential for ensuring language technologies serve global populations equitably and perform reliably across a wide range of languages.”

They also outlined critical research directions, including improving representation for low-resource languages, creating culturally localized benchmarks, leveraging LLMs as multilingual judges while addressing inherent biases, and developing efficient benchmarking methods as multilingual complexity increases.

Cohere and Google called for the adoption of standardized evaluation pipelines. They recommend publishing the exact wording of prompts, releasing evaluation code and outputs, and providing versioning details for “full transparency” and reproducibility.
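As a concrete illustration of what such a record might contain, the sketch below writes out a hypothetical evaluation-run manifest covering the exact prompt wording, decoding settings, and version identifiers. Every field name here is our own invention, not a schema from the papers.

```python
import hashlib
import json

# Hypothetical evaluation-run manifest; all field names and values are
# invented for illustration, not taken from either paper.
manifest = {
    "benchmark": "example-multilingual-qa",
    "benchmark_version": "1.2.0",
    "model": "example-model-2025-04",
    "prompt_template": "Answer in {language}: {question}",
    "prompt_origin": "native",  # native vs. human- or machine-translated
    "decoding": {"temperature": 0.0, "max_tokens": 256},
    "metric": "exact_match",
    "n_prompts_per_language": 500,
    "languages": ["fr", "sw", "yo"],
    "eval_code_commit": "<git commit hash of the evaluation code>",
}
# Hash the exact prompt wording so readers can verify it was not altered.
manifest["prompt_template_sha256"] = hashlib.sha256(
    manifest["prompt_template"].encode("utf-8")
).hexdigest()

print(json.dumps(manifest, indent=2))
```

Publishing a manifest like this alongside outputs is what makes a reported score reproducible: anyone can re-run the same prompts, with the same decoding settings, against the same benchmark version.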

Importantly, the Cohere-Google paper draws a direct link to AI translation research, stating that many of the current challenges in multilingual LLM evaluation are familiar problems that AI translation researchers have already addressed through rigorous evaluation practices.

A Call to Action

The Alibaba researchers concluded with a strong call to action, advocating for a “global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.”

They emphasized that advancing multilingual LLM evaluation requires “commitment from all stakeholders in the language technology ecosystem,” and recognized the need for a “fundamental shift” in how researchers and practitioners collaborate to address these challenges.

“We aim to catalyze more equitable, representative, and meaningful evaluation methodologies that can better guide the development of truly multilingual language technologies serving the global community,” they wrote.

Authors:
Alibaba paper — Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang
Cohere and Google paper — Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, and Tom Kocmi


