As large language models (LLMs) continue to scale across languages, their evaluation frameworks are struggling to keep pace. Two recent studies — one from Alibaba and academic partners, the other from a collaboration between Cohere and Google — highlight critical challenges in multilingual LLM evaluation.
“As large language models continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress,” the Alibaba researchers said, emphasizing that such evaluation is “not merely academic but essential.”
Both studies identify similar issues: evaluation practices are inconsistent, underpowered, and frequently biased toward English or high-resource languages. Moreover, current benchmarks often fail to reflect real-world use cases or align with human judgments.
“Evaluation practices […] are still lacking comprehensiveness, scientific rigor, and consistent adoption,” the Google and Cohere researchers said, explaining that these gaps undermine the potential of evaluation frameworks to meaningfully guide multilingual LLM development.
Alibaba also observed “fragmented efforts, limited language coverage, and a mismatch between academic benchmarks and real-world applications.”
Dominance of High-Resource Languages
The Alibaba study offers an overview of the multilingual evaluation ecosystem, analyzing 2,024 non-English benchmark datasets from 148 countries, published between 2021 and 2024.
The researchers observed growth in the size of multilingual benchmarks that “reflects the growing emphasis on large-scale evaluation resources,” estimating that benchmark development cost over USD 11 million between 2021 and 2024.
They explained that multilingual evaluation is crucial to understanding how models perform, “especially given the linguistic diversity and varying resource availability across languages.”
Despite the focus on non-English benchmarks, English still emerged as the most represented language. High-resource languages like Chinese, Spanish, and French dominate, while many low-resource languages remain underrepresented.
“This distribution underscores the dominance of high-resource languages within our benchmark collection, while highlighting the challenges in achieving broader linguistic representation,” the researchers noted.
They also pointed out that most benchmark content is sourced from general domains like news and social media, while high-stakes domains such as healthcare and law remain underrepresented.
Translating Benchmarks Is “Insufficient”
The Alibaba researchers identified two primary approaches to multilingual evaluation: (i) translating existing English evaluation suites into other languages, and (ii) curating new evaluation resources directly in the target language.
They found that more than 60% of benchmarks were created originally in the target language rather than translated from English, whether by humans or machines. Benchmarks created natively in the target language correlated more strongly with human evaluations than translated ones, and human-translated benchmarks correlated better than machine-translated ones.
The Alibaba researchers said “translated benchmarks often fail to capture language-specific nuances, cultural contexts, and linguistic features,” noting that “simply translating English benchmarks into other languages is insufficient for robust multilingual evaluation.”
“It underscores the importance of localized benchmarks specifically designed to capture these nuances and contexts,” they added, emphasizing “the critical need for culturally and linguistically authentic evaluation resources.”
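For illustration only (this is not code from either paper), the sketch below shows one common way such benchmark-to-human agreement can be quantified: computing the Spearman rank correlation between per-model benchmark scores and human ratings, here for a hypothetical natively created benchmark versus a hypothetical translated one. All numbers are placeholders.

```python
# Illustrative sketch: all scores below are hypothetical placeholders, not data
# from either study. It measures how well each benchmark's ranking of models
# agrees with human judgments via Spearman rank correlation.
from scipy.stats import spearmanr

human_ratings    = [0.78, 0.65, 0.81, 0.59, 0.72]  # human evaluation scores per model
native_benchmark = [0.74, 0.61, 0.83, 0.55, 0.70]  # benchmark created in the target language
translated_bench = [0.69, 0.70, 0.75, 0.62, 0.66]  # benchmark machine-translated from English

for name, scores in [("native", native_benchmark), ("translated", translated_bench)]:
    rho, p_value = spearmanr(human_ratings, scores)
    print(f"{name} benchmark vs. human judgments: Spearman rho = {rho:.2f}")
```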
This finding echoes a core concern in the Cohere-Google study, which demonstrated that translation artifacts in prompts can distort evaluation outcomes. Their recommendation is clear: prioritize original, target-language prompts wherever possible, and if translation is necessary, carefully document translation quality and methodology.
Challenges in Reporting and Interpreting Results
Beyond the quality of benchmarks themselves, Cohere and Google raised concerns about how evaluation results are reported and interpreted. They highlighted that many multilingual evaluations rely on small test sets — often fewer than 500 prompts per language — and rarely include statistical significance testing.
Without reporting confidence intervals or effect sizes, it is difficult to determine whether observed differences between models are meaningful or statistically reliable. The researchers warned that this is especially problematic when evaluations rely on LLMs themselves as judges.
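As a concrete illustration of the kind of statistical reporting the researchers call for (a generic sketch, not code from the paper), a paired bootstrap over per-prompt scores can attach a confidence interval to the gap between two models on a small test set. The per-prompt scores below are simulated.

```python
# Minimal sketch: paired bootstrap confidence interval for the score difference
# between two models on a small test set. All data here is simulated.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05):
    """95% confidence interval for mean(score_a - score_b) via paired bootstrap."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[random.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(alpha / 2 * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Example: 500 hypothetical pass/fail prompt-level scores per model for one language.
random.seed(0)
model_a = [float(random.random() < 0.62) for _ in range(500)]
model_b = [float(random.random() < 0.58) for _ in range(500)]
low, high = bootstrap_diff_ci(model_a, model_b)
print(f"Model A - Model B accuracy difference: 95% CI [{low:.3f}, {high:.3f}]")
# If the interval contains 0, the observed gap may not be statistically reliable.
```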
Cohere and Google advocate complementing automatic metric-based evaluations with qualitative error analysis and reporting task- and language-specific scores, rather than relying solely on aggregate averages.
Lack of Transparency and Need for Standardization
Alibaba researchers emphasized the need for “accurate, contamination-free, challenging, practically relevant, linguistically diverse, and culturally authentic evaluations,” stating that “following these principles is essential for ensuring language technologies serve global populations equitably and perform reliably across a wide range of languages.”
They also outlined critical research directions, including improving representation for low-resource languages, creating culturally localized benchmarks, leveraging LLMs as multilingual judges while addressing inherent biases, and developing efficient benchmarking methods as multilingual complexity increases.
Cohere and Google called for the adoption of standardized evaluation pipelines. They recommend publishing the exact wording of prompts, releasing evaluation code and outputs, and providing versioning details for “full transparency” and reproducibility.
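As a rough illustration of what such a pipeline’s released artifacts could include (the field names below are assumptions for illustration, not a schema proposed by either paper), an evaluation run might ship a small manifest that pins the exact prompts, model identifier, code version, and outputs:

```python
# Hypothetical reproducibility manifest; field names and values are illustrative.
import hashlib, json, datetime

# In practice the exact released prompt file would be hashed; these prompts are placeholders.
prompts = ["Swali: ...", "Jibu swali lifuatalo: ..."]
prompt_hash = hashlib.sha256("\n".join(prompts).encode("utf-8")).hexdigest()

manifest = {
    "benchmark": "example-multilingual-qa",   # illustrative benchmark name
    "benchmark_version": "1.2.0",
    "languages": ["sw", "yo", "th"],
    "prompt_sha256": prompt_hash,             # ties results to the exact prompt wording
    "model": "example-model-2025-04",         # model identifier and version
    "eval_code_commit": "abc1234",            # commit of the released evaluation code
    "outputs_file": "outputs/run_001.jsonl",  # released model outputs
    "run_date": datetime.date.today().isoformat(),
}
print(json.dumps(manifest, indent=2))
```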
Importantly, the Cohere-Google paper draws a direct link to AI translation research, stating that many of the current challenges in multilingual LLM evaluation are familiar problems that AI translation researchers have already addressed through rigorous evaluation practices.
A Call to Action
The Alibaba researchers concluded with a strong call to action, advocating for a “global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.”
They emphasized that advancing multilingual LLM evaluation requires “commitment from all stakeholders in the language technology ecosystem,” and recognized the need for a “fundamental shift” in how researchers and practitioners collaborate to address these challenges.
“We aim to catalyze more equitable, representative, and meaningful evaluation methodologies that can better guide the development of truly multilingual language technologies serving the global community,” they wrote.
Authors:
Alibaba paper — Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang
Cohere and Google paper — Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, and Tom Kocmi