Reasoning-reinforced Large Language Models (LLMs) have recently shown
remarkable capabilities on complex reasoning tasks. However, how these models
draw on different human reasoning skills remains poorly understood, especially
for multilingual commonsense reasoning
that involves everyday knowledge across different languages and cultures. To
address this gap, we propose a \textbf{M}ultilingual and Scalable Benchmark for
\textbf{S}kill-based \textbf{Co}mmonsense \textbf{Re}asoning (\textbf{mSCoRe}).
Our benchmark comprises three key components designed to systematically
evaluate LLMs' reasoning capabilities: (1) a novel
taxonomy of reasoning skills that enables fine-grained analysis of models’
reasoning processes, (2) a robust data synthesis pipeline tailored specifically
for commonsense reasoning evaluation, and (3) a complexity scaling framework
allowing task difficulty to scale dynamically alongside future improvements in
LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying
sizes and training approaches demonstrate that \textbf{mSCoRe} remains
significantly challenging for current models, particularly at higher complexity
levels. Our results reveal the limitations of such reasoning-reinforced models
when confronted with nuanced multilingual general and cultural commonsense. We
further provide a detailed analysis of the models' reasoning processes,
suggesting future directions for improving multilingual commonsense reasoning
capabilities.