mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning - Takara TLDR

Recent advancements in reasoning-reinforced Large Language Models (LLMs) have
shown remarkable capabilities in complex reasoning tasks. However, the
mechanism underlying their utilization of different human reasoning skills
remains poorly investigated, especially for multilingual commonsense reasoning
that involves everyday knowledge across different languages and cultures. To
address this gap, we propose a Multilingual and Scalable Benchmark for
Skill-based Commonsense Reasoning (mSCoRe).
Our benchmark incorporates three key components designed to
systematically evaluate LLMs’ reasoning capabilities: (1) a novel
taxonomy of reasoning skills that enables fine-grained analysis of models’
reasoning processes, (2) a robust data synthesis pipeline tailored specifically
for commonsense reasoning evaluation, and (3) a complexity scaling framework
allowing task difficulty to scale dynamically alongside future improvements in
LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying
sizes and training approaches demonstrate that mSCoRe remains
significantly challenging for current models, particularly at higher complexity
levels. Our results reveal the limitations of such reasoning-reinforced models
when confronted with nuanced multilingual general and cultural commonsense. We
further provide a detailed analysis of the models’ reasoning processes,
suggesting future directions for improving multilingual commonsense reasoning
capabilities.
