MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge

arXiv:2505.23982v1
Abstract: Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA is the first benchmark to jointly evaluate the factual knowledge and reasoning capabilities that LLMs need for advanced materials science.
