Paper page - When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

Recent advances in large language models (LLMs) have fueled the vision of
automated scientific discovery, often called AI Co-Scientists. To date, prior
work casts these systems as generative co-authors responsible for crafting
hypotheses, synthesizing code, or drafting manuscripts. In this work, we
explore a complementary application: using LLMs as verifiers to automate the
academic verification of scientific manuscripts. To that end, we
introduce SPOT, a dataset of 83 published papers paired with 91 errors
significant enough to prompt errata or retraction, cross-validated with actual
authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find
that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best
scores, with all others near zero). Furthermore, confidence estimates are
uniformly low, and across eight independent runs, models rarely rediscover the
same errors, undermining their reliability. Finally, qualitative analysis with
domain experts reveals that even the strongest models make mistakes resembling
student-level misconceptions. These findings
highlight the substantial gap between current LLM capabilities and the
requirements for dependable AI-assisted academic verification.
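
The recall, precision, and run-to-run consistency figures quoted above can be made concrete with a small sketch. It is illustrative only: the abstract does not say how a model's flagged issue is matched to an annotated error, so exact matching on a hypothetical error identifier is assumed here, and the function names and toy data are invented for the example rather than taken from SPOT.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(flagged: Set[str], annotated: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a single model run.

    Assumption: a flagged issue counts as correct only if its identifier
    exactly matches an annotated error identifier.
    """
    true_positives = len(flagged & annotated)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


def rediscovery_rate(runs: List[Set[str]], annotated: Set[str]) -> Dict[str, float]:
    """For each annotated error, the fraction of runs that flagged it."""
    return {
        error: sum(error in run for run in runs) / len(runs)
        for error in sorted(annotated)
    }


# Toy data (hypothetical identifiers, not drawn from the benchmark).
annotated_errors = {"eq3", "fig2", "table1"}
model_runs = [
    {"eq3", "intro-typo"},    # one real error, one false alarm
    {"eq3", "fig2", "refs"},  # two real errors, one false alarm
    set(),                    # the model flagged nothing
]

per_run = [precision_recall(run, annotated_errors) for run in model_runs]
consistency = rediscovery_rate(model_runs, annotated_errors)
```

A low rediscovery rate for an error that is sometimes found is the kind of inconsistency the abstract describes: the same model, run repeatedly, does not reliably flag the same errors.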
