ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Large language models (LLMs) have demonstrated potential in assisting
scientific research, yet their ability to discover high-quality research
hypotheses remains unexamined due to the lack of a dedicated benchmark. To
address this gap, we introduce the first large-scale benchmark for evaluating
LLMs on a near-sufficient set of sub-tasks of scientific discovery:
inspiration retrieval, hypothesis composition, and hypothesis ranking. We
develop an automated framework that extracts critical components – research
questions, background surveys, inspirations, and hypotheses – from scientific
papers across 12 disciplines, with expert validation confirming its accuracy.
To prevent data contamination, we focus exclusively on papers published in
2024, ensuring minimal overlap with LLM pretraining data. Our evaluation
reveals that LLMs perform well in retrieving inspirations, an
out-of-distribution task, suggesting their ability to surface novel knowledge
associations. This positions LLMs as “research hypothesis mines”, capable of
facilitating automated scientific discovery by generating innovative hypotheses
at scale with minimal human intervention.
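The abstract lists the components extracted from each paper and names inspiration retrieval as a sub-task, but it does not say how retrieval is scored. The sketch below is a minimal illustration built on one stated assumption: the model sees a shuffled pool that mixes ground-truth inspirations with distractor papers, and its selection is scored by the fraction of ground-truth inspirations it recovers. `DiscoveryRecord`, `build_candidate_pool`, and `inspiration_hit_ratio` are hypothetical names chosen for this example. They are not ResearchBench's code or schema.

```python
import random
from dataclasses import dataclass


@dataclass
class DiscoveryRecord:
    """One extracted benchmark entry (hypothetical schema, not ResearchBench's)."""
    discipline: str
    research_question: str
    background_survey: str
    inspirations: list[str]  # titles of papers the hypothesis draws on
    hypothesis: str


def build_candidate_pool(record: DiscoveryRecord, distractors: list[str],
                         seed: int = 0) -> list[str]:
    """Assumed setup: mix ground-truth inspirations with distractor titles, then shuffle."""
    extra = [d for d in distractors if d not in record.inspirations]
    pool = list(record.inspirations) + extra
    random.Random(seed).shuffle(pool)  # deterministic shuffle for reproducibility
    return pool


def inspiration_hit_ratio(selected: list[str], record: DiscoveryRecord) -> float:
    """Fraction of ground-truth inspirations that appear in the model's selection."""
    if not record.inspirations:
        raise ValueError("record has no ground-truth inspirations")
    chosen = set(selected)
    hits = sum(1 for title in record.inspirations if title in chosen)
    return hits / len(record.inspirations)


if __name__ == "__main__":
    # Toy record with placeholder text; real entries come from 2024 papers.
    rec = DiscoveryRecord(
        discipline="chemistry",
        research_question="Q",
        background_survey="S",
        inspirations=["Paper A", "Paper B"],
        hypothesis="H",
    )
    pool = build_candidate_pool(rec, ["Paper C", "Paper D", "Paper E"])
    # Stand-in for an LLM's choice: pretend it picked two titles from the pool.
    selected = ["Paper A", "Paper D"]
    print(inspiration_hit_ratio(selected, rec))  # 0.5
```

Scoring by recall over the ground-truth inspirations keeps the number independent of pool size. An actual benchmark might instead cap how many candidates the model may select, which this sketch does not enforce.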
