DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis - Takara TLDR

The ability to research and synthesize knowledge is central to human
expertise and progress. An emerging class of systems promises these exciting
capabilities through generative research synthesis, performing retrieval over
the live web and synthesizing discovered sources into long-form, cited
summaries. However, evaluating such systems remains an open challenge: existing
question-answering benchmarks focus on short-form factual responses, while
expert-curated datasets risk staleness and data contamination. Both fail to
capture the complexity and evolving nature of real research synthesis tasks. In
this work, we introduce DeepScholar-bench, a live benchmark and holistic,
automated evaluation framework designed to evaluate generative research
synthesis. DeepScholar-bench draws queries from recent, high-quality arXiv
papers and focuses on a real research synthesis task: generating the related
work sections of a paper by retrieving, synthesizing, and citing prior
research. Our evaluation framework holistically assesses performance across
three key dimensions: knowledge synthesis, retrieval quality, and
verifiability. We also develop DeepScholar-base, a reference pipeline
implemented efficiently using the LOTUS API. Using the DeepScholar-bench
framework, we perform a systematic evaluation of prior open-source systems,
search AIs, OpenAI’s DeepResearch, and DeepScholar-base. We find that
DeepScholar-base establishes a strong baseline, attaining competitive or higher
performance than every other method. We also find that DeepScholar-bench remains
far from saturated, with no system exceeding a score of 19% across all
metrics. These results underscore the difficulty of DeepScholar-bench, as well
as its importance for progress towards AI systems capable of generative
research synthesis. We make our code available at
https://github.com/guestrin-lab/deepscholar-bench.
