Academic survey writing, which distills vast literature into a coherent and
insightful narrative, remains a labor-intensive and intellectually demanding
task. While recent approaches, such as general DeepResearch agents and
survey-specialized methods, can generate surveys automatically (a.k.a.
LLM4Survey), their outputs often fall short of human standards, and there is
no rigorous, reader-aligned benchmark for thoroughly revealing their
deficiencies. To fill this gap, we propose SurveyBench, a fine-grained,
quiz-driven evaluation framework featuring (1) typical survey topics sourced
from 11,343 recent arXiv papers and the corresponding 4,947 high-quality surveys;
(2) a multifaceted metric hierarchy that assesses outline quality (e.g.,
coverage breadth, logical coherence), content quality (e.g., synthesis
granularity, clarity of insights), and non-textual richness; and (3) a
dual-mode evaluation protocol that combines content-based evaluation and
quiz-based answerability tests, explicitly aligned with readers’ informational needs.
Results show SurveyBench effectively challenges existing LLM4Survey approaches
(e.g., their scores are on average 21% lower than those of human-written
surveys in content-based evaluation).