Deep research agents have attracted growing attention for their potential to
orchestrate multi-stage research workflows, spanning literature synthesis,
methodological design, and empirical verification. Despite this progress,
faithfully evaluating their research capability remains challenging due to
the difficulty of collecting frontier research questions that genuinely capture
researchers’ attention and intellectual curiosity. To address this gap, we
introduce DeepResearch Arena, a benchmark grounded in academic seminars, which
capture rich expert discourse and interaction, better reflect real-world
research environments, and reduce the risk of data leakage. To automatically
construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task
Generation (MAHTG) system that extracts research-worthy inspirations from
seminar transcripts. The system then translates these inspirations into
high-quality research tasks, ensuring traceable task formulation while
filtering noise. Using MAHTG, we
curate DeepResearch Arena with over 10,000 high-quality research tasks drawn
from more than 200 academic seminars across 12 disciplines, including literature,
history, and science. Our extensive evaluation shows that DeepResearch Arena
presents substantial challenges for current state-of-the-art agents, with clear
performance gaps observed across different models.