Artificial intelligence is undergoing a paradigm shift from closed language
models to interconnected agent systems capable of external perception and
information integration. As a representative embodiment of this shift, Deep
Research Agents (DRAs) systematically exhibit capabilities in task
decomposition, cross-source retrieval, multi-stage reasoning, and structured
output generation, which markedly enhance performance on complex, open-ended
tasks. However, existing benchmarks remain limited in their evaluation
dimensions, supported response formats, and scoring mechanisms, which
constrains their capacity to assess such systems effectively.
This paper introduces a rigorous benchmark and a multidimensional evaluation
framework tailored to DRAs and report-style responses. The benchmark comprises
214 challenging, expert-curated queries distributed across 10 broad thematic
domains, each query accompanied by a manually constructed reference bundle to
support composite evaluation. The framework enables comprehensive evaluation of
long-form reports generated by DRAs, incorporating integrated scoring metrics
for semantic quality, topical focus, and retrieval trustworthiness. Extensive
experiments confirm that mainstream DRAs outperform reasoning models augmented
with web-search tools, yet reveal considerable room for further improvement.
This study provides a robust foundation for capability
assessment, architectural refinement, and paradigm advancement in DRA systems.
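
As an illustration of the composite evaluation described above, the sketch
below shows one way the three scoring dimensions (semantic quality, topical
focus, and retrieval trustworthiness) could be aggregated into a single
report-level score. The DimensionScores structure, the composite_score
function, and the default weights are illustrative assumptions, not the
framework's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class DimensionScores:
        """Per-report scores for the three evaluation dimensions, each in [0, 1]."""
        semantic_quality: float           # coherence and correctness of the report content
        topical_focus: float              # how closely the report stays on the queried topic
        retrieval_trustworthiness: float  # reliability of the cited / retrieved sources

    def composite_score(scores: DimensionScores,
                        weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
        """Aggregate the three dimension scores into one report-level score.

        The weighted-sum form and the default weights are illustrative
        assumptions; the benchmark's actual aggregation may differ.
        """
        w_sem, w_topic, w_trust = weights
        total = w_sem + w_topic + w_trust
        return (w_sem * scores.semantic_quality
                + w_topic * scores.topical_focus
                + w_trust * scores.retrieval_trustworthiness) / total

    # Example: score a single DRA-generated report against its reference bundle.
    report_scores = DimensionScores(semantic_quality=0.82,
                                    topical_focus=0.75,
                                    retrieval_trustworthiness=0.68)
    print(f"composite score: {composite_score(report_scores):.3f}")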