💫 Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.
BrowseComp-ZH serves as a critical testbed for assessing:
🔹 Reasoning-augmented LLMs
🔹 Agent-based search systems
🔹 Retrieval-augmented generation (RAG) in non-English contexts
We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.
📊 Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1️⃣ GPT-4o: 6.2% accuracy
2️⃣ Most models scored below 10%
3️⃣ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9% accuracy
Why is this benchmark so challenging?
❗ Chinese web content is highly fragmented across platforms
❗ Tasks demand multi-hop reasoning and synthesis of evidence across many pages
This work is a collaboration between HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.