Paper page - BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

💫 Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.

BrowseComp-ZH serves as a critical testbed for assessing:
Reasoning-augmented LLMs
Agent-based search systems
Retrieval-augmented generation (RAG) in non-English contexts

We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.

📊 Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1️⃣ GPT-4o: 6.2% accuracy
2️⃣ Most models scored below 10%
3️⃣ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9%
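
To put these percentages in terms of questions, here is a minimal sketch that converts a reported accuracy into an approximate count of correct answers. It assumes accuracy means the share of all 289 questions answered correctly. The counts it prints are inferred from the rounded percentages and are not figures reported in this post; the helper names `implied_correct` and `accuracy_pct` are purely illustrative.

```python
TOTAL_QUESTIONS = 289  # benchmark size stated in the post


def implied_correct(reported_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Nearest whole number of correct answers implied by a reported accuracy.

    Assumption: accuracy = correct answers / all questions in the benchmark.
    """
    return round(reported_pct / 100 * total)


def accuracy_pct(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Accuracies quoted in the post; the counts below are inferred, not reported.
reported = {"GPT-4o": 6.2, "OpenAI DeepResearch": 42.9}

for name, pct in reported.items():
    n = implied_correct(pct)
    print(f"{name}: roughly {n} of {TOTAL_QUESTIONS} correct ({accuracy_pct(n)}%)")
```

Under that assumption, 6.2% corresponds to about 18 of 289 questions and 42.9% to about 124 of 289.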

Why is this benchmark so challenging?
❗ Chinese web content is highly fragmented across platforms
❗ Tasks demand multi-hop reasoning and cross-page synthesis

This work is a collaboration between HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.
