We introduce MCP-Bench, a benchmark for evaluating large language models
(LLMs) on realistic, multi-step tasks that demand tool use, cross-tool
coordination, precise parameter control, and planning/reasoning for solving
tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28
representative live MCP servers spanning 250 tools across domains such as
finance, traveling, scientific computing, and academic search. Unlike prior
API-based benchmarks, each MCP server provides a set of complementary tools
designed to work together, enabling the construction of authentic, multi-step
tasks with rich input-output coupling. Tasks in MCP-Bench test agents’ ability
to retrieve relevant tools from fuzzy instructions without explicit tool names,
plan multi-hop execution trajectories for complex objectives, ground responses
in intermediate tool outputs, and orchestrate cross-domain workflows –
capabilities not adequately evaluated by existing benchmarks that rely on
explicit tool specifications, shallow few-step workflows, and isolated domain
operations. We propose a multi-faceted evaluation framework covering tool-level
schema understanding and usage, trajectory-level planning, and task completion.
Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code
and data: https://github.com/Accenture/mcp-bench.