REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

[Submitted on 15 Apr 2025 (v1), last revised 17 Apr 2025 (this version, v2)]

Authors: Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani

Abstract: We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
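The two-part evaluation described in the abstract (programmatic checks of website state for action-based tasks, rubric-guided judgments for information retrieval) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `ActionTask`, `RetrievalTask`, `check_action`, `check_retrieval`, and `success_rate` are hypothetical and are not the REAL harness API, the site state is assumed to be a flat dictionary, and the `judge` callable is a stand-in for whatever LLM-based judge the framework actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# NOTE: all names below are hypothetical, chosen for illustration only.
# They are not taken from the REAL codebase.


@dataclass
class ActionTask:
    task_id: str
    # Key/value pairs that must hold in the final website state (assumed flat).
    expected_state: Dict[str, Any]


@dataclass
class RetrievalTask:
    task_id: str
    # Free-text rubric that a judge uses to grade the agent's answer.
    rubric: str


def check_action(task: ActionTask, final_state: Dict[str, Any]) -> bool:
    """Programmatic check: every expected key is present with the expected value."""
    return all(
        key in final_state and final_state[key] == value
        for key, value in task.expected_state.items()
    )


def check_retrieval(
    task: RetrievalTask, answer: str, judge: Callable[[str, str], bool]
) -> bool:
    """Rubric-guided check: delegate the pass/fail decision to a judge callable.

    In a real setup the judge would call an LLM; here it is any function
    taking (rubric, answer) and returning True or False.
    """
    return judge(task.rubric, answer)


def success_rate(results: List[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0
```

Because the simulated sites are deterministic, a check like `check_action` can compare the final state against a fixed expectation without accounting for content that changes between runs. The judge is passed in as a plain callable so that a deterministic stub can replace the LLM during testing.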

Submission history

From: Sumeet Motwani
[v1] Tue, 15 Apr 2025 18:22:55 UTC (2,252 KB)
[v2] Thu, 17 Apr 2025 16:28:46 UTC (2,252 KB)
