Paper Page - Open CaptchaWorld: A Comprehensive Web-based Platform For Testing And Benchmarking Multimodal LLM Agents

CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM (MLLM) agents have demonstrated impressive performance on static perception tasks, their ability to handle interactive, multi-step reasoning challenges such as CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, each annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve a puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly: the best success rate, achieved by Browser-Use OpenAI-o3, is 40.0%, far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems.
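To make the two quantities in the abstract concrete, here is a minimal Python sketch of how a reasoning-depth annotation and a success rate could be computed. It is an illustration only, not the authors' implementation: the `Step` class, the "cognitive"/"motor" labels, the equal weighting of steps, and the sample puzzle and outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str         # hypothetical label: "cognitive" or "motor"
    description: str  # free-text note on what the solver does


def reasoning_depth(steps):
    """Count the annotated steps for one puzzle (assumes equal weight per step)."""
    return len(steps)


def success_rate(outcomes):
    """Percentage of puzzles solved, given a list of booleans."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up annotation for a single image-selection puzzle.
puzzle_steps = [
    Step("cognitive", "read the instruction"),
    Step("cognitive", "locate the matching tiles"),
    Step("motor", "click each matching tile"),
    Step("motor", "press the submit button"),
]

depth = reasoning_depth(puzzle_steps)

# Made-up outcomes for four attempts, not results from the paper.
rate = success_rate([True, True, False, True])

print(f"depth={depth}, success_rate={rate:.1f}%")
```

Under these assumptions, a puzzle with more annotated steps has a higher depth, and the success rate is simply the share of attempts that end in a solved puzzle.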
