GUI agents powered by LLMs show promise in interacting with diverse digital
environments. Among these, video games offer a valuable testbed due to their
varied interfaces, with adventure games posing additional challenges through
complex, narrative-driven interactions. Existing game benchmarks, however, lack
diversity and rarely evaluate agents on completing entire storylines. To
address this, we introduce FlashAdventure, a benchmark of 34 Flash-based
adventure games designed to test full story arc completion and tackle the
observation-behavior gap: the challenge of remembering and acting on earlier
gameplay information. We also propose CUA-as-a-Judge, an automated gameplay
evaluator, and COAST, an agentic framework leveraging long-term clue memory to
better plan and solve sequential tasks. Experiments show current GUI agents
struggle with full story arcs, while COAST improves milestone completion by
bridging the observation-behavior gap. Nonetheless, a marked discrepancy between human and best-agent performance warrants continued research to narrow this divide.
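The abstract describes COAST only as an agentic framework with long-term clue memory; the Python sketch below is an illustrative guess at what such a memory might look like, not the paper's implementation. All names here (Clue, ClueMemory, record, recall) are hypothetical, and the keyword-based recall merely stands in for whatever retrieval COAST actually uses.

from dataclasses import dataclass


@dataclass
class Clue:
    """One piece of gameplay information observed earlier (e.g., a safe code)."""
    step: int
    text: str


class ClueMemory:
    """Hypothetical long-term clue store: write clues down as they are
    observed, read the relevant ones back when planning a later action."""

    def __init__(self) -> None:
        self.clues: list[Clue] = []

    def record(self, step: int, text: str) -> None:
        self.clues.append(Clue(step, text))

    def recall(self, query: str) -> list[Clue]:
        # Naive keyword overlap; a real agent would likely use embedding retrieval.
        terms = query.lower().split()
        return [c for c in self.clues if any(t in c.text.lower() for t in terms)]


# A clue seen at step 3 informs an action planned much later, which is the
# observation-behavior gap the abstract describes.
memory = ClueMemory()
memory.record(step=3, text="A note on the desk reads: safe code 7-4-1.")
print([c.text for c in memory.recall("open the safe")])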