Graphical User Interface (GUI) agents offer cross-platform solutions for
automating complex digital tasks, with significant potential to transform
productivity workflows. However, their performance is often constrained by the
scarcity of high-quality trajectory data. To address this limitation, we
propose training Vision Language Models (VLMs) on data-rich,
reasoning-intensive tasks during a dedicated mid-training stage, and then
examine how incorporating these tasks facilitates generalization to GUI
planning scenarios. Specifically, we explore a range of tasks with readily
available instruction-tuning data, including GUI perception, multimodal
reasoning, and textual reasoning. Through extensive experiments across 11
mid-training tasks, we demonstrate that: (1) Task generalization proves highly
effective, yielding substantial improvements across most settings. For
instance, multimodal mathematical reasoning enhances performance on
AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data
significantly boosts GUI agent performance, achieving a 5.6% improvement on
WebArena and a 5.4% improvement on AndroidWorld, underscoring notable
cross-modal generalization from text-based to visual domains; (2) Contrary to prior
assumptions, GUI perception data, previously considered closely aligned with
GUI agent tasks and widely used for training, has a comparatively limited
impact on final performance; (3) Building on these insights, we identify the
most effective mid-training tasks and curate optimized mixture datasets,
resulting in absolute performance gains of 8.0% on WebArena and 12.2% on
AndroidWorld. Our work provides valuable insights into cross-domain knowledge
transfer for GUI agents and offers a practical approach to addressing data
scarcity challenges in this emerging field. The code, data, and models will be
available at https://github.com/hkust-nlp/GUIMid.