GUICourse: From General Vision Language Models to Versatile GUI Agents

[Submitted on 17 Jun 2024 (v1), last revised 30 May 2025 (this version, v2)]

Authors: Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract: Graphical User Interfaces (GUIs) are essential to human-computer interaction and give access to a wide range of digital tools. Recent advances in Vision Language Models (VLMs) show compelling potential for versatile agents that help humans complete GUI navigation tasks. However, current VLMs fall short in fundamental abilities (OCR and grounding) and in GUI knowledge (the functions and control methods of GUI elements), which prevents them from becoming practical GUI agents. To address these challenges, we contribute GUICourse, a suite of datasets for training visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents perform better on common GUI tasks than their baseline VLMs. Even the small GUI agent (with 3.1B parameters) works well on single-step and multi-step GUI tasks. Finally, we use an ablation study to analyze different variants of this agent's training stage. Our source code and datasets are released at this https URL.

Submission history

From: Wentong Chen
[v1] Mon, 17 Jun 2024 08:30:55 UTC (7,315 KB)
[v2] Fri, 30 May 2025 07:30:07 UTC (8,008 KB)
