This paper introduces GUI-Owl, a foundational GUI agent model that achieves
state-of-the-art performance among open-source end-to-end models on ten GUI
benchmarks across desktop and mobile environments, covering grounding, question
answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B
achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose
Mobile-Agent-v3, a general-purpose GUI agent framework that further improves
performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new
state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates
three key innovations: (1) Large-scale Environment Infrastructure: a
cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows,
enabling our Self-Evolving GUI Trajectory Production framework. The framework
generates high-quality interaction data through automated query generation and
correctness validation, and leverages GUI-Owl itself to iteratively refine
trajectories, forming a self-improving loop. It supports diverse data pipelines
and reduces reliance on manual
annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI
grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports
end-to-end decision-making and can act as a modular component in multi-agent
systems. (3) Scalable Environment RL: we develop a reinforcement learning
framework with fully asynchronous training that aligns the model with
real-world environments. We also introduce Trajectory-aware Relative Policy
Optimization (TRPO) for
online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are
open-sourced at https://github.com/X-PLUG/MobileAgent.
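
To make the self-evolving trajectory production loop described above concrete, the sketch below shows one plausible shape of such a pipeline: query synthesis, rollouts in virtual environments, automated correctness validation, and fine-tuning on the surviving trajectories. All function names and data structures here are hypothetical placeholders, not the paper's actual interfaces.

```python
"""Minimal, illustrative sketch of a self-improving trajectory production loop.
The callables passed in (generate_queries, rollout, judge, finetune) are
hypothetical placeholders, not GUI-Owl's actual API."""
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    query: str                                        # natural-language task description
    steps: List[dict] = field(default_factory=list)   # (observation, action) records
    correct: bool = False                             # set by the correctness judge


def self_evolving_loop(
    generate_queries: Callable[[], Sequence[str]],            # automated query generation
    rollout: Callable[[object, str], Trajectory],             # run the current agent in a virtual env
    judge: Callable[[Trajectory], bool],                      # automated correctness validation
    finetune: Callable[[object, List[Trajectory]], object],   # update the agent on kept data
    agent: object,
    rounds: int = 3,
) -> object:
    """Each round: synthesize tasks, roll out the current agent, keep only
    trajectories judged correct, and fine-tune the agent on them, so the
    improved agent produces better data in the next round."""
    for _ in range(rounds):
        trajectories = [rollout(agent, q) for q in generate_queries()]
        for t in trajectories:
            t.correct = judge(t)
        kept = [t for t in trajectories if t.correct]
        agent = finetune(agent, kept)
    return agent
```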
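
The abstract names Trajectory-aware Relative Policy Optimization (TRPO) without giving its formulation. Purely as an illustration of what a trajectory-level, group-relative advantage could look like, by analogy with GRPO-style normalization and not taken from the paper, the following sketch normalizes terminal rewards within a group of rollouts of the same task.

```python
"""Illustrative trajectory-level, group-relative advantage (an assumed,
GRPO-style reading of "Trajectory-aware Relative Policy Optimization";
the paper's exact formulation may differ)."""
from typing import List


def trajectory_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Given terminal rewards for a group of trajectories sampled on the same
    task, return one advantage per trajectory: the reward minus the group mean,
    divided by the group standard deviation. Every step of a trajectory would
    then share its trajectory's advantage during the policy update."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four rollouts of the same task, two succeed (reward 1) and two fail (0).
print(trajectory_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # approx [1.0, -1.0, 1.0, -1.0]
```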