Google DeepMind has officially released the Gemini 2.5 Computer Use model in public preview: a specialised version of Gemini 2.5 Pro built to power AI agents that directly interact with graphical user interfaces (GUIs). This marks a significant move toward agents capable of performing complex digital tasks that previously required human interaction, such as filling out web forms, clicking buttons, and operating behind login screens.
The model is accessible to developers through the Gemini API via Google AI Studio and Vertex AI. Its core purpose is to let agents carry out multi-step digital workflows in web browsers and, with promising early results, in mobile applications.
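For orientation, here is a minimal sketch of requesting the model's first action, assuming the google-genai Python SDK. The preview model name and the computer-use tool configuration follow the public preview documentation but may change, so treat both as illustrative rather than definitive.

```python
# Minimal sketch: ask the Computer Use model for its next UI action.
# Assumes the google-genai SDK (pip install google-genai); the model name
# and tool configuration are preview-era and may change.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))],
)

with open("screenshot.png", "rb") as f:  # current state of the screen
    screenshot = f.read()

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=[
        types.Content(role="user", parts=[
            types.Part(text="Add the first search result to my cart."),
            types.Part.from_bytes(data=screenshot, mime_type="image/png"),
        ]),
    ],
    config=config,
)

# The reply contains a function call (e.g. a click or typing action) that
# the client is responsible for executing in the browser.
print(response.candidates[0].content.parts)
```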
How AI Agents Will Use Your Screen
While most AI models communicate with software through structured programming interfaces (APIs), a large number of real-world tasks still rely on human-facing UIs. The Gemini 2.5 Computer Use model attempts to bridge this gap.
The model operates in a continuous loop that mimics a human user (sketched in code after these steps):
Input: The system sends the agent’s current task request, a screenshot of the computer screen (the environment), and a log of recent actions to the model.
Analysis and Action: The model analyses the inputs and returns a function call. This function call represents a specific UI action, such as “click at coordinates X, Y” or “type ‘text’ into a field.”
Execution and Feedback: Client-side code executes the action in the browser. Afterwards, the system captures a new screenshot and the current URL, sending this updated environment back to the model to restart the loop.
This process continues until the original task finishes, an error occurs, or a safety mechanism stops the agent.
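In code, the loop looks roughly like the sketch below. All helper names here (request_action, execute_in_browser, capture_screenshot) are hypothetical stand-ins for the developer's own client-side code, not real Gemini SDK calls.

```python
# A sketch of the observe-act loop described above. The three helper
# functions are hypothetical placeholders for client-side code.
from dataclasses import dataclass

@dataclass
class UIAction:
    name: str   # e.g. "click_at", "type_text_at", or "done"
    args: dict  # e.g. {"x": 320, "y": 540} or {"text": "hello"}

def request_action(task: str, screenshot: bytes,
                   history: list[UIAction]) -> UIAction:
    """Send task, screenshot, and recent actions to the model; parse its reply."""
    raise NotImplementedError  # wraps the Gemini API call

def execute_in_browser(action: UIAction) -> None:
    """Perform the action client-side, e.g. via a browser automation tool."""
    raise NotImplementedError

def capture_screenshot() -> bytes:
    """Grab the current screen state to feed back to the model."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 50) -> None:
    history: list[UIAction] = []
    screenshot = capture_screenshot()
    for _ in range(max_steps):
        # 1. Input: task request, screenshot, and action log go to the model.
        action = request_action(task, screenshot, history)
        # 2. Analysis and action: the model returns a concrete UI action,
        #    or signals that the task is finished.
        if action.name == "done":
            break
        # 3. Execution: client-side code performs the action in the browser.
        execute_in_browser(action)
        history.append(action)
        # 4. Feedback: capture the new environment and restart the loop.
        screenshot = capture_screenshot()
```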
Google DeepMind reports the model is currently optimised for web browsers but shows strong initial results for controlling mobile UIs. It does not yet offer full control over desktop operating systems.
Performance and Speed
Evaluations by Google and by third-party partners such as Browserbase indicate the Gemini 2.5 Computer Use model performs well compared to other current solutions.
Benchmarking results show the model achieves superior accuracy on multiple web and mobile control tests, including Online-Mind2Web and WebVoyager. Of particular note for developers, the model reportedly offers leading browser-control performance while maintaining low latency. Lower latency means agents complete tasks faster, which translates directly into a better user experience and lower operational costs for business applications.
For instance, Poke.com, an early tester building an AI assistant, stated that the new model finished complex workflows up to 50% faster than other solutions they had considered.
Early Use Cases Emerge
The capability to automate GUI interaction has immediate, tangible business applications. Agents built on this model can handle complex data entry, conduct automated research across multiple websites, and manage user accounts.
Early applications already span several key areas:
UI Testing and Debugging: Google’s internal payments platform team uses the model as a contingency mechanism for end-to-end UI tests. The model can assess a failed test’s screen state and determine the correct actions to complete the workflow, and it has successfully repaired over 60% of test execution failures that previously required manual developer attention, saving significant development time and resources.
Workflow Automation: Autotab, which builds fully autonomous agents, reports that the model reliably parses context even in difficult scenarios, yielding up to an 18% performance increase on its hardest evaluations. This suggests higher reliability for crucial tasks like data collection and processing.
Agentic Search and Assistance: Versions of the Computer Use model already power other Google products, including Project Mariner and specific agentic features in AI Mode in Search, showcasing its potential as a general-purpose digital assistant.
Addressing New Safety Risks
Agents that can control a computer interface introduce unique security risks, from deliberate malicious misuse to accidental, harmful actions such as unwanted purchases. To manage these risks, Google DeepMind built safety measures directly into the model, as detailed in the Gemini 2.5 Computer Use System Card.
Developers also receive controls to govern the agent’s behaviour:
Per-Step Safety Service: An out-of-model service evaluates every action the agent suggests before execution. This offers a final check against risky or harmful commands.
User Confirmation for High-Stakes Actions: The model can request user confirmation before performing sensitive actions, such as making a purchase (see the sketch after this list).
System Instructions: Developers can use system instructions to block the agent from automatically completing actions deemed high-risk, like bypassing security measures, compromising a system’s integrity, or controlling critical infrastructure.
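As an illustration of the confirmation flow, the sketch below gates high-stakes actions on explicit user approval before execution. The attribute names (requires_confirmation, description) are hypothetical, not the actual API schema; consult the Computer Use documentation for the real response format.

```python
# Hypothetical sketch of gating high-stakes actions on user confirmation.
# The attribute names are illustrative, not the real API schema.
def maybe_execute(action, execute) -> bool:
    """Run `action` only if it is safe or the user explicitly approves it."""
    if getattr(action, "requires_confirmation", False):
        # e.g. the model flagged a purchase or other sensitive step
        answer = input(f"Agent wants to: {action.description}. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return False  # refuse, letting the agent replan or stop
    execute(action)
    return True
```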
These layered defences are in place to help developers build agents that are both powerful and safe. Even so, developers should thoroughly test their agents before launching them to production.