
Google DeepMind said its latest Gemini Robotics models can work across multiple robot embodiments. | Source: Google DeepMind
Google DeepMind yesterday introduced two models it claimed “unlock agentic experiences with advanced thinking” as a step toward artificial general intelligence, or AGI, for robots. Its new models are:
Gemini Robotics 1.5: DeepMind said this is its most capable vision-language-action (VLA) model yet. It can turn visual information and instructions into motor commands for a robot to perform a task. It also thinks before taking action and shows its process, enabling robots to assess and complete complex tasks more transparently. The model also learns across embodiments, accelerating skill learning.
Gemini Robotics-ER 1.5: The company said this is its most capable vision-language model (VLM). It reasons about the physical world, natively calls digital tools, and creates detailed, multi-step plans to complete a mission. DeepMind said it now achieves state-of-the-art performance across spatial understanding benchmarks.
DeepMind is making Gemini Robotics-ER 1.5 available to developers via the Gemini application programming interface (API) in Google AI Studio. Gemini Robotics 1.5 is currently available to select partners.
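For developers getting started, the call pattern is the same as for other Gemini models. Below is a minimal sketch of querying Gemini Robotics-ER 1.5 from Python with the google-genai SDK; the model identifier string and the example image file are assumptions to confirm in Google AI Studio, not details from the announcement.

```python
# Minimal sketch: querying Gemini Robotics-ER 1.5 through the Gemini API.
# Assumes `pip install google-genai` and a GEMINI_API_KEY environment variable.
# The model ID string and image filename below are assumptions, not confirmed details.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

# Any still frame from the robot's camera will do; "workbench.jpg" is a placeholder.
with open("workbench.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model ID -- check Google AI Studio
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "List the objects on the table and propose an ordered plan to clear it.",
    ],
)
print(response.text)
```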
The company asserted that the releases mark an important milestone toward solving AGI in the physical world. By introducing agentic capabilities, Google said it is moving beyond AI models that react to commands and creating systems that can reason, plan, actively use tools, and generalize.
DeepMind designs agentic experiences for physical tasks
Most daily tasks require contextual information and multiple steps to complete, making them notoriously challenging for robots today. That’s why DeepMind designed these two models to work together in an agentic framework.
Gemini Robotics-ER 1.5 orchestrates a robot’s activities, like a high-level brain. DeepMind said this model excels at planning and making logical decisions within physical environments. It has state-of-the-art spatial understanding, interacts in natural language, estimates its own success and progress, and can natively call tools such as Google Search to look up information or invoke third-party user-defined functions.
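As a rough illustration of that tool calling, the snippet below enables the Gemini API’s Google Search tool on a planning request. Whether Gemini Robotics-ER 1.5 exposes search grounding exactly this way is an assumption; the pattern shown is the SDK’s standard tool configuration, and the model ID is again assumed.

```python
# Sketch of enabling the Google Search tool for a planning query (google-genai SDK).
# Whether the robotics model supports this grounding path is an assumption here.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model ID
    contents="Before planning how to sort this recycling, look up the local "
             "recycling rules for Mountain View, CA, then list the steps.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```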
The VLM then gives Gemini Robotics 1.5 natural-language instructions for each step, and the VLA model uses its vision and language understanding to directly perform the specific actions. Gemini Robotics 1.5 also helps the robot think about its actions to better solve semantically complex tasks, and it can even explain its thinking process in natural language, making its decisions more transparent.
Both of these models are built on the core Gemini family of models and have been fine-tuned with different datasets to specialize in their respective roles. When combined, they increase the robot’s ability to generalize to longer tasks and more diverse environments, said DeepMind.
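To make the division of labor concrete, here is a hypothetical orchestration loop, not DeepMind’s actual interface: plan_next_step() stands in for Gemini Robotics-ER 1.5 producing the next natural-language instruction, and execute_step() stands in for Gemini Robotics 1.5 turning that instruction into motor commands.

```python
# Hypothetical sketch of the orchestrator/executor split described above.
# plan_next_step() and execute_step() are illustrative placeholders, not real APIs.
from dataclasses import dataclass


@dataclass
class StepResult:
    success: bool
    summary: str  # what the robot observed after acting


def plan_next_step(mission: str, history: list[str]) -> str | None:
    """Placeholder for the high-level VLM: return the next instruction, or None when done."""
    raise NotImplementedError


def execute_step(instruction: str) -> StepResult:
    """Placeholder for the VLA model driving the robot's actuators."""
    raise NotImplementedError


def run_mission(mission: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        instruction = plan_next_step(mission, history)
        if instruction is None:  # the orchestrator judges the mission complete
            break
        result = execute_step(instruction)
        # Feed the outcome back so the orchestrator can estimate progress and replan.
        status = "ok" if result.success else "failed"
        history.append(f"{instruction} -> {status}: {result.summary}")
```

The point the sketch tries to capture is the feedback loop: the orchestrator sees what happened after each step, which is what lets it estimate progress and replan.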
Robots can understand environments and think before acting
Gemini Robotics-ER 1.5 is a thinking model optimized for embodied reasoning, said Google DeepMind. The company claimed it “achieves state-of-the-art performance on both academic and internal benchmarks, inspired by real-world use cases from our trusted tester program.”
DeepMind evaluated Gemini Robotics-ER 1.5 on 15 academic benchmarks, including Embodied Reasoning Question Answering (ERQA) and Point-Bench, measuring the model’s performance on pointing, image question answering, and video question answering.
VLA models traditionally translate instructions or linguistic plans directly into a robot’s movement. Gemini Robotics 1.5 goes a step further, allowing a robot to think before taking action, said DeepMind. This means it can generate an internal sequence of reasoning and analysis in natural language to perform tasks that require multiple steps or require a deeper semantic understanding.
“For example, when completing a task like, ‘Sort my laundry by color,’ the robot in the video below thinks at different levels,” wrote DeepMind. “First, it understands that sorting by color means putting the white clothes in the white bin and other colors in the black bin. Then it thinks about steps to take, like picking up the red sweater and putting it in the black bin, and about the detailed motion involved, like moving a sweater closer to pick it up more easily.”
During a multi-level thinking process, the VLA model can decide to turn longer tasks into simpler, shorter segments that the robot can execute successfully. It also helps the model generalize to solve new tasks and be more robust to changes in its environment.
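To make those levels concrete, the toy structure below represents the laundry example as nested data; it is purely illustrative and not DeepMind’s internal plan format.

```python
# Illustrative only: a toy representation of the multi-level breakdown described above.
from dataclasses import dataclass, field


@dataclass
class Step:
    instruction: str                                  # e.g., "put the red sweater in the black bin"
    motions: list[str] = field(default_factory=list)  # finer-grained segments the robot executes


@dataclass
class Plan:
    mission: str
    interpretation: str  # the model's reading of what the mission means
    steps: list[Step] = field(default_factory=list)


laundry_plan = Plan(
    mission="Sort my laundry by color",
    interpretation="White clothes go in the white bin; colored clothes go in the black bin.",
    steps=[
        Step(
            instruction="Put the red sweater in the black bin",
            motions=[
                "move the sweater closer for an easier grasp",
                "pick up the red sweater",
                "place it in the black bin",
            ],
        ),
    ],
)
```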
Gemini learns across embodiments
Robots come in all shapes and sizes, with different sensing capabilities and degrees of freedom, making it difficult to transfer motions learned on one robot to another.
DeepMind said Gemini Robotics 1.5 shows a remarkable ability to learn across different embodiments. It can transfer motions learned from one robot to another, without needing to specialize the model to each new embodiment. This accelerates learning new behaviors, helping robots become smarter and more useful.
For example, DeepMind observed that tasks presented only to the ALOHA 2 robot during training also work on Apptronik’s humanoid robot, Apollo, and on the bi-arm Franka robot, and vice versa.
DeepMind said Gemini Robotics 1.5 implements a holistic approach to safety through high-level semantic reasoning: it thinks about safety before acting, keeps dialogue with humans respectful via alignment with existing Gemini Safety Policies, and triggers low-level safety subsystems on board the robot (e.g., for collision avoidance) when needed.
To guide safe development of its Gemini Robotics models, DeepMind is also releasing an upgrade of the ASIMOV benchmark, a comprehensive collection of datasets for evaluating and improving semantic safety, with better tail coverage, improved annotations, new safety question types, and new video modalities. In safety evaluations on the ASIMOV benchmark, Gemini Robotics-ER 1.5 showed state-of-the-art performance, and DeepMind said the model’s thinking ability contributes significantly to its improved understanding of semantic safety and better adherence to physical safety constraints.
Editor’s note: RoboBusiness 2025, which will be on Oct. 15 and 16 in Santa Clara, Calif., will include tracks on physical AI and humanoid robots. Registration is now open.
