Robotic manipulation policies often fail to generalize because they must
simultaneously learn where to attend, what actions to take, and how to execute
them. We argue that high-level reasoning about where and what can be offloaded
to vision-language models (VLMs), leaving policies to specialize in how to act.
We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which
fine-tunes VLMs to predict a unified point-based intermediate representation:
1. end-effector paths specifying what actions to take, and 2. task-relevant
masks indicating where to focus. These annotations are directly overlaid onto
robot observations, making the representation policy-agnostic and transferable
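As a minimal illustrative sketch (not the paper's released code), the overlay idea amounts to drawing the predicted 2D end-effector path and dimming pixels outside the task-relevant mask before the image reaches the policy; the function and argument names below are hypothetical assumptions.

```python
# Illustrative sketch only (not PEEK's implementation): overlay a VLM-predicted
# end-effector path and task-relevant mask onto an RGB observation before it is
# passed to the downstream policy. Names, shapes, and colors are assumptions.
import cv2
import numpy as np

def overlay_peek_annotations(image, path_points, relevance_mask, dim_factor=0.4):
    """Draw the predicted 2D path ("what") and suppress task-irrelevant pixels ("where").

    image:          HxWx3 uint8 RGB observation
    path_points:    list of (x, y) pixel coordinates ordered along the path
    relevance_mask: HxW boolean array, True over the task-relevant region
    """
    out = image.copy()

    # Dim everything outside the task-relevant mask so the policy attends to "where".
    dimmed = (out.astype(np.float32) * dim_factor).astype(np.uint8)
    out = np.where(relevance_mask[..., None], out, dimmed)

    # Draw the end-effector path as connected line segments indicating "what" to do.
    for (x1, y1), (x2, y2) in zip(path_points[:-1], path_points[1:]):
        cv2.line(out, (int(x1), int(y1)), (int(x2), int(y2)),
                 color=(255, 0, 0), thickness=2)

    return out
```

Because the cues are drawn directly in image space, any policy that consumes RGB observations can use them without architectural changes, which is what makes the representation policy-agnostic.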
across architectures. To enable scalable training, we introduce an automatic
annotation pipeline, generating labeled data across 20+ robot datasets spanning
9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot
generalization, including a 41.4x real-world improvement for a 3D policy
trained only in simulation, and 2-3.5x gains for both large vision-language-action
models (VLAs) and small
manipulation policies. By letting VLMs absorb semantic and visual complexity,
PEEK equips manipulation policies with the minimal cues they need: where, what,
and how. Website at https://peek-robot.github.io/.