Improving GUI Grounding With Explicit Position-to-Coordinate Mapping - Takara TLDR

GUI grounding, the task of mapping natural-language instructions to pixel
coordinates, is crucial for autonomous agents, yet remains difficult for
current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which
breaks when extrapolating to high-resolution displays unseen during training.
Current approaches generate coordinates as text tokens directly from visual
features, forcing the model to infer complex position-to-pixel mappings
implicitly; as a result, accuracy degrades and failures proliferate on new
resolutions. We address this with two complementary innovations. First, RULER
tokens serve as explicit coordinate markers, letting the model reference
positions, much like gridlines on a map, and adjust them rather than generate
coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial
encoding by ensuring that width and height dimensions are represented equally,
addressing the asymmetry of standard positional schemes. Experiments on
ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in
grounding accuracy, with the largest improvements on high-resolution
interfaces. By providing explicit spatial guidance rather than relying on
implicit learning, our approach enables more reliable GUI automation across
diverse resolutions and platforms.
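
The summary does not spell out how I-MRoPE balances the two spatial axes, so the following is only an illustrative sketch, not the authors' implementation. It assumes that a standard multi-axis rotary scheme hands each axis one contiguous block of rotary frequency indices, and that "interleaved" means dealing those indices to the axes in round-robin order. The function names (`chunked_assignment`, `interleaved_assignment`, `frequencies`) and the two-axis setup are hypothetical.

```python
def chunked_assignment(num_freqs, axes=("h", "w")):
    """Assumed baseline: each axis gets one contiguous block of frequency indices."""
    block = num_freqs // len(axes)
    return {axis: list(range(i * block, (i + 1) * block))
            for i, axis in enumerate(axes)}


def interleaved_assignment(num_freqs, axes=("h", "w")):
    """Assumed interleaving: frequency indices are dealt to the axes round-robin."""
    return {axis: list(range(i, num_freqs, len(axes)))
            for i, axis in enumerate(axes)}


def frequencies(indices, num_freqs, base=10000.0):
    """Usual RoPE frequency for index i out of num_freqs: base ** (-i / num_freqs)."""
    return [base ** (-i / num_freqs) for i in indices]


if __name__ == "__main__":
    n = 8  # number of rotary frequency pairs (illustrative value)
    for name, assign in (("chunked", chunked_assignment),
                         ("interleaved", interleaved_assignment)):
        layout = assign(n)
        for axis, idx in layout.items():
            freqs = frequencies(idx, n)
            print(f"{name:12s} {axis}: indices={idx} "
                  f"max_freq={max(freqs):.4f} min_freq={min(freqs):.4f}")
```

Under these assumptions, the chunked layout gives height only the high-frequency indices and width only the low-frequency ones, while the interleaved layout lets both axes span nearly the full frequency range, which is one plausible reading of "represented equally".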
