GUI-KV: Efficient GUI Agents Via KV Cache With Spatio-Temporal Awareness - Takara TLDR

Graphical user interface (GUI) agents built on vision-language models have
emerged as a promising approach to automate human-computer workflows. However,
they are also inefficient: they process long sequences of
high-resolution screenshots and solve long-horizon tasks, making inference
slow, costly and memory-bound. While key-value (KV) caching can mitigate this,
storing the full cache is prohibitive for image-heavy contexts. Existing
cache-compression methods are sub-optimal as they do not account for the
spatial and temporal redundancy of GUIs. In this work, we first analyze
attention patterns in GUI agent workloads and find that, unlike in natural
images, attention sparsity is uniformly high across all transformer layers.
This insight motivates a simple uniform budget allocation strategy, which we
show empirically outperforms more complex layer-varying schemes. Building on
this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI
agents that requires no retraining. GUI-KV combines two novel techniques: (i)
spatial saliency guidance, which augments attention scores with the L2 norm of
hidden states to better preserve semantically important visual tokens, and (ii)
temporal redundancy scoring, which projects previous frames’ keys onto the
current frame’s key subspace to preferentially prune redundant history. Across
standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV
compression baselines, closely matching full-cache accuracy at modest budgets.
Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV
reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the
full-cache baseline. These results demonstrate that exploiting GUI-specific
redundancies enables efficient and reliable agent performance.
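
The two scoring ideas described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. The abstract does not give the exact formulas, so the following are assumptions made for illustration: the min-max normalization of the norms, the weight `alpha`, the Gram-Schmidt construction of the key subspace, the use of an energy fraction as the redundancy score, and every function name.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def saliency_scores(attn, hidden, alpha=0.5):
    """Assumed form: attention score + alpha * normalized hidden-state L2 norm."""
    norms = min_max([l2_norm(h) for h in hidden])
    return [a + alpha * n for a, n in zip(attn, norms)]


def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        residual = list(v)
        for b in basis:
            coeff = dot(residual, b)
            residual = [x - coeff * y for x, y in zip(residual, b)]
        norm = l2_norm(residual)
        if norm > eps:  # skip vectors already inside the span
            basis.append([x / norm for x in residual])
    return basis


def temporal_redundancy(prev_keys, cur_keys):
    """Assumed score: fraction of each previous key's squared norm that
    lies in the subspace spanned by the current frame's keys (0 to 1)."""
    basis = orthonormal_basis(cur_keys)
    scores = []
    for k in prev_keys:
        total = dot(k, k)
        inside = sum(dot(k, b) ** 2 for b in basis)
        scores.append(inside / total if total > 0 else 1.0)
    return scores


def keep_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: three visual tokens in the current frame.
attn = [0.45, 0.3, 0.4]
hidden = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
kept_current = keep_indices(saliency_scores(attn, hidden), budget=2)

# Toy data: three history keys scored against two current-frame keys.
cur_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
prev_keys = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 0.0]]
redundancy = temporal_redundancy(prev_keys, cur_keys)
kept_history = keep_indices([1.0 - r for r in redundancy], budget=1)
```

In this toy setup, the norm term promotes a token that attention alone would have dropped, and the only history key that survives is the one pointing outside the current frame's key subspace. Under the uniform allocation the abstract describes, the same `budget` would be applied at every transformer layer rather than tuned per layer.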
