CogVLA: Cognition-Aligned Vision-Language-Action Model Via Instruction-Driven Routing & Sparsification - Takara TLDR

Recent Vision-Language-Action (VLA) models built on pre-trained
Vision-Language Models (VLMs) require extensive post-training, resulting in
high computational overhead that limits scalability and deployment. We propose
CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages
instruction-driven routing and sparsification to improve both efficiency and
performance. CogVLA draws inspiration from human multimodal coordination and
introduces a 3-stage progressive architecture. 1) Encoder-FiLM based
Aggregation Routing (EFA-Routing) injects instruction information into the
vision encoder to selectively aggregate and compress dual-stream visual tokens,
forming an instruction-aware latent representation. 2) Building upon this
compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing)
introduces action intent into the language model by pruning
instruction-irrelevant visually grounded tokens, thereby achieving token-level
sparsity. 3) To ensure that compressed perception inputs can still support
accurate and coherent action generation, we introduce V-L-A Coupled Attention
(CAtten), which combines causal vision-language attention with bidirectional
action parallel decoding. Extensive experiments on the LIBERO benchmark and
real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art
performance with success rates of 97.4% and 70.0%, respectively, while reducing
training costs by 2.5-fold and decreasing inference latency by 2.8-fold
compared to OpenVLA. CogVLA is open-sourced and publicly available at
https://github.com/JiuTian-VL/CogVLA.
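
Two of the mechanisms named above can be illustrated with a minimal sketch. This is not the CogVLA implementation. FiLM (feature-wise linear modulation) scales and shifts each feature channel with conditioning-derived parameters; how CogVLA computes those parameters from the instruction is not stated here, so `gamma` and `beta` are simply passed in. For the coupled attention, the sketch assumes that vision-language tokens form a causal prefix, and that action tokens attend to the whole prefix and to each other in both directions. The function names, the token layout, and the mask convention (True means "may attend") are all hypothetical.

```python
def film_modulate(tokens, gamma, beta):
    """Apply FiLM per channel: out[t][c] = gamma[c] * tokens[t][c] + beta[c].

    In an instruction-conditioned model, gamma and beta would be predicted
    from the instruction embedding; here they are plain inputs (assumption).
    """
    return [
        [g * x + b for x, g, b in zip(token, gamma, beta)]
        for token in tokens
    ]


def coupled_attention_mask(num_prefix, num_action):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout: positions [0, num_prefix) are vision-language tokens,
    positions [num_prefix, num_prefix + num_action) are action tokens.
    """
    total = num_prefix + num_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Vision-language prefix: causal, each token sees itself
                # and earlier positions only.
                mask[i][j] = j <= i
            else:
                # Action tokens: see the full prefix and every action token,
                # which is what allows them to be decoded in parallel.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask as a grid: 1 = may attend, 0 = blocked.
    for row in coupled_attention_mask(num_prefix=3, num_action=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, the prefix rows form a lower-triangular block and the action rows are fully open, so all action tokens can be produced in a single forward pass instead of one token at a time.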
