CogVLA: Cognition-Aligned Vision-Language-Action Model Via Instruction-Driven Routing & Sparsification - Takara TLDR

Recent Vision-Language-Action (VLA) models built on pre-trained
Vision-Language Models (VLMs) require extensive post-training, resulting in
high computational overhead that limits scalability and deployment.We propose
CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages
instruction-driven routing and sparsification to improve both efficiency and
performance. CogVLA draws inspiration from human multimodal coordination and
introduces a 3-stage progressive architecture. 1) Encoder-FiLM based
Aggregation Routing (EFA-Routing) injects instruction information into the
vision encoder to selectively aggregate and compress dual-stream visual tokens,
forming a instruction-aware latent representation. 2) Building upon this
compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing)
introduces action intent into the language model by pruning
instruction-irrelevant visually grounded tokens, thereby achieving token-level
sparsity. 3) To ensure that compressed perception inputs can still support
accurate and coherent action generation, we introduce V-L-A Coupled Attention
(CAtten), which combines causal vision-language attention with bidirectional
action parallel decoding. Extensive experiments on the LIBERO benchmark and
real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art
performance with success rates of 97.4% and 70.0%, respectively, while reducing
training costs by 2.5-fold and decreasing inference latency by 2.8-fold
compared to OpenVLA. CogVLA is open-sourced and publicly available at
https://github.com/JiuTian-VL/CogVLA.

Source link

What's Hot

Recruit with Emotional Intelligence | Recruiting News Network

Could Google Chrome soon be OpenAI Chrome?

Will Expanding Partnerships Shape the Next Phase of Growth for C3.ai?

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification – Takara TLDR

Mixture of Contexts for Long Video Generation – Takara TLDR

OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models – Takara TLDR

FakeParts: a New Family of AI-Generated DeepFakes – Takara TLDR

Woodmere Art Museum Sues Trump Administration Over Canceled IMLS Grant

Australian School Faces Pushback over AI Art Course—and More Art News

London Museum Secures Banksy’s Piranhas

Egyptian Antiquities Trafficker Sentenced to Six Months in Prison

Recruit with Emotional Intelligence | Recruiting News Network

Could Google Chrome soon be OpenAI Chrome?

Will Expanding Partnerships Shape the Next Phase of Growth for C3.ai?

What's Hot

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification – Takara TLDR

Related Posts

Subscribe to Updates