Paper Page - COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Multimodal Large Language Models (MLLMs) excel at simple vision-language
tasks but struggle with complex tasks that require multiple capabilities at
once, such as recognizing objects, counting them, and understanding their
spatial relationships. One likely contributing factor is that Visual
Instruction Tuning (VIT), a critical training step for MLLMs, has
traditionally focused on scaling data volume rather than the
compositional complexity of training examples. We propose COMPACT
(COMPositional Atomic-to-complex visual Capability Tuning), which generates a
training dataset explicitly controlling for the compositional complexity of the
training examples. The data from COMPACT allows MLLMs to train on combinations
of atomic capabilities to learn complex capabilities more efficiently. Across
all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT
while using less than 10% of its data budget, and even outperforms it on
several, especially those involving complex multi-capability tasks. For
example, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0%
improvement on MM-Vet over the full-scale VIT on particularly complex
questions that require four or more atomic capabilities. COMPACT offers a
scalable, data-efficient visual compositional tuning recipe for improving
performance on complex vision-language tasks.
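
As a rough illustration of what "controlling for compositional complexity" could mean, the sketch below groups combinations of atomic capabilities by how many capabilities each one requires. It is a minimal sketch under stated assumptions, not the paper's pipeline: the three capability names are taken only from the examples in the abstract, the helper names `capability_combinations` and `complexity_buckets` are hypothetical, and complexity is assumed to be the number of atomic capabilities a question needs.

```python
from itertools import combinations

# Hypothetical capability names, drawn only from the examples in the abstract;
# the paper's actual set of atomic capabilities is not listed here.
ATOMIC_CAPABILITIES = ["object_recognition", "counting", "spatial_relationship"]


def capability_combinations(capabilities, k):
    """Return every unordered set of k distinct capabilities as a tuple."""
    return list(combinations(capabilities, k))


def complexity_buckets(capabilities, max_k):
    """Map each complexity level k (1..max_k) to its capability combinations."""
    return {
        k: capability_combinations(capabilities, k)
        for k in range(1, max_k + 1)
    }


# Each bucket could then drive question generation at a fixed complexity level.
buckets = complexity_buckets(ATOMIC_CAPABILITIES, max_k=3)
for k, combos in buckets.items():
    print(k, combos)
```

Sampling a fixed number of questions per bucket, rather than simply adding more data, is one way a dataset could stay small while still covering the higher-complexity combinations.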
