Paper Page - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Self-supervised learning using image triplets enhances the reasoning ability of Vision-Language Models (VLMs) on multi-image tasks without the need for human-annotated question-answer pairs.

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual
cues across multiple images. A straightforward solution is to adapt rule-based
reinforcement learning for Vision-Language Models (VLMs). However, such methods
typically rely on manually curated question-answer pairs, which are
particularly challenging to construct when dealing with fine-grained visual details and
complex logic across images. Inspired by self-supervised visual representation
learning, we observe that images contain inherent constraints that can serve as
supervision. Based on this insight, we construct image triplets comprising two
augmented views of the same image and a third, similar but distinct image.
During training, the model is prompted to generate a reasoning process to
compare these images (i.e., determine same or different). Then we optimize the
model with rule-based reinforcement learning. Due to the high visual similarity
and the presence of augmentations, the model must attend to subtle visual
changes and perform logical reasoning to succeed. Experiments show that,
although trained solely on visual comparison tasks, the learned reasoning
ability generalizes effectively to a wide range of questions. Without relying
on any human-annotated question-answer pairs, our method achieves significant
improvements on multi-image reasoning benchmarks and shows strong performance
on general vision tasks.
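
The triplet construction and rule-based reward described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: images are stand-in 2D grids of pixel values, and the augmentation (random crop plus horizontal flip), the `<answer>` tag format, the binary 0/1 reward, and all function names (`augment`, `build_triplet`, `rule_based_reward`) are assumptions made for the example.

```python
import random
import re


def augment(image, rng):
    """Return a randomly cropped and possibly flipped copy of a 2D pixel grid.

    The crop-and-flip choice is an assumption; the abstract only says "augmented views".
    """
    h, w = len(image), len(image[0])
    ch, cw = max(1, int(h * 0.8)), max(1, int(w * 0.8))
    top = rng.randint(0, h - ch)
    left = rng.randint(0, w - cw)
    view = [row[left:left + cw] for row in image[top:top + ch]]
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]  # horizontal flip
    return view


def build_triplet(image, similar_image, rng):
    """Build three images: two augmented views of `image` plus `similar_image`.

    Returns the shuffled images and a same/different label for every pair.
    """
    tagged = [
        ("anchor", augment(image, rng)),
        ("anchor", augment(image, rng)),
        ("other", similar_image),  # third image kept as-is in this sketch
    ]
    rng.shuffle(tagged)
    images = [view for _, view in tagged]
    labels = {}
    for i in range(3):
        for j in range(i + 1, 3):
            same = tagged[i][0] == tagged[j][0]
            labels[(i, j)] = "same" if same else "different"
    return images, labels


# Hypothetical answer format: the model ends its reasoning with <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(same|different)\s*</answer>", re.IGNORECASE)


def rule_based_reward(response, label):
    """Return 1.0 if the parsed answer matches the label, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1).lower() == label else 0.0
```

Because the labels follow directly from how the triplet was built, the reward can be computed without any human-written question-answer pair, which is the property the method relies on. In a full training loop, this reward would feed a reinforcement learning optimizer, which is outside the scope of the sketch.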
