Visual Jigsaw Post-Training Improves MLLMs - Takara TLDR

Reinforcement learning-based post-training has recently emerged as a powerful
paradigm for enhancing the alignment and reasoning capabilities of multimodal
large language models (MLLMs). While vision-centric post-training is crucial
for enhancing MLLMs’ intrinsic understanding of visual signals, current
post-training paradigms are predominantly text-centric, where dense visual
inputs are only leveraged to extract sparse cues for text-based reasoning.
A few approaches move in this direction, but they often still rely
on text as an intermediate mediator or introduce additional visual generative
designs. In this work, we introduce Visual Jigsaw, a generic self-supervised
post-training framework designed to strengthen visual understanding in MLLMs.
Visual Jigsaw is formulated as a general ordering task: visual inputs are
partitioned, shuffled, and the model must reconstruct the visual information by
producing the correct permutation in natural language. This naturally aligns
with reinforcement learning from verifiable rewards (RLVR), requires no
additional visual generative components, and derives its supervisory signal
automatically without any annotations. We instantiate Visual Jigsaw across
three visual modalities: images, videos, and 3D data. Extensive
experiments demonstrate substantial improvements in fine-grained perception,
temporal reasoning, and 3D spatial understanding. Our findings highlight the
potential of self-supervised vision-centric tasks in post-training MLLMs and
aim to inspire further research on vision-centric pretext designs. Project
Page: https://penghao-wu.github.io/visual_jigsaw/
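
The ordering task described above (partition, shuffle, answer with a permutation, score automatically) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the comma- or space-separated answer format, and the all-or-nothing reward are assumptions made for illustration, and strings stand in for image patches, video clips, or 3D points.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_jigsaw(pieces: Sequence[str], seed: int) -> Tuple[List[str], List[int]]:
    """Shuffle `pieces` and return (shuffled, answer).

    answer[i] is the index in `shuffled` of the piece that was originally i-th,
    so reading `shuffled` in the order given by `answer` restores the input.
    The label comes from the shuffle itself, so no annotation is needed.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)  # order[j] = original index of the j-th shuffled piece
    shuffled = [pieces[k] for k in order]
    answer = [order.index(i) for i in range(len(pieces))]
    return shuffled, answer


def parse_permutation(text: str, n: int) -> Optional[List[int]]:
    """Parse a reply such as '2, 0, 1'; return None unless it is a permutation of 0..n-1."""
    try:
        perm = [int(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return None
    return perm if sorted(perm) == list(range(n)) else None


def reward(text: str, answer: List[int]) -> float:
    """Hypothetical verifiable reward: 1.0 for an exact match, 0.0 otherwise."""
    perm = parse_permutation(text, len(answer))
    return 1.0 if perm == answer else 0.0


if __name__ == "__main__":
    original = ["top-left", "top-right", "bottom-left", "bottom-right"]
    shuffled, answer = make_jigsaw(original, seed=0)
    print("shuffled pieces:", shuffled)
    print("reward for correct reply:", reward(", ".join(map(str, answer)), answer))
```

Because the reward is computed by comparing the model's text output against a label that is known by construction, it can be checked programmatically, which is what makes the task a natural fit for RLVR. A graded reward that gives partial credit for partly correct orderings would be a reasonable variation on this sketch.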
