Paper Page - MINT-CoT: Enabling Interleaved Visual Tokens In Mathematical Chain-of-Thought Reasoning

MINT-CoT enhances multimodal mathematical reasoning by interleaving visual tokens into textual chain-of-thought steps, enabling flexible visual perception and improved problem-solving.

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large
Language Models (LLMs), but extending it to multimodal domains remains
challenging. Existing works either adopt similar textual reasoning for
image input, or seek to interleave visual signals into mathematical CoT.
However, they face three key limitations for math problem-solving: reliance on
coarse-grained box-shaped image regions, limited perception of vision encoders
on math content, and dependence on external capabilities for visual
modification. In this paper, we propose MINT-CoT, introducing Mathematical
INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively
interleaves relevant visual tokens into textual reasoning steps via an
Interleave Token, which dynamically selects visual regions of any shape within
math figures. To enable this capability, we construct the MINT-CoT dataset,
containing 54K mathematical problems that align each reasoning step with visual
regions at the token level, accompanied by a rigorous data generation pipeline.
We further present a three-stage MINT-CoT training strategy, progressively
combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which
yields our MINT-CoT-7B model. Extensive experiments demonstrate the
effectiveness of our method for visual interleaved reasoning in
mathematical domains, where MINT-CoT-7B outperforms the baseline model by
+34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar. Our
code and data are available at https://github.com/xinyan-cxy/MINT-CoT
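
The abstract describes an Interleave Token that picks visual tokens covering regions of any shape, but it does not say how the selection is computed. The sketch below is one plausible reading, not the authors' implementation: it assumes the Interleave Token's hidden state is compared to each visual token by cosine similarity and that tokens above a threshold are kept. The function names, the threshold value, and the toy 2x2 token grid are all hypothetical.

```python
import numpy as np

def select_visual_tokens(interleave_state, visual_tokens, threshold=0.5):
    """Return indices of visual tokens similar to the interleave state.

    Assumption: selection is cosine similarity plus a fixed threshold.
    The abstract does not specify the actual mechanism.
    """
    h = interleave_state / np.linalg.norm(interleave_state)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    scores = v @ h  # one cosine similarity per visual token
    return np.flatnonzero(scores > threshold)

def indices_to_mask(indices, grid_h, grid_w):
    """Map selected token indices back onto the image's token grid."""
    mask = np.zeros(grid_h * grid_w, dtype=bool)
    mask[indices] = True
    return mask.reshape(grid_h, grid_w)

# Toy example: four visual tokens from a 2x2 grid, embedding size 2.
visual_tokens = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
interleave_state = np.array([1.0, 0.0])

selected = select_visual_tokens(interleave_state, visual_tokens, threshold=0.5)
mask = indices_to_mask(selected, grid_h=2, grid_w=2)
print(selected)  # indices of the tokens to interleave into the reasoning step
print(mask)      # the selected region on the grid; it need not be a box
```

Because selection happens per token rather than per bounding box, the resulting mask can follow an arbitrary shape, which is the property the abstract contrasts with coarse box-shaped regions.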
