Unified multimodal understanding and generation models have recently achieved
significant improvements in image generation capability, yet a large gap remains
in instruction following and detail preservation compared to systems that
tightly couple comprehension with generation, such as GPT-4o. Motivated by
recent advances in interleaving reasoning, we explore whether such reasoning
can further improve Text-to-Image (T2I) generation. We introduce Interleaving
Reasoning Generation (IRG), a framework that alternates between text-based
thinking and image synthesis: the model first produces a text-based thinking
process to guide an initial image, then reflects on the result to refine fine-grained
details, visual quality, and aesthetics while preserving semantics. To train
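As a rough illustrative sketch (not the authors' implementation), the interleaved think-generate-reflect loop could be expressed as below; the interface names `generate_text` and `generate_image`, and the fixed two-round structure, are assumptions made purely for illustration.

```python
# Hypothetical sketch of the Interleaving Reasoning Generation (IRG) loop.
# `model.generate_text` and `model.generate_image` are assumed interfaces of a
# unified model that natively emits interleaved text-image outputs; the actual
# API in the released code may differ.

def interleaving_reasoning_generation(model, prompt, num_rounds=2):
    """Alternate text-based thinking and image synthesis for T2I generation."""
    history = [("user", prompt)]
    image = None
    for _ in range(num_rounds):
        # 1) Text-based thinking: plan the image (first round) or reflect on
        #    the previous image to propose fine-grained refinements (later rounds).
        thought = model.generate_text(history)
        history.append(("thought", thought))

        # 2) Image synthesis conditioned on the full interleaved history,
        #    so refinements preserve the established semantics.
        image = model.generate_image(history)
        history.append(("image", image))
    return image
```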
IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL),
which targets two sub-goals: (1) strengthening the initial think-and-generate
stage to establish core content and base quality, and (2) enabling high-quality
textual reflection and faithful implementation of those refinements in a
subsequent image. We curate IRGL-300K, a dataset organized into six decomposed
learning modes that jointly cover learning text-based thinking and full
thinking-image trajectories. Starting from a unified foundation model that
natively emits interleaved text-image outputs, our two-stage training first
builds robust thinking and reflection, then efficiently tunes the IRG pipeline
on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA)
performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF,
GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality
and fine-grained fidelity. The code, model weights, and datasets will be
released at: https://github.com/Osilly/Interleaving-Reasoning-Generation.