Paper Page - BLIP3-o: A Family Of Fully Open Unified Multimodal Models-Architecture, Training And Dataset

Unifying image understanding and generation has gained growing attention in
recent research on multimodal models. Although design choices for image
understanding have been extensively studied, the optimal model architecture and
training recipe for a unified framework with image generation remain
underexplored. Motivated by the strong potential of autoregressive and
diffusion models for high-quality generation and scalability, we conduct a
comprehensive study of their use in unified multimodal settings, with emphasis
on image representations, modeling objectives, and training strategies.
Grounded in these investigations, we introduce a novel approach that employs a
diffusion transformer to generate semantically rich CLIP image features, in
contrast to conventional VAE-based representations. This design yields both
higher training efficiency and improved generative quality. Furthermore, we
demonstrate that a sequential pretraining strategy for unified models (first
training on image understanding and subsequently on image generation) offers
practical advantages by preserving image understanding capability while
developing strong image generation ability. Finally, we carefully curate a
high-quality instruction-tuning dataset BLIP3o-60k for image generation by
prompting GPT-4o with a diverse set of captions covering various scenes,
objects, human gestures, and more. Building on our innovative model design,
training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art
unified multimodal models. BLIP3-o achieves superior performance across most of
the popular benchmarks spanning both image understanding and generation tasks.
To facilitate future research, we fully open-source our models, including code,
model weights, training scripts, and pretraining and instruction tuning
datasets.
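
To make the idea of diffusing over CLIP image features (rather than VAE latents) more concrete, here is a minimal sketch of a generic noise-prediction diffusion loss applied to a feature vector. It is an illustration under stated assumptions, not the BLIP3-o implementation: the abstract does not specify the training objective or noise schedule, so the cosine schedule, the conditioning vector, and the names `cosine_alpha_bar`, `add_noise`, `denoising_loss`, and `zero_predictor` are all hypothetical, and random numbers stand in for real CLIP features.

```python
import math
import random


def cosine_alpha_bar(t: float) -> float:
    """Fraction of clean signal kept at time t in [0, 1] (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * t) ** 2


def add_noise(x0, eps, t):
    """Forward diffusion step: blend clean features x0 with Gaussian noise eps."""
    a = cosine_alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def denoising_loss(predict_noise, x0, cond, t, rng):
    """Mean squared error between the sampled noise and the model's prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, t)
    eps_hat = predict_noise(x_t, cond, t)
    return sum((a - b) ** 2 for a, b in zip(eps, eps_hat)) / len(x0)


def zero_predictor(x_t, cond, t):
    """Hypothetical placeholder for a diffusion transformer: always predicts zero noise."""
    return [0.0] * len(x_t)


rng = random.Random(0)
# Random stand-in for a CLIP image feature vector (assumption: 16 dimensions).
clip_features = [rng.gauss(0.0, 1.0) for _ in range(16)]
# Stand-in for conditioning information, e.g. from a text prompt (assumption).
condition = [0.0] * 16
loss = denoising_loss(zero_predictor, clip_features, condition, t=0.5, rng=rng)
```

In a real system the placeholder predictor would be a trained network and the targets would come from a CLIP image encoder; the point of the sketch is only that the diffusion target is a semantic feature vector rather than a pixel-level latent.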
