Ming-Omni is a unified multimodal model with dedicated encoders and modality-specific routers that can process images, text, audio, and video and perform tasks such as speech and image generation, context-aware chatting, and versatile image editing.
We propose Ming-Omni, a unified multimodal model capable of processing
images, text, audio, and video, while demonstrating strong proficiency in both
speech and image generation. Ming-Omni employs dedicated encoders to extract
tokens from different modalities, which are then processed by Ling, an MoE
architecture equipped with newly proposed modality-specific routers. This
design enables a single model to efficiently process and fuse multimodal inputs
within a unified framework, thereby facilitating diverse tasks without
requiring separate models, task-specific fine-tuning, or structural redesign.
Importantly, Ming-Omni extends beyond conventional multimodal models by
supporting audio and image generation. This is achieved through the integration
of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for
high-quality image generation, which together allow the model to engage in
context-aware chatting, perform text-to-speech conversion, and conduct
versatile image editing. Our experimental results show that Ming-Omni offers a
powerful solution for unified perception and generation across all modalities.
Notably, to our knowledge Ming-Omni is the first open-source model
to match GPT-4o in modality support, and we release all code and model weights
to encourage further research and development in the community.
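To make the routing idea concrete, the sketch below illustrates one way a modality-specific router could sit inside an MoE layer: a shared pool of experts with a separate gating projection per modality. This is a minimal illustration under our own assumptions, not the actual Ming-Omni/Ling implementation; all names and hyperparameters (ModalitySpecificRouter, d_model, num_experts, top_k) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySpecificRouter(nn.Module):
    """Illustrative MoE layer with one gating network per modality.

    Tokens from all modalities share the same expert pool, but the router
    that assigns tokens to experts is selected by the token's modality.
    (Hypothetical sketch; not the Ming-Omni source code.)
    """

    def __init__(self, d_model=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared expert feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # One gating projection per modality: the "modality-specific router".
        self.routers = nn.ModuleDict(
            {m: nn.Linear(d_model, num_experts) for m in modalities}
        )

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (num_tokens, d_model), all from the same modality here.
        logits = self.routers[modality](tokens)          # (T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Dispatch each token to its selected experts and mix the outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out
```

Routing this way keeps a single shared expert pool across modalities while letting each modality learn its own expert preferences, which is the property the abstract attributes to the modality-specific router design.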