Qwen3-Omni Technical Report - Takara TLDR

We present Qwen3-Omni, a single multimodal model that, for the first time,
maintains state-of-the-art performance across text, image, audio, and video
without any degradation relative to single-modal counterparts. Qwen3-Omni
matches the performance of same-sized single-modal models within the Qwen
series and excels particularly on audio tasks. Across 36 audio and audio-visual
benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall
SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro,
Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE
architecture that unifies perception and generation across text, images, audio,
and video, yielding fluent text and natural real-time speech. It supports text
interaction in 119 languages, speech understanding in 19 languages, and speech
generation in 10 languages. To reduce first-packet latency in streaming
synthesis, Talker autoregressively predicts discrete speech codecs using a
multi-codebook scheme. Leveraging the representational capacity of these
codebooks, we replace computationally intensive block-wise diffusion with a
lightweight causal ConvNet, enabling streaming from the first codec frame. In
cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet
latency of 234 ms. To further strengthen multimodal reasoning, we introduce a
Thinking model that explicitly reasons over inputs from any modality. Since the
research community currently lacks a general-purpose audio captioning model, we
fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which
produces detailed, low-hallucination captions for arbitrary audio inputs.
Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and
Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0
license.

Source link

What's Hot

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models – Takara TLDR

Meta Allows US Allies to Access Llama AI Models for Military Training

Will OpenAI Really Build 60 Football Fields Worth of AI Infrastructure Per Week?

Qwen3-Omni Technical Report – Takara TLDR

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models – Takara TLDR

AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing? – Takara TLDR

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment – Takara TLDR

Court Rules ‘Gender Ideology’ Ban on Art Endowments Unconstitutional

Rural Danish Art Museum Acquires Painting By Artemisia Gentileschi

Dan Nadel Is Expanding American Art History, One Outlier at a Time

Bernard Arnault Says French Wealth Tax Will ‘Destroy’ the Economy

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models – Takara TLDR

Meta Allows US Allies to Access Llama AI Models for Military Training

Will OpenAI Really Build 60 Football Fields Worth of AI Infrastructure Per Week?

What's Hot

Qwen3-Omni Technical Report – Takara TLDR

Related Posts

Subscribe to Updates