On September 8, 2025, Alibaba’s Qwen team introduced Qwen3-ASR Flash, an automatic speech recognition (ASR) system covering 11 languages — as well as multiple dialects and accents — and a range of acoustic conditions, positioned as an all-in-one transcription service.
Unlike conventional systems that require separate models for different languages or conditions, Qwen3-ASR Flash consolidates capabilities into a single API-based model. It supports Mandarin, Cantonese, Sichuanese, Hokkien, Wu, and English (with British and American accents), alongside French, German, Spanish, Italian, Portuguese, Russian, Japanese, Korean, and Arabic.
The system is also trained to reject non-speech inputs such as background noise or silence, a feature that enhances transcription accuracy in real-world use cases.
A distinguishing aspect of Qwen3-ASR Flash is its ability to integrate contextual biasing directly into recognition. Users can provide background text in almost any format, from keyword lists to full paragraphs or entire documents, to steer the model toward domain-specific terms and phrasing. “Qwen3-ASR-Flash eliminates the need for preprocessing of contextual information,” Alibaba noted.
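Alibaba's announcement does not spell out the request format, so the snippet below is only a hypothetical sketch of how such a context-aware transcription call might look; the endpoint URL, field names, and response shape are assumptions for illustration, not Alibaba's documented API.

```python
# Hypothetical sketch only: the endpoint, field names, and auth header are
# assumptions for illustration, not Alibaba's documented Qwen3-ASR Flash API.
import base64
import requests

API_URL = "https://example.invalid/v1/asr/transcriptions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

# Free-form background text used for contextual biasing: per the announcement,
# keyword lists, paragraphs, or entire documents can be passed without preprocessing.
context = (
    "University chemistry lecture covering catalysis, stoichiometry, "
    "and Le Chatelier's principle."
)

with open("lecture.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "qwen3-asr-flash",
    "audio": audio_b64,   # base64-encoded audio (assumed encoding)
    "context": context,   # biasing text passed as-is
    "language": "auto",   # assumed option: let the model detect language/dialect
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # expected to contain the transcript text
```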
The model is also designed to maintain accuracy under difficult conditions, including far-field recordings, noisy environments such as vehicles or crowded venues, and even music-heavy scenarios. In fact, one of its more unusual capabilities is the ability to transcribe singing voices with background music, a task that has challenged earlier systems.
Alibaba demonstrated the model’s range with examples from rap lyrics, match commentary, chemistry lectures, and multilingual speech. Benchmarks place its word error rate below 8%, including in music and noisy environments, surpassing Google’s Gemini-2.5-Pro, OpenAI’s GPT-4o-Transcribe, and ByteDance’s Doubao-ASR.
You can test it here: https://huggingface.co/spaces/Qwen/Qwen3-ASR-Demo
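Because the demo runs as a public Hugging Face Space, it can also be exercised programmatically with the `gradio_client` package. The sketch below assumes the Space exposes a transcription endpoint that accepts an audio file plus optional context text; the endpoint name and parameters are guesses, so check the output of `view_api()` before relying on them.

```python
# Sketch for calling the public demo Space via gradio_client.
# The api_name and parameters below are assumptions; run view_api()
# to see the Space's actual endpoints first.
from gradio_client import Client, handle_file

client = Client("Qwen/Qwen3-ASR-Demo")

# Print the Space's real endpoints and their expected arguments.
client.view_api()

# Illustrative call, assuming an endpoint that takes an audio file and
# free-form context text for biasing.
result = client.predict(
    handle_file("sample.wav"),
    "Background text with domain-specific terms.",
    api_name="/transcribe",  # assumed endpoint name
)
print(result)
```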
The launch has drawn attention on Reddit and X, where users praised the model’s performance. However, some noted that it is currently API-only, with no open-weight release, describing it as a “tough sell” compared to Whisper.
As one Reddit user observed: “Unless this model provides word-level timestamps, diarization, or confidence scores, it’s going to be a tough sell.”
Alibaba has not confirmed whether Qwen3-ASR will follow other Qwen models with an eventual open release. For now, the system reflects a state-of-the-art but closed deployment model, contrasting with open alternatives like OpenAI’s Whisper, Nvidia’s Parakeet, or Mistral AI’s Voxtral.