Alibaba Group Holding’s AI and cloud computing unit on Wednesday released Wan2.2-S2V, its latest open-source artificial intelligence model, which generates expressive, film-quality character videos from a static image and an audio clip.
The new model forms part of Alibaba Cloud’s Wan2.2 family, which the company last month touted as comprising the AI industry’s first open-source large video-generation models built on the so-called Mixture-of-Experts (MoE) architecture. Hangzhou-based Alibaba owns the Post.
Powered by advanced audio-driven animation technology, the Wan2.2-S2V model “delivers lifelike character performances, ranging from natural dialogue to musical performances, and seamlessly handles multiple characters within a scene”, Alibaba Cloud said on Wednesday.
Wan2.2-S2V could be used by professional content creators to “capture precise visual representations tailored to specific storytelling and design requirements”, the company said. It attributed that capability to the model’s large-scale audiovisual data set, which is tailored to film and television production scenarios.
The latest Wan2.2 variant reflects how Chinese AI companies are continuing to narrow the gap with their US peers through the open-source approach, which makes the source code of AI models available for third-party developers to use, modify and distribute.
Alibaba Cloud’s Wan2.2 family was designed to meet the diverse needs of professional AI-generated content creators. Photo: Handout
Wan2.2-S2V can now be downloaded from online developer platforms Hugging Face and GitHub, as well as from Alibaba Cloud’s ModelScope open-source community.
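For illustration only, fetching the released weights from Hugging Face might look like the sketch below, which uses the huggingface_hub library’s snapshot_download helper; the repository id shown is an assumption and should be checked against the model’s actual Hugging Face listing.

```python
# Hypothetical sketch: fetch the Wan2.2-S2V weights from Hugging Face.
# The repo id below is an assumption; verify it on the model's page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-S2V-14B")
print(f"Model files saved to {local_dir}")
```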
Alibaba Cloud, which has become a major contributor to the global open-source community, said its Wan2.1 and Wan2.2 models have generated over 6.9 million downloads on Hugging Face and ModelScope.
Wan2.2’s MoE architecture divides the model into separate sub-networks, or “experts”, that specialise in a subset of the input data to jointly perform a task.
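For readers unfamiliar with the technique, the sketch below shows the general routing idea behind MoE layers in plain PyTorch. It is a minimal illustration of the architecture class, not Alibaba’s implementation; the expert count, hidden sizes and top-2 routing are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal Mixture-of-Experts layer: a gating network routes each
    input token to its top-k expert feed-forward sub-networks."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top-k experts and
        # mix their outputs, weighted by the renormalised gate scores.
        scores = F.softmax(self.gate(x), dim=-1)        # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer(dim=64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

Because each token activates only a few of the experts, total parameter count can grow without a proportional increase in per-token compute, which is the main appeal of the approach for large video-generation models.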
To meet the diverse needs of professional content creators, Wan2.2-S2V provides two output resolutions, standard-definition 480p and high-definition 720p, according to Alibaba Cloud. That ensures high-quality visuals suitable for both social media content and professional presentations.
A sample video generated with Wan2.2-S2V shows a woman singing on an old vessel sailing through rough waves and stormy weather. Photo: Handout
The new model also enables the creation of videos across multiple framing options, including portrait, bust and full-body perspectives, according to the company.
The team behind Wan2.2-S2V said in a report accompanying the new release that the model can generate long-form videos with consistent visual details. They did not specify the maximum video length that the model supports.
Wan2.2-S2V comes several months after TikTok owner ByteDance released its OmniHuman-1 multimodal model, which converts a combination of images and audio clips into realistic videos.
This article originally appeared in the South China Morning Post (SCMP), the most authoritative voice reporting on China and Asia for more than a century. For more SCMP stories, please explore the SCMP app or visit the SCMP’s Facebook and Twitter pages. Copyright © 2025 South China Morning Post Publishers Ltd. All rights reserved.