Alibaba has announced the release of Wan2.2, an open-source suite of large video generation models whose flagship variants are built on a Mixture-of-Experts (MoE) architecture.
Model capabilities
The Wan2.2 series includes the text-to-video model Wan2.2-T2V-A14B, the image-to-video model Wan2.2-I2V-A14B, and a hybrid model, Wan2.2-TI2V-5B, that supports both text-to-video and image-to-video generation in a unified framework. Each model is designed to improve quality, efficiency and user control when generating cinematic-style video from text prompts or images.
Both Wan2.2-T2V-A14B and Wan2.2-I2V-A14B leverage the MoE architecture and use data curated for cinematic aesthetics. These models enable creators to adjust multiple video properties such as lighting, time of day, colour tone, camera angle, frame size, composition, and focal length. According to Alibaba, the models are capable of creating complex movements, including detailed facial expressions and elaborate sports scenes, while following instructions and physical rules more closely than before.
To address the computational cost of video generation, in particular the long token sequences that video diffusion models must process, Wan2.2-T2V-A14B and Wan2.2-I2V-A14B employ a two-expert design across the denoising process. One expert handles the overall scene layout during the high-noise stages, while the other refines detail during the low-noise stages. The models have 27 billion parameters in total but activate only 14 billion per denoising step, which the company claims reduces computational consumption by up to half.
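A minimal PyTorch sketch of that idea, assuming the hand-off between experts is made on the current noise level; the class names, threshold value and toy experts below are illustrative, not Alibaba's implementation:

    import torch
    import torch.nn as nn

    class TwoExpertDenoiser(nn.Module):
        """Sketch of a two-expert denoiser: one expert handles the high-noise
        (scene layout) steps, the other the low-noise (detail) steps. Only one
        expert runs per step, so the active parameter count is roughly half of
        the combined total."""

        def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                     switch_sigma: float = 1.0):
            super().__init__()
            self.high_noise_expert = high_noise_expert   # hypothetical "layout" expert
            self.low_noise_expert = low_noise_expert     # hypothetical "detail" expert
            self.switch_sigma = switch_sigma             # assumed hand-off noise level

        def forward(self, latents, sigma, cond):
            # Route on the current noise level: noisy early steps go to the layout
            # expert, cleaner late steps to the detail expert.
            expert = self.high_noise_expert if sigma > self.switch_sigma else self.low_noise_expert
            return expert(latents, sigma, cond)

    # Toy stand-ins for the two 14-billion-parameter experts, just to show the routing.
    class ToyExpert(nn.Module):
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(8, 8)

        def forward(self, latents, sigma, cond):
            return self.proj(latents + cond)

    denoiser = TwoExpertDenoiser(ToyExpert(), ToyExpert(), switch_sigma=1.0)
    latents, cond = torch.randn(1, 8), torch.zeros(1, 8)
    out_high = denoiser(latents, sigma=4.0, cond=cond)   # routed to the high-noise expert
    out_low = denoiser(latents, sigma=0.2, cond=cond)    # routed to the low-noise expert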
Aesthetic tuning
Wan2.2 introduces a cinematic-inspired prompt system that lets users shape results through aesthetic categories such as lighting, illumination, composition, and colour tone. According to the company, this allows the models to interpret and realise users' aesthetic intent more faithfully throughout the video generation process.
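As a hypothetical illustration (not an official Wan2.2 example), a prompt written in this style might read: "A fishing boat leaves a harbour at dawn; soft golden-hour lighting, backlit illumination, low-angle wide composition, muted teal-and-orange colour tone."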
Alibaba has also expanded the training dataset for Wan2.2, reporting 65.6% more image data and 83.2% more video data than the previous version, Wan2.1. The larger dataset is intended to strengthen generalisation and creative diversity, allowing the models to produce more intricate scenes and a greater artistic range.
Hybrid model and efficiency
The hybrid model, Wan2.2-TI2V-5B, takes a dense (non-MoE) approach built on a high-compression 3D Variational Autoencoder (VAE) with a compression ratio of 4×16×16 across the temporal, height, and width dimensions, which Alibaba says yields an overall information compression rate of 64. The company states that TI2V-5B can generate a five-second 720p video in several minutes on a single consumer-grade GPU, positioning it as an efficient, scalable option for developers and content creators.
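As a back-of-the-envelope illustration of what that compression ratio means for a five-second 720p clip (the frame rate, rounding behaviour, and latent channel count below are assumptions, not official figures):

    # Illustrative arithmetic only: how a 4x16x16 (T x H x W) compression ratio
    # maps a 5-second 720p clip onto a latent grid. The 24 fps frame rate and the
    # flat integer division are assumptions made for this example.
    frames, height, width = 5 * 24, 720, 1280          # 120 frames at 1280x720
    t_ratio, h_ratio, w_ratio = 4, 16, 16              # per-axis compression ratios
    latent_shape = (frames // t_ratio, height // h_ratio, width // w_ratio)
    print(latent_shape)                                 # (30, 45, 80)
    # Each latent position then covers 4 * 16 * 16 = 1024 pixels (3072 RGB values).
    # The "compression rate of 64" quoted above would follow if the latent stored
    # 48 channels per position (3072 / 48 = 64) -- an assumption, not a figure
    # given in the announcement.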
Open-source and community engagement
All Wan2.2 models are available for download on Hugging Face, GitHub, and Alibaba Cloud's open-source platform, ModelScope. Alibaba reports that since it open-sourced four Wan2.1 models in February 2025 and Wan2.1-VACE (Video All-in-one Creation and Editing) in May 2025, the models have collectively attracted more than 5.4 million downloads on Hugging Face and ModelScope.
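For developers, a minimal sketch of pulling the weights with the huggingface_hub client is shown below; the repository ID is inferred from the model naming above and should be confirmed on the hub before use.

    from huggingface_hub import snapshot_download

    # Fetch the text-to-video checkpoint; the repo ID is an assumption based on
    # the model name above, not a value confirmed in the announcement.
    local_path = snapshot_download(
        repo_id="Wan-AI/Wan2.2-T2V-A14B",
        local_dir="./Wan2.2-T2V-A14B",
    )
    print(local_path)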
Alibaba’s release of Wan2.2 underscores its continued activity within the open-source ecosystem and the ongoing development of video generation models aimed at supporting creators and developers globally.