Wan 2.2

Alibaba (Tongyi Lab)

Mixture-of-Experts (MoE) architecture with 27B total parameters, of which only 14B are active per step. Trained on 65.6% more images and 83.2% more video than Wan 2.1. Outperforms leading closed-source models on Wan-Bench 2.0.

Wan 2.1

Alibaba (Tongyi Lab)

The foundation that started it all. The 1.3B variant runs on most consumer GPUs (8.19GB VRAM). First open model to beat closed-source models across benchmarks.

Pick Wan 2.2 if…

You want cinematic style control, speech-to-video, or consumer GPU deployment (TI2V-5B).

Pick Wan 2.1 if…

You want consumer GPU workflows, academic research, or Chinese + English text-in-video.

Specifications

| Spec | Wan 2.2 | Wan 2.1 |
|---|---|---|
| Maker | Alibaba (Tongyi Lab) | Alibaba (Tongyi Lab) |
| Source Type | Open Source | Open Source |
| License | Apache 2.0 | Apache 2.0 |
| Architecture | DiT + MoE (2-expert: high-noise + low-noise) | Flow Matching DiT + 3D Causal VAE |
| Parameters | 27B total (14B active per step, 2x14B experts) | 14B (also 1.3B variant) |
| Max Resolution | 720p | 720p |
| Max Duration | 10-15s | 5s |
| FPS | 24 | 24 |
| Native Audio | No | No |
| ComfyUI Support | Yes | Yes |
| Fine-tunable | Yes | Yes |
| Min VRAM | 8GB (small) / 24GB (full) | 8.19GB (1.3B) / 24GB+ (14B) |
| Cost / Second | Self-host | Self-host |
| Inputs | T2V (A14B), I2V (A14B), TI2V (5B), S2V (14B) | T2V (14B/1.3B), I2V (14B), FLF2V, VACE, V2A |
| On Floyo | Yes | Yes |
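
The Min VRAM and Inputs figures above can be turned into a quick shortlist helper. The following is a minimal sketch, not an official tool: the `Variant` type and `candidates` function are hypothetical names, "8GB (small)" is assumed to refer to TI2V-5B and "24GB (full)" to A14B, and "24GB+" is treated as a 24GB floor. S2V, FLF2V, VACE, and V2A are left out because the specifications give no per-mode VRAM figure for them.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    model: str
    name: str
    min_vram_gb: float   # minimum VRAM as listed in the specifications
    tasks: frozenset     # input modes listed for this variant


# Figures copied from the specifications; the variant-to-VRAM mapping
# for Wan 2.2 ("small" = TI2V-5B, "full" = A14B) is an assumption.
VARIANTS = [
    Variant("Wan 2.2", "A14B", 24.0, frozenset({"T2V", "I2V"})),
    Variant("Wan 2.2", "TI2V-5B", 8.0, frozenset({"TI2V"})),
    Variant("Wan 2.1", "14B", 24.0, frozenset({"T2V", "I2V"})),
    Variant("Wan 2.1", "1.3B", 8.19, frozenset({"T2V"})),
]


def candidates(vram_gb: float, task: str) -> list:
    """Return variants that support `task` and fit in `vram_gb` of VRAM."""
    return [
        f"{v.model} {v.name}"
        for v in VARIANTS
        if task in v.tasks and vram_gb >= v.min_vram_gb
    ]


if __name__ == "__main__":
    print(candidates(12, "T2V"))
    print(candidates(24, "I2V"))
```

On a 12GB card asking for T2V, only the Wan 2.1 1.3B variant passes the filter; at 24GB both 14B-class models qualify for I2V. Real-world headroom depends on resolution, offloading, and quantization, none of which this sketch models.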

Strengths & Trade-offs

Wan 2.2

Strengths

  • +First MoE in video diffusion
  • +27B total but only 14B active per step
  • +high-noise expert for layout, low-noise expert for detail
  • +65.6% more images and 83.2% more video training data vs 2.1
  • +cinematic aesthetic control (lighting, composition, contrast, color tone)

Trade-offs

  • -720p cap
  • -MoE expert switching needs careful threshold tuning (SNR-based)
  • -no native audio in base model (S2V is separate)
  • -newer ecosystem than 2.1
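
The two-expert design and its SNR-based threshold can be pictured with a small sketch. This is an illustration only, not Wan 2.2's actual code: the `snr` formula assumes a simple flow-matching-style interpolation between clean data and noise, and the names `Expert`, `pick_expert`, `expert_schedule`, and the threshold value of 1.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    params_b: float  # parameter count in billions


HIGH_NOISE = Expert("high-noise", 14.0)  # early steps: overall layout
LOW_NOISE = Expert("low-noise", 14.0)    # late steps: fine detail


def snr(t: float) -> float:
    """Toy signal-to-noise ratio, ASSUMING x_t = (1 - t) * x0 + t * noise.

    t = 1.0 is pure noise (SNR 0); SNR grows as t approaches 0.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    return ((1.0 - t) / t) ** 2


def pick_expert(t: float, snr_threshold: float) -> Expert:
    """Route a denoising step to exactly one expert based on its SNR."""
    return HIGH_NOISE if snr(t) < snr_threshold else LOW_NOISE


def expert_schedule(num_steps: int, snr_threshold: float) -> list:
    """List (t, expert name) pairs for evenly spaced steps from t=1.0 down."""
    timesteps = [1.0 - i / num_steps for i in range(num_steps)]
    return [(t, pick_expert(t, snr_threshold).name) for t in timesteps]


if __name__ == "__main__":
    for t, name in expert_schedule(num_steps=10, snr_threshold=1.0):
        print(f"t={t:.1f} -> {name} expert")
```

Because each step is routed to exactly one 14B expert, only 14B parameters are active at a time even though both experts must be stored. Raising `snr_threshold` keeps the high-noise expert active for more steps, which is the kind of tuning the trade-off refers to.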

Best For

  • Self-hosted production
  • cinematic style control
  • speech-to-video
  • consumer GPU deployment (TI2V-5B)

Wan 2.1

Strengths

  • +SOTA open-source at launch
  • +1.3B model runs on most consumer GPUs (8.19GB VRAM)
  • +first video model to generate both Chinese and English text
  • +Wan-VAE encodes and decodes 1080p video of unlimited length
  • +T2V/I2V/Video Editing/T2I/V2A all supported

Trade-offs

  • -720p max resolution
  • -5s max duration
  • -limited output quality from the 1.3B variant
  • -no native audio generation
  • -superseded by 2.2 on quality

Best For

  • Budget local deployment
  • consumer GPU workflows
  • academic research
  • Chinese + English text-in-video