CogVideoX-5B

Tsinghua / Zhipu AI

Lightweight entry point: runs on a 12GB GPU under an Apache 2.0 license. The most accessible option for experimentation.
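The 12GB figure is easier to see with quick arithmetic. A minimal sketch, assuming bf16 weights and that CPU offloading keeps activations and the VAE off the GPU (that offload assumption is mine, not stated on this page):

```python
# Back-of-envelope VRAM estimate for holding the 5B transformer weights.
# bf16 = 2 bytes/param, int8 = 1 byte/param (standard sizes).

def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Gigabytes needed for the model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 1e9

bf16 = weight_gb(5, 2)  # 5B params * 2 bytes -> 10.0 GB
int8 = weight_gb(5, 1)  # 8-bit quantized     ->  5.0 GB

print(f"bf16 weights: {bf16:.1f} GB, int8: {int8:.1f} GB")
```

With ~10GB of bf16 weights, a 12GB card leaves little headroom, which is why offloading or quantization is the usual route at this VRAM tier.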

Wan 2.2

Alibaba (Tongyi Lab)

MoE architecture with 27B total parameters but only 14B active per denoising step. Trained on 65.6% more images and 83.2% more video data than Wan 2.1. Outperforms leading closed-source models on Wan-Bench 2.0.
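The two-expert design can be sketched as a simple router: early, high-noise denoising steps go to the layout expert, late, low-noise steps to the detail expert, so only one 14B expert is active at a time. A hypothetical illustration; the normalized-timestep proxy and the `boundary` value are mine, standing in for Wan 2.2's actual tuned SNR-based threshold:

```python
# Hypothetical sketch of two-expert MoE routing, Wan 2.2-style:
# one of two 14B experts runs per denoising step, so the 27B model
# executes with only 14B active parameters.

HIGH_NOISE = "high-noise expert (layout)"
LOW_NOISE = "low-noise expert (detail)"

def route(t: float, boundary: float = 0.5) -> str:
    """Pick an expert from normalized timestep t (1.0 = pure noise, 0.0 = clean).

    t is a stand-in for the real SNR-based criterion; `boundary` is the
    switch threshold that needs careful tuning in practice.
    """
    return HIGH_NOISE if t >= boundary else LOW_NOISE

# Early steps lay out the scene; late steps refine detail.
for t in (1.0, 0.8, 0.6, 0.4, 0.2):
    print(f"t={t:.1f} -> {route(t)}")
```

This is also why the threshold matters: switch too early and the detail expert inherits an unsettled layout; too late and layout capacity is wasted on refinement steps.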

Pick CogVideoX-5B if…

You want budget prototyping, research, or motion-heavy short clips.

Pick Wan 2.2 if…

You want cinematic style control, speech-to-video, or consumer GPU deployment (TI2V-5B).

Specifications

                  CogVideoX-5B           Wan 2.2
Maker             Tsinghua / Zhipu AI    Alibaba (Tongyi Lab)
Source Type       Open Source            Open Source
License           Apache 2.0             Apache 2.0
Architecture      Expert Transformer     DiT + MoE (2-expert: high-noise + low-noise)
Parameters        5B                     27B total (14B active per step, 2x14B experts)
Max Resolution    720p (1360x768)        720p
Max Duration      6s                     10-15s
FPS               8                      24
Native Audio      No                     No
ComfyUI Support   Yes                    Yes
Fine-tunable      Yes                    Yes
Min VRAM          12GB                   8GB (small) / 24GB (full)
Cost / Second     Self-host              Self-host
Inputs            T2V                    T2V (A14B), I2V (A14B), TI2V (5B), S2V (14B)
On Floyo          No                     Yes

Strengths & Trade-offs

CogVideoX-5B

Strengths

  • Lightweight: runs on 12GB VRAM
  • Apache 2.0 license
  • LoRA fine-tuning and DDIM Inverse support
  • ModelScope integration

Trade-offs

  • Low frame rate (8 fps)
  • 6s maximum duration
  • Older architecture

Best For

  • Budget prototyping
  • Research
  • Motion-heavy short clips

Wan 2.2

Strengths

  • First MoE architecture in video diffusion
  • 27B total parameters with only 14B active per step
  • High-noise expert handles layout; low-noise expert refines detail
  • 65.6% more image and 83.2% more video training data vs 2.1
  • Cinematic aesthetic control (lighting, composition, contrast, color tone)
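The aesthetic control listed above is exercised through prompt descriptors. A hedged sketch of how such a prompt might be assembled; the helper, its field names, and the example values are invented for illustration, not a documented Wan API (the four aesthetic axes come from the list above):

```python
# Hypothetical helper for composing a Wan 2.2-style cinematic prompt.
# Axes (lighting, composition, contrast, color tone) are from the
# comparison above; the descriptor values are made up.

def cinematic_prompt(subject: str, **aesthetics: str) -> str:
    """Append aesthetic descriptors to a base subject prompt."""
    tags = ", ".join(f"{axis}: {value}" for axis, value in aesthetics.items())
    return f"{subject}. {tags}" if tags else subject

prompt = cinematic_prompt(
    "A lone sailboat crossing a storm front",
    lighting="golden hour backlight",
    composition="rule of thirds, low angle",
    contrast="high contrast",
)
print(prompt)
```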

Trade-offs

  • 720p resolution cap
  • MoE expert switch needs careful SNR-based threshold tuning
  • No native audio in the base model (S2V is a separate variant)
  • Newer, less mature ecosystem than 2.1's

Best For

  • Self-hosted production
  • Cinematic style control
  • Speech-to-video
  • Consumer GPU deployment (TI2V-5B)