Leaderboard

Rankings of video generation models on wmbench (250 prompts, scores on a 1–5 scale; higher is better), reported for both human evaluation and PhyJudge-9B auto-evaluation.

Human evaluation (paper)

Method. Per-annotator scores → per-video mean → per-model mean across 250 prompts. Overall combines the General and Domain views with equal weight; within the Domain view, each domain's mean is weighted by the number of (video, law) means contributing to that domain. Bold marks the best score in each column; underline marks the second-best. † denotes a closed-source model.
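A minimal sketch of this aggregation in Python (the tuple layout and function name are illustrative assumptions, not the benchmark's released evaluation code):

```python
from collections import defaultdict

def aggregate_human_scores(ratings):
    """Aggregate raw annotator ratings into per-model scores.

    ratings: iterable of (model, video_id, score) tuples, one per
    annotator judgment on the 1-5 scale (assumed layout).
    Pipeline: per-annotator scores -> per-video mean -> per-model mean.
    """
    per_video = defaultdict(list)
    for model, video_id, score in ratings:
        per_video[(model, video_id)].append(score)

    per_model = defaultdict(list)
    for (model, _video_id), annotator_scores in per_video.items():
        # Per-video mean over that video's annotators.
        per_model[model].append(sum(annotator_scores) / len(annotator_scores))

    # Per-model mean across the 250 prompt videos.
    return {model: sum(means) / len(means) for model, means in per_model.items()}
```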

General Physics ↑: SA, PTV, Persist. · Domain Physics ↑: Solid-Body, Fluid, Optical

| Model | SA | PTV | Persist. | Solid-Body | Fluid | Optical | Overall ↑ |
|---|---|---|---|---|---|---|---|
| Wan2.2-27B-A14B | <u>3.10</u> | **3.37** | **3.50** | **3.23** | 3.18 | <u>3.55</u> | **3.28** |
| Veo-3.1 † | **3.26** | <u>3.29</u> | <u>3.42</u> | <u>3.12</u> | **3.65** | **3.69** | **3.28** |
| OmniWeaving | 2.97 | 3.17 | 3.34 | 2.98 | <u>3.26</u> | 3.22 | 3.10 |
| Cosmos-14B | 2.66 | 2.98 | 3.20 | 2.81 | 2.98 | 3.38 | 2.91 |
| LTX-2.3-22B | 2.58 | 2.74 | 2.72 | 2.62 | 3.07 | 2.98 | 2.69 |
| Wan2.2-TI2V-5B | 2.44 | 2.68 | 2.78 | 2.58 | 2.71 | 2.99 | 2.63 |
| Cosmos-2B | 2.33 | 2.56 | 2.77 | 2.53 | 2.77 | 3.35 | 2.58 |
| LTX-2-19B | 2.56 | 2.64 | 2.48 | 2.46 | 3.04 | 2.78 | 2.56 |

8 models · generated 2026-05-10T19:38:22+00:00

PhyJudge-9B (auto-eval)

Method. Per-video scores from PhyJudge-9B (a Qwen3.5-9B finetune using the sub-question + human-prompt setup) → per-model mean across 250 prompts. Overall combines the General and Domain views with equal weight; within the Domain view, each domain's mean is weighted by the number of (video, law) scores contributing to that domain. Bold marks the best score in each column; underline marks the second-best. † denotes a closed-source model.
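A minimal sketch of the Overall combination, assuming the per-view means are already computed (the function name, data layout, and domain keys are illustrative):

```python
def overall_score(general_mean, domain_means, domain_counts):
    """Combine the General and Domain views with equal weight.

    general_mean:  mean over the General Physics scores for one model.
    domain_means:  per-domain means, e.g. {"solid_body": ..., "fluid": ..., "optical": ...}.
    domain_counts: number of (video, law) scores behind each domain mean,
                   used to weight the domain-level average.
    """
    total = sum(domain_counts.values())
    domain_view = sum(domain_means[d] * domain_counts[d] for d in domain_means) / total
    # Equal-weight combination of the two views.
    return 0.5 * (general_mean + domain_view)
```

Note that the Overall column cannot be reproduced exactly from the tables alone, since the per-domain contribution counts are not shown.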

General Physics ↑: SA, PTV, Persist. · Domain Physics ↑: Solid-Body, Fluid, Optical

| Model | SA | PTV | Persist. | Solid-Body | Fluid | Optical | Overall ↑ |
|---|---|---|---|---|---|---|---|
| Veo-3.1 † | **3.03** | **3.10** | <u>3.11</u> | **2.90** | **3.33** | **3.70** | **3.05** |
| Cosmos-14B | <u>2.79</u> | 2.96 | **3.26** | **2.90** | <u>3.14</u> | 3.09 | <u>2.97</u> |
| Wan2.2-27B-A14B | 2.78 | <u>2.97</u> | 3.08 | 2.84 | 3.04 | 3.36 | 2.92 |
| Cosmos-2B | 2.60 | 2.73 | 3.07 | 2.72 | 2.92 | <u>3.57</u> | 2.80 |
| OmniWeaving | 2.68 | 2.73 | 2.92 | 2.71 | 2.99 | 3.13 | 2.78 |
| LTX-2.3-22B | 2.63 | 2.79 | 2.91 | 2.55 | 3.02 | 3.21 | 2.72 |
| Wan2.2-TI2V-5B | 2.48 | 2.70 | 2.76 | 2.61 | 3.01 | 3.45 | 2.68 |
| LTX-2-19B | 2.50 | 2.62 | 2.79 | 2.49 | 3.01 | 3.09 | 2.62 |

8 models · generated 2026-05-10T19:48:53+00:00