Leaderboard

Rankings of video generation models on wmbench (250 prompts, scores on a 1–5 scale; higher is better), reported for both human evaluation and PhyJudge-9B auto-evaluation.

Human evaluation (paper)

Method. Per-annotator scores → per-video mean → per-model mean across 250 prompts. Overall combines the General and Domain views with equal weight; within the Domain view, each domain's mean is weighted by the number of (video, law) means contributing to that domain. Bold marks the best score in each column; underline marks the second-best. † denotes a closed-source model.
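A minimal sketch of this aggregation in Python (the tuple layout and function name are illustrative assumptions, not the benchmark's released evaluation code):

```python
from collections import defaultdict

def aggregate_human_scores(ratings):
    """Aggregate raw annotator ratings into per-model scores.

    ratings: iterable of (model, video_id, score) tuples, one per
    annotator judgment on the 1-5 scale (assumed layout).
    Pipeline: per-annotator scores -> per-video mean -> per-model mean.
    """
    per_video = defaultdict(list)
    for model, video_id, score in ratings:
        per_video[(model, video_id)].append(score)

    per_model = defaultdict(list)
    for (model, _video_id), annotator_scores in per_video.items():
        # Per-video mean over that video's annotators.
        per_model[model].append(sum(annotator_scores) / len(annotator_scores))

    # Per-model mean across the 250 prompt videos.
    return {model: sum(means) / len(means) for model, means in per_model.items()}
```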

General Physics ↑: SA, PTV, Persist. · Domain Physics ↑: Solid-Body, Fluid, Optical

| Model | SA | PTV | Persist. | Solid-Body | Fluid | Optical | Overall ↑ |
|---|---|---|---|---|---|---|---|
| Wan2.2-27B-A14B | <u>3.10</u> | **3.37** | **3.50** | **3.23** | 3.18 | <u>3.55</u> | **3.28** |
| Veo-3.1 † | **3.26** | <u>3.29</u> | <u>3.42</u> | <u>3.12</u> | **3.65** | **3.69** | **3.28** |
| OmniWeaving | 2.97 | 3.17 | 3.34 | 2.98 | <u>3.26</u> | 3.22 | 3.10 |
| Cosmos-14B | 2.66 | 2.98 | 3.20 | 2.81 | 2.98 | 3.38 | 2.91 |
| LTX-2.3-22B | 2.58 | 2.74 | 2.72 | 2.62 | 3.07 | 2.98 | 2.69 |
| Wan2.2-TI2V-5B | 2.44 | 2.68 | 2.78 | 2.58 | 2.71 | 2.99 | 2.63 |
| Cosmos-2B | 2.33 | 2.56 | 2.77 | 2.53 | 2.77 | 3.35 | 2.58 |
| LTX-2-19B | 2.56 | 2.64 | 2.48 | 2.46 | 3.04 | 2.78 | 2.56 |

8 models · generated 2026-05-10T19:38:22+00:00

PhyJudge-9B (auto-eval)

Method. Per-video scores from PhyJudge-9B (a Qwen3.5-9B finetune using the sub-question + human-prompt setup) → per-model mean across 250 prompts. Overall combines the General and Domain views with equal weight; within the Domain view, each domain's mean is weighted by the number of (video, law) scores contributing to that domain. Bold marks the best score in each column; underline marks the second-best. † denotes a closed-source model.
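A minimal sketch of the Overall combination, assuming the per-view means are already computed (the function name, data layout, and domain keys are illustrative):

```python
def overall_score(general_mean, domain_means, domain_counts):
    """Combine the General and Domain views with equal weight.

    general_mean:  mean over the General Physics scores for one model.
    domain_means:  per-domain means, e.g. {"solid_body": ..., "fluid": ..., "optical": ...}.
    domain_counts: number of (video, law) scores behind each domain mean,
                   used to weight the domain-level average.
    """
    total = sum(domain_counts.values())
    domain_view = sum(domain_means[d] * domain_counts[d] for d in domain_means) / total
    # Equal-weight combination of the two views.
    return 0.5 * (general_mean + domain_view)
```

Note that the Overall column cannot be reproduced exactly from the tables alone, since the per-domain contribution counts are not shown.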

General Physics ↑: SA, PTV, Persist. · Domain Physics ↑: Solid-Body, Fluid, Optical

| Model | SA | PTV | Persist. | Solid-Body | Fluid | Optical | Overall ↑ |
|---|---|---|---|---|---|---|---|
| Veo-3.1 † | **3.03** | **3.10** | <u>3.11</u> | **2.90** | **3.33** | **3.70** | **3.05** |
| Cosmos-14B | <u>2.79</u> | 2.96 | **3.26** | **2.90** | <u>3.14</u> | 3.09 | <u>2.97</u> |
| Wan2.2-27B-A14B | 2.78 | <u>2.97</u> | 3.08 | 2.84 | 3.04 | 3.36 | 2.92 |
| Cosmos-2B | 2.60 | 2.73 | 3.07 | 2.72 | 2.92 | <u>3.57</u> | 2.80 |
| OmniWeaving | 2.68 | 2.73 | 2.92 | 2.71 | 2.99 | 3.13 | 2.78 |
| LTX-2.3-22B | 2.63 | 2.79 | 2.91 | 2.55 | 3.02 | 3.21 | 2.72 |
| Wan2.2-TI2V-5B | 2.48 | 2.70 | 2.76 | 2.61 | 3.01 | 3.45 | 2.68 |
| LTX-2-19B | 2.50 | 2.62 | 2.79 | 2.49 | 3.01 | 3.09 | 2.62 |

8 models · generated 2026-05-10T19:48:53+00:00