
WBench

A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Kaining Ying*, Hengrui Hu*, Siyu Ren, Jiamu Li, Fengjiao Chen,
Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding

Fudan University    Meituan LongCat Team

*Equal contribution    Corresponding author: hhding@fudan.edu.cn

📌 289 Test Cases: 👁️ 2 Perspectives · 🧑 5 Subject types · 🌍 6 Scene categories · 🎨 Multiple styles
📎 1,058 Interaction Turns: 🧭 158 Navigation · 🎬 65 Event Edit · 🏃 76 Subject Action · 🔄 31 Perspective Switch
🔖 22 Metrics: 🎥 6 Video Quality · 🎯 2 Setting · 🕹️ 4 Interaction · 🔗 8 Consistency · ⚙️ 2 Physical · ✅ Human-validated
🏷️ 20 Models: 📝 9 Text · 📷 5 Camera · 🎮 6 Action

TL;DR

Interactive world models are advancing rapidly, yet no unified standard exists for evaluating them systematically. WBench fills this gap with 289 multi-turn cases across 5 dimensions, evaluating 20 models with 22 metrics validated against human judgments. We find that no single model dominates all dimensions.

  • Multi-turn: navigation + subject action + event editing + perspective switch (PS)
  • Navigation: W/A/S/D/left/right/up/down
  • Subject Action: character action
  • Event Editing: environment change
  • Perspective Switch: first-person (FP) ↔ third-person (TP)

Key Findings

1. 🏆 No model dominates all dimensions. The 20 evaluated models include commercial APIs (Kling 3.0, Seedance 1.5), open-source models (Wan 2.7, HY-Video, Cosmos), and closed-source beta world models (Genie 3, Happy Oyster, HY-World), and each excels in different aspects. Kling 3.0 leads overall but lags in Consistency; HY-Video ranks 1st in Consistency among text-conditioned models but struggles with Interaction; world models such as Happy Oyster and HY-World dominate Navigation yet underperform in Video Quality.
2. 🧭 Navigation is largely independent of other capabilities. Among text-conditioned models, YUME 1.5 achieves the highest navigation score (72.0) yet ranks near the bottom on event editing (57.8) and perspective switching (16.7). Conversely, Wan 2.7 leads in event editing (84.0) and subject action (83.4) but scores only 66.0 on navigation. This suggests that navigation and semantic interaction require fundamentally different internal representations.
3. 🎬 Camera control does not imply subject control. Camera-conditioned world models (InSpatio, LingBot, HY-World) achieve high perspective consistency and navigation scores, whereas action-conditioned models (Genie 3, Happy Oyster, MatrixGame) handle perspective switching better. The two control paradigms remain orthogonal.
4. Physical correctness follows rendering quality, not control ability. Models with higher video quality tend to produce more physically plausible outputs (correlation ρ = 0.82), while control ability (navigation, interaction) shows near-zero correlation with physics scores. This suggests that physics emerges from visual fidelity rather than from world understanding.
5. 🔄 Multi-turn interactions compound errors. Navigation accuracy drops by 21 points from turn 1 to turn 4 as errors accumulate across steps. Dedicated world models (HY-World) degrade much less than text-conditioned models (Kling 3.0), suggesting that explicit geometric control preserves spatial state better than text-based prompting.
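Finding 4 rests on a correlation between video quality and physical scores. As a rough illustration of how such a correlation can be computed, the sketch below runs a Spearman rank correlation over the Quality and Physical columns of the leaderboard on this page. Several things here are assumptions: the page does not say whether ρ is a Spearman or Pearson coefficient, nor which scores or model subset it was computed on, so the printed value need not reproduce 0.82. The function names are hypothetical.

```python
from statistics import mean

# Quality and Physical scores copied from the leaderboard on this page,
# in leaderboard order (Kling 3.0 first, Astra last).
QUALITY = [83.0, 81.5, 82.6, 80.2, 79.7, 79.3, 83.2, 75.6, 78.7, 74.9,
           75.5, 77.4, 78.2, 79.5, 78.7, 76.9, 76.4, 74.9, 75.7, 69.7]
PHYSICAL = [69.3, 71.2, 71.8, 66.3, 67.4, 63.5, 68.4, 67.4, 64.9, 65.2,
            66.8, 65.7, 68.9, 65.2, 62.1, 59.3, 60.4, 62.4, 57.2, 51.4]


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


print(f"Spearman rho (Quality vs. Physical) = {spearman(QUALITY, PHYSICAL):.2f}")
```

Ranking with averaged ties matters here because several scores in both columns repeat (for example, 78.7 appears twice under Quality).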

Leaderboard

Scores 0–100, higher is better. # = per-metric rank.

| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|---|-------|---------|---------|---------|-------------|-------------|----------|
| 1 🥇 | Kling 3.0 (Kling AI · API) | 79.2 | 83.0 | 91.0 | 70.3 | 82.5 | 69.3 |
| 2 🥈 | LingBot-World (Ant Group · Open Source) | 78.8 | 81.5 | 72.6 | 79.8 | 88.9 | 71.2 |
| 3 🥉 | Wan 2.7 (Alibaba · Open Source) | 78.5 | 82.6 | 91.4 | 66.0 | 80.5 | 71.8 |
| 4 | HY-World 1.5 (Tencent · Open Source) | 78.4 | 80.2 | 72.2 | 87.5 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 (Tencent · Open Source) | 78.2 | 79.7 | 85.6 | 71.8 | 86.7 | 67.4 |
| 6 | Happy Oyster (Alibaba · Web) | 77.1 | 79.3 | 74.2 | 85.1 | 83.3 | 63.5 |
| 7 | Seedance 1.5 (ByteDance · API) | 76.5 | 83.2 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 (NVIDIA · Open Source) | 75.2 | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 (Lightricks · Open Source) | 74.4 | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World (InSpatio · Open Source) | 74.3 | 74.9 | 71.4 | 72.8 | 87.4 | 65.2 |
| 11 | Fantasy-World (Amap · Open Source) | 74.2 | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 (Google · Web) | 74.1 | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video (Meituan · Open Source) | 73.7 | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 (Shanghai AI Lab · Open Source) | 73.5 | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World (Nankai · Open Source) | 72.9 | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 (Skywork · Open Source) | 71.2 | 76.9 | 63.6 | 83.5 | 72.9 | 59.3 |
| 17 | Kairos 3.0 (SenseTime · Open Source) | 70.7 | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft (Tencent · Open Source) | 68.5 | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 (Skywork · Open Source) | 68.5 | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra (Tsinghua · Open Source) | 64.0 | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
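The page does not define how the Average column is computed. The listed values are, however, consistent with an unweighted mean of the five dimension scores, rounded to one decimal. The sketch below checks that reading on a few rows copied from the leaderboard. This is an observation about the numbers, not a rule stated by the authors, and the helper names are hypothetical.

```python
# model -> (reported Average, [Quality, Setting, Interaction, Consistency, Physical])
# Rows copied from the leaderboard on this page.
ROWS = {
    "Kling 3.0": (79.2, [83.0, 91.0, 70.3, 82.5, 69.3]),
    "LingBot-World": (78.8, [81.5, 72.6, 79.8, 88.9, 71.2]),
    "Wan 2.7": (78.5, [82.6, 91.4, 66.0, 80.5, 71.8]),
    "HY-World 1.5": (78.4, [80.2, 72.2, 87.5, 86.0, 66.3]),
    "Astra": (64.0, [69.7, 59.6, 67.7, 71.6, 51.4]),
}


def dimension_mean(scores):
    """Unweighted mean of the per-dimension scores (assumed aggregation)."""
    return sum(scores) / len(scores)


def matches_reported(reported, scores, tolerance=0.05):
    """True if the mean agrees with the reported value up to one-decimal rounding."""
    return abs(dimension_mean(scores) - reported) <= tolerance + 1e-9


for model, (reported, scores) in ROWS.items():
    status = "ok" if matches_reported(reported, scores) else "mismatch"
    print(f"{model}: mean={dimension_mean(scores):.2f} reported={reported} {status}")
```

An unweighted mean gives each dimension the same influence regardless of how many metrics it contains, so the 8 Consistency metrics and the 2 Physical metrics would count equally under this reading.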

Evaluation Metrics

22 metrics across 5 dimensions. All scores normalized to 0–100, higher is better.

Video Quality (7)

  • Aesthetic: VBench aesthetic scorer
  • Imaging: VBench technical quality
  • Flickering: inter-frame brightness stability
  • Dynamic: RAFT optical flow magnitude
  • Smoothness: RAFT flow consistency
  • Background: CLIP background similarity
  • HPSv3: Human Preference Score v3

Setting Adherence (2)

  • Scene (VLM): scene elements vs. environment prompt
  • Subject (VLM): appearance/action vs. character prompt

Interaction (4)

  • Navigation: MegaSAM pose estimation, NavScore = (Acc + Con) / 2
  • Event Edit (VLM): environment changes vs. instruction
  • Subject Action (VLM): action execution correctness
  • Perspective Switch (VLM): FP ↔ TP transition accuracy
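The navigation formula above is simple enough to write out directly. The sketch below assumes that Acc and Con are an accuracy term and a consistency term, both already on the 0–100 scale; the page gives only the abbreviations and does not say how either term is derived from the MegaSAM poses. Averaging over turns in `mean_nav_score` is likewise an assumption, and both function names are hypothetical.

```python
def nav_score(acc, con):
    """NavScore = (Acc + Con) / 2, with both inputs assumed to lie in [0, 100]."""
    for name, value in (("acc", acc), ("con", con)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return (acc + con) / 2.0


def mean_nav_score(per_turn):
    """Average NavScore over (acc, con) pairs, one pair per navigation turn.

    Per-turn averaging is an assumption; the page does not specify
    how turns are aggregated.
    """
    if not per_turn:
        raise ValueError("per_turn must contain at least one (acc, con) pair")
    return sum(nav_score(acc, con) for acc, con in per_turn) / len(per_turn)


print(nav_score(80.0, 60.0))  # 70.0
```

Because the two terms are weighted equally, a model that moves in the right direction but jitters (high Acc, low Con) scores the same as one that moves smoothly in a partly wrong direction.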

Consistency (8)

  • Spatial (DreamSim): first vs. last frame after loop
  • Gated Spatial: Spatial × dynamic degree gate
  • Perspective: SAM2 mask centroid stability
  • Segment: TransNetV2 shot-boundary detection; score = 1 − cut_rate
  • Geometric: DA3 depth reprojection error
  • Photometric: DA3 pixel reprojection PSNR
  • Subject: DINOv2 + CLIP masked similarity
  • Background: CLIP background region similarity
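Two of the consistency metrics above reduce to short formulas. The sketch below shows the standard PSNR definition, as used by the Photometric metric, and the 1 − cut_rate form of the Segment metric. The details are assumptions: the page does not say how cut_rate is counted (here it is taken as detected cuts divided by the number of turn transitions) or how PSNR in dB is mapped to the 0–100 scale, so that mapping is left out. Function names are hypothetical.

```python
import math


def psnr(reference, reprojected, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reprojected) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reprojected)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def segment_score(num_cuts, num_transitions):
    """100 * (1 - cut_rate), with cut_rate = cuts / transitions (assumed definition)."""
    if num_transitions <= 0 or not 0 <= num_cuts <= num_transitions:
        raise ValueError("need 0 <= num_cuts <= num_transitions and num_transitions > 0")
    cut_rate = num_cuts / num_transitions
    return 100.0 * (1.0 - cut_rate)


print(segment_score(1, 4))  # 75.0
```

In a real pipeline the pixel lists would come from a frame and its depth-based reprojection, and the cut count from a shot-boundary detector; neither step is reproduced here.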

Physical (2)

  • Visual Plausibility: tuned VLM-based plausibility regressor (0–5 rescaled to 0–100)
  • Causal Fidelity (VLM): physical cause-effect correctness
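The Visual Plausibility entry mentions a 0–5 to 0–100 conversion. A minimal sketch of that conversion follows, assuming it is a plain linear rescale. Clamping out-of-range regressor outputs is an extra assumption not stated on the page, and the function name is hypothetical.

```python
def rescale_to_100(raw, low=0.0, high=5.0):
    """Linearly map a score from [low, high] to [0, 100].

    Values outside [low, high] are clamped first (an assumption; a
    regressor can overshoot its nominal range).
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(raw, low), high)
    return 100.0 * (clamped - low) / (high - low)


print(rescale_to_100(3.5))  # 70.0
```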

Metric Comparison

Same case, different models: see how the metrics capture quality differences.

Statistics

289 cases, 4 interaction types, 6 scene categories, 5 subject types.

  • Perspective: FPP, TPP
  • Interaction: Navigation, Subject Action, Event Edit, PS
  • SA Type: Locomotion, Manipulation, Tool Use, NPC
  • EE Type: Environment, Appearance, Obj.Mech, Obj.Phys, Obj.Natural
  • Subject: Human, Animal, Robot, Vehicle, Other
  • Scene: Nature, Urban, Indoor, Work, Fantasy, Sports
  • Style: Photorealistic, Styled
  • Turns: 4 turns, 5+, 3, 2