WBench - Interactive World Model Benchmark

A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Kaining Ying^*, Hengrui Hu^*, Siyu Ren, Jiamu Li, Fengjiao Chen,
Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding^†

Fudan University · Meituan LongCat Team

^*Equal contribution ^†Corresponding author: hhding@fudan.edu.cn

289 Test Cases: 2 perspectives, 5 subject types, 6 scene categories, multiple styles

1,058 Interaction Turns: 601 Navigation, 183 Event Edit, 213 Subject Action, 61 Perspective Switch

Metrics: 6 Video Quality, 2 Setting, 4 Interaction, 8 Consistency, 2 Physical; human-validated

Models: 9 Text, 7 Camera, 6 Action

TL;DR

Interactive world models are advancing rapidly, yet no unified standard exists for evaluating them systematically. WBench fills this gap with 289 multi-turn cases across 5 dimensions, evaluating 23 models with 22 metrics validated against human judgments. We find that no single model dominates all dimensions.

Multi-turn: navigation + subject action + event editing + perspective switch (PS)

Navigation: W/A/S/D/left/right/up/down

Subject Action: character action

Event Editing: environment change

Perspective Switch: first-person (FP) ↔ third-person (TP)

Key Findings

📌

No model dominates all dimensions. Among 22 evaluated models, including commercial APIs (Kling 3.0, Seedance 1.5, Wan 2.7), open-source models (HY-Video 1.5, Cosmos 2.5, HY-World 1.5), and closed-source beta world models (Genie 3, Happy Oyster), each excels in different aspects. Kling 3.0 leads overall but lags in Consistency; HY-Video 1.5 ranks 1st in Consistency among text-conditioned models but struggles with Interaction; world models like Happy Oyster and HY-World 1.5 dominate Navigation yet underperform in Video Quality.

📌

Navigation is largely independent of other capabilities. Among text-conditioned models, YUME 1.5 achieves the highest navigation score (72.0) yet scores only 57.8 on event editing and ranks second to last on perspective switching (16.7). Conversely, Wan 2.7 leads in event editing (84.0) and ranks second in subject action (83.4) but scores only 66.0 on navigation. This suggests that navigation and semantic interaction require fundamentally different internal representations.

📌

Camera control does not imply subject control. Navigation accuracy and perspective consistency are distinct capabilities that most models fail to achieve simultaneously. For example, HY-World 1.5 ranks 1st in navigation (87.5) but scores only 62.5 in perspective consistency; conversely, LingBot-World (base-camera) achieves the highest perspective consistency (90.9) but trails in navigation (79.8).

📌

Physical correctness follows rendering quality, not control ability. Models with higher video quality tend to produce more physically plausible outputs (correlation ρ=0.82), while control ability (navigation, interaction) shows near-zero correlation with physics scores.
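
As a hedged illustration of how such a correlation could be computed, the sketch below assumes ρ denotes Spearman's rank correlation (the page does not say which coefficient is used) and applies it to the Quality and Physical columns of the 24-row leaderboard on this page. The function names are illustrative, this is not the authors' evaluation code, and the value obtained from these aggregate columns need not match the reported 0.82, which may have been computed over different scores.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Quality and Physical columns of the all-model leaderboard, in row order.
quality = [81.4, 78.9, 81.5, 78.1, 77.6, 79.4, 77.3, 82.1, 77.1, 79.3, 77.5, 72.9,
           77.1, 71.5, 75.2, 72.4, 77.6, 75.4, 77.0, 75.5, 74.0, 73.8, 73.0, 67.1]
physical = [69.3, 71.2, 71.8, 66.3, 67.4, 65.7, 63.5, 68.4, 66.7, 61.9, 63.3, 67.4,
            64.9, 65.2, 65.7, 66.8, 65.2, 68.9, 62.1, 59.3, 60.4, 57.2, 62.4, 51.4]

rho = spearman_rho(quality, physical)
print(f"Spearman rho (Quality vs. Physical): {rho:.2f}")
```

The same function can be pointed at the Interaction column instead of Quality to compare control ability against physics scores.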

📌

Multi-turn interactions compound errors. Navigation accuracy drops 21 points from turn 1 to turn 4 as errors accumulate across steps. Dedicated world models (e.g., HY-World 1.5) degrade much less than text-conditioned models (e.g., Kling 3.0), suggesting that explicit geometric control preserves spatial state better than text-based prompting.

Leaderboard

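Scores 0–100, higher is better. # = per-metric rank.

Abbreviations: Aesth = Aesthetic, Imag = Imaging, Flick = Flickering, Dyn = Dynamic, Smooth = Smoothness, Subj = Subject (setting), Navi = Navigation, EE = Event Edit, SA = Subject Action, PS = Perspective Switch, BgCon = Background, Spat = Spatial, GSp = Gated Spatial, Persp = Perspective, Seg = Segment, Geo = Geometric, Photo = Photometric, SubjC = Subject (consistency), VPlaus = Visual Plausibility, CFid = Causal Fidelity.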

#	Model (developer · access)	Average	Quality	Setting	Interaction	Consistency	Physical
🥇	Kling 3.0 Kling AI · API	79.1	81.4	91.0	70.3	83.7	69.3
🥈	LingBot-World base-camera Ant Group · Open Source	78.5	78.9	72.6	79.8	89.9	71.2
🥉	Wan 2.7 Alibaba · API	78.5	81.5	91.4	66.0	81.6	71.8
4	HY-World 1.5 ar-distill Tencent · Open Source	78.2	78.1	72.2	87.5	86.9	66.3
5	HY-Video 1.5 Tencent · Open Source	78.0	77.6	85.6	71.8	87.4	67.4
6	LingBot-World fast Ant Group · Open Source	77.5	79.4	77.9	79.4	84.9	65.7
7	Happy Oyster Alibaba · Web	76.9	77.3	74.2	85.1	84.3	63.5
8	Seedance 1.5 ByteDance · API	76.5	82.1	82.9	68.0	81.3	68.4
9	Lyra 2.0 4-step AR NVIDIA · Open Source	76.3	77.1	73.2	85.4	79.3	66.7
10	SANA-WM 4-step AR NVIDIA · Open Source	76.0	79.3	76.1	82.1	80.7	61.9
11	DreamX-World 5B AR Amap · Open Source	75.0	77.5	80.8	78.4	74.9	63.3
12	Cosmos 2.5 NVIDIA · Open Source	74.8	72.9	83.3	64.1	86.5	67.4
13	LTX 2.3 Lightricks · Open Source	74.4	77.1	85.2	67.6	77.2	64.9
14	InSpatio-World InSpatio · Open Source	73.9	71.5	71.4	72.8	88.4	65.2
15	Genie 3 Google · Web	73.9	75.2	72.5	73.3	82.6	65.7
16	Fantasy-World Amap · Open Source	73.8	72.4	71.3	72.1	86.4	66.8
17	YUME 1.5 Shanghai AI Lab · Open Source	73.5	77.6	72.4	72.0	80.1	65.2
18	LongCat-Video Meituan · Open Source	73.4	75.4	72.3	63.1	87.1	68.9
19	Infinite-World Meituan · Open Source	72.9	77.0	69.3	75.9	80.0	62.1
20	MatrixGame3 Skywork · Open Source	71.3	75.5	63.6	83.5	74.5	59.3
21	Kairos 3.0 SenseTime · Open Source	70.5	74.0	70.3	65.1	82.6	60.4
22	MatrixGame2 Skywork · Open Source	68.8	73.8	67.1	80.6	65.1	57.2
23	HY-GameCraft Tencent · Open Source	68.5	73.0	66.6	67.8	72.6	62.4
24	Astra Tsinghua · Open Source	63.8	67.1	59.6	67.7	73.3	51.4

Text-conditioned models, with Interaction averaged over Navi, EE, SA, and PS:

#	Model (developer · access)	Average	Quality	Setting	Interaction	Consistency	Physical
🥇	Kling 3.0 Kling AI · API	79.4	80.0	91.0	73.1	83.9	69.2
🥈	Wan 2.7 Alibaba · API	78.4	81.0	91.4	72.1	75.8	71.6
🥉	Seedance 1.5 ByteDance · API	76.2	81.9	82.9	68.3	79.9	68.2
4	HY-Video 1.5 Tencent · Open Source	74.3	76.6	85.6	54.7	87.5	67.1
5	LTX 2.3 Lightricks · Open Source	70.9	77.0	85.2	49.4	78.0	65.1
6	Cosmos 2.5 NVIDIA · Open Source	70.4	71.7	83.3	43.5	86.3	67.0
7	LongCat-Video Meituan · Open Source	69.9	77.2	72.3	45.1	86.6	68.4
8	YUME 1.5 Shanghai AI Lab · Open Source	68.9	77.6	72.4	48.4	80.9	65.4
9	Kairos 3.0 SenseTime · Open Source	65.7	73.1	70.3	41.6	83.2	60.5

#	Model	Aesth	Imag	Flick	Dyn	Smooth	HPSv3	Scene	Subj	Navi	BgCon	Spat	GSp	Persp	Seg	Geo	Photo	SubjC	VPlaus	CFid
🥇	Kling 3.0 Kling AI · API	63.0	68.1	93.2	97.5	97.6	69.1	89.0	92.9	70.3	92.3	75.2	75.1	76.8	93.0	88.9	79.9	88.5	60.7	78.0
🥈	LingBot-World base-camera Ant Group · Open Source	66.9	67.9	94.1	66.2	96.9	81.4	51.6	93.6	79.8	96.9	92.7	67.1	90.9	99.4	95.4	83.3	93.5	64.8	77.7
🥉	Wan 2.7 Alibaba · API	61.4	68.0	92.2	100.0	96.3	71.1	88.3	94.6	66.0	89.4	71.0	71.0	78.2	92.4	83.7	76.4	90.7	60.3	83.3
4	HY-World 1.5 ar-distill Tencent · Open Source	60.1	65.4	93.5	91.1	98.1	60.5	53.5	90.8	87.5	92.7	90.6	84.9	62.5	100.0	92.0	83.1	89.1	58.6	74.0
5	HY-Video 1.5 Tencent · Open Source	63.4	67.4	94.2	73.9	98.7	68.0	77.5	93.6	71.8	92.1	79.2	75.1	86.6	99.4	94.6	80.3	91.6	59.7	75.0
6	LingBot-World fast Ant Group · Open Source	62.6	63.8	92.4	95.6	96.0	65.7	63.4	92.4	79.4	90.9	77.2	76.9	82.8	98.1	85.4	79.1	88.6	58.8	72.5
7	Happy Oyster Alibaba · Web	56.6	63.9	94.0	94.2	97.0	58.3	57.4	91.1	85.1	91.4	77.7	75.8	75.0	96.2	87.2	79.8	91.5	57.6	69.3
8	Seedance 1.5 ByteDance · API	61.0	69.3	92.4	99.4	97.5	73.0	71.6	94.2	68.0	89.6	72.7	72.4	70.5	96.2	82.4	76.8	90.1	60.7	76.0
9	Lyra 2.0 4-step AR NVIDIA · Open Source	57.2	65.6	90.9	96.2	97.0	55.6	62.2	84.2	85.4	89.2	87.5	86.3	28.4	90.5	86.6	82.7	82.9	59.3	74.0
10	SANA-WM 4-step AR NVIDIA · Open Source	60.9	63.7	93.9	95.6	98.1	63.5	61.6	90.5	82.1	90.5	77.1	76.4	49.0	97.5	88.7	81.2	85.4	56.5	67.2
11	Cosmos 2.5 NVIDIA · Open Source	61.8	66.9	94.8	49.0	98.2	66.5	72.4	94.2	64.1	92.3	78.1	74.3	84.3	94.3	94.6	81.6	92.3	60.1	74.7
12	DreamX-World 5B AR Amap · Open Source	59.7	62.9	90.0	96.2	94.9	61.3	74.8	86.8	78.4	88.7	74.7	74.4	31.7	99.4	72.2	75.8	81.9	56.8	69.8
13	LTX 2.3 Lightricks · Open Source	57.9	61.0	93.2	98.1	96.4	56.1	81.3	89.2	67.6	88.3	70.2	70.2	69.8	75.8	76.9	79.2	87.2	55.7	74.0
14	InSpatio-World InSpatio · Open Source	64.4	67.6	96.0	26.1	98.8	76.1	51.7	91.1	72.8	95.0	93.8	66.5	72.5	100.0	97.3	87.4	94.4	63.1	67.3
15	Fantasy-World Amap · Open Source	63.0	62.8	95.8	49.0	97.9	65.8	52.4	90.1	72.1	94.2	80.6	64.2	79.8	100.0	95.3	84.8	92.5	59.7	74.0
16	Genie 3 Google · Web	51.6	59.3	95.0	92.4	97.8	55.2	61.1	83.8	73.3	90.7	79.9	78.4	54.5	93.6	88.6	84.5	90.4	59.7	71.7
17	LongCat-Video Meituan · Open Source	66.5	69.6	94.8	45.9	97.9	77.6	53.1	91.5	63.1	95.1	83.3	66.2	81.5	99.4	95.4	82.2	93.4	61.8	76.0
18	YUME 1.5 Shanghai AI Lab · Open Source	58.7	63.3	93.0	96.8	97.0	57.0	53.1	91.7	72.0	90.3	71.5	71.4	48.0	99.4	88.0	83.3	88.8	57.7	72.7
19	Infinite-World Meituan · Open Source	58.7	66.1	94.1	82.8	98.0	62.3	54.0	84.5	75.9	88.8	74.9	74.4	33.8	100.0	94.3	85.1	88.4	57.2	67.0
20	MatrixGame3 Skywork · Open Source	46.4	70.0	86.3	97.5	95.4	57.1	48.9	78.4	83.5	85.7	81.0	80.4	13.3	89.8	87.6	75.3	83.0	54.0	64.7
21	Kairos 3.0 SenseTime · Open Source	59.9	62.7	95.4	70.1	97.5	58.5	52.2	88.5	65.1	91.1	76.8	62.0	76.3	94.3	89.0	80.8	90.8	58.0	62.7
22	HY-GameCraft Tencent · Open Source	52.6	58.7	93.7	96.8	97.6	38.3	50.6	82.5	67.8	86.5	60.5	60.5	17.9	99.4	88.3	85.0	82.6	56.5	68.3
23	MatrixGame2 Skywork · Open Source	54.0	60.3	94.6	94.9	98.2	41.0	49.4	84.9	80.6	86.9	64.5	64.5	29.2	21.0	86.1	81.3	87.2	55.0	59.3
24	Astra Tsinghua · Open Source	48.6	52.5	96.0	79.6	97.7	28.0	43.4	75.9	67.7	85.3	64.7	63.3	30.0	86.6	85.6	87.5	83.5	54.6	48.3

#	Model	Aesth	Imag	Flick	Dyn	Smooth	HPSv3	Scene	Subj	Navi	EE	SA	PS	BgCon	Spat	GSp	Persp	Seg	Geo	Photo	SubjC	VPlaus	CFid
🥇	Kling 3.0 Kling AI · API	61.3	67.7	94.5	89.9	97.9	68.8	89.0	92.9	70.3	81.4	85.6	55.0	92.7	75.2	75.1	76.8	92.7	89.4	80.4	88.5	60.4	78.0
🥈	Wan 2.7 Alibaba · API	59.6	68.1	93.0	99.3	96.5	69.4	88.3	94.6	66.0	84.0	83.4	55.0	89.5	71.0	71.0	62.2	65.6	82.6	75.5	88.7	59.8	83.3
🥉	Seedance 1.5 ByteDance · API	59.7	69.8	93.4	98.3	97.6	72.9	71.6	94.2	68.0	80.4	80.0	45.0	89.6	72.7	72.4	62.7	92.4	83.5	76.7	89.3	60.5	76.0
4	HY-Video 1.5 Tencent · Open Source	61.9	67.4	95.5	68.8	98.8	67.5	77.5	93.6	71.8	63.8	55.6	27.6	92.4	79.2	75.1	86.6	99.3	94.4	81.4	91.5	59.3	75.0
5	LTX 2.3 Lightricks · Open Source	56.9	62.3	94.1	94.4	96.8	57.7	81.3	89.2	67.6	53.0	51.8	25.0	89.3	70.2	70.2	69.8	77.8	81.1	79.4	86.7	56.2	74.0
6	Cosmos 2.5 NVIDIA · Open Source	60.1	67.2	96.0	42.4	98.3	65.9	72.4	94.2	64.1	48.2	41.6	20.0	92.3	78.1	74.3	84.3	93.1	94.2	82.1	91.8	59.3	74.7
7	LongCat-Video Meituan · Open Source	64.7	69.8	94.9	59.7	97.7	76.3	53.1	91.5	63.1	50.4	48.4	18.3	94.7	83.3	66.2	81.5	98.6	94.7	81.5	92.4	60.8	76.0
8	YUME 1.5 Shanghai AI Lab · Open Source	59.3	65.7	94.8	86.1	97.7	62.0	53.1	91.7	72.0	57.8	47.0	16.7	92.0	71.5	71.4	48.0	99.3	91.1	84.1	89.4	58.1	72.7
9	Kairos 3.0 SenseTime · Open Source	58.4	63.6	96.3	63.5	97.9	58.8	52.2	88.5	65.1	46.8	41.4	13.3	91.8	76.8	62.0	76.3	94.1	91.5	82.1	90.7	58.2	62.7
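
The published numbers are consistent with simple unweighted averaging: each dimension score matches the mean of its sub-metrics, and Average matches the mean of the five dimension scores. In the 24-model table the Interaction column equals Navi alone, while in the 9-model table it equals the mean of Navi, EE, SA, and PS. This is an observation from the tables, not a formula stated on the page. The sketch below checks it for the Kling 3.0 row of the 9-model tables; the dictionary layout and the `aggregate` helper are illustrative.

```python
from statistics import mean

# Kling 3.0 sub-metric scores copied from the 9-model detailed table.
KLING_SUBMETRICS = {
    "Quality": [61.3, 67.7, 94.5, 89.9, 97.9, 68.8],
    "Setting": [89.0, 92.9],
    "Interaction": [70.3, 81.4, 85.6, 55.0],
    "Consistency": [92.7, 75.2, 75.1, 76.8, 92.7, 89.4, 80.4, 88.5],
    "Physical": [60.4, 78.0],
}

# Values published for Kling 3.0 in the 9-model summary table.
PUBLISHED = {
    "Quality": 80.0,
    "Setting": 91.0,
    "Interaction": 73.1,
    "Consistency": 83.9,
    "Physical": 69.2,
    "Average": 79.4,
}


def aggregate(submetrics):
    """Assumed scheme: unweighted mean per dimension, then mean of dimensions."""
    dims = {name: mean(values) for name, values in submetrics.items()}
    dims["Average"] = mean(list(dims.values()))
    return dims


scores = aggregate(KLING_SUBMETRICS)
for name, value in scores.items():
    # A difference of up to 0.05 is expected from rounding to one decimal.
    print(f"{name:12s} computed {value:6.2f}  published {PUBLISHED[name]:5.1f}")
```

All six computed values agree with the published ones to within one-decimal rounding, which supports the unweighted-mean reading for this row.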

Evaluation Metrics

22 metrics across 5 dimensions. All scores normalized to 0–100, higher is better.

Video Quality (6)

Aesthetic: VBench aesthetic scorer
Imaging: VBench technical quality
Flickering: inter-frame brightness stability
Dynamic: RAFT optical flow magnitude
Smoothness: RAFT flow consistency
HPSv3: Human Preference Score v3

Setting Adherence (2)

Scene: VLM, scene elements vs. environment prompt
Subject: VLM, appearance/action vs. character prompt

Interaction (4)

Navigation: MegaSAM pose estimation; NavScore = (Acc + Con) / 2, where Acc is accuracy and Con is consistency
Event Edit: VLM, environment changes vs. instruction
Subject Action: VLM, action execution correctness
Perspective Switch: VLM, FP↔TP transition accuracy
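
A minimal sketch of the NavScore formula, checked against the per-case Accuracy and Consistency values shown in the Metric Comparison examples on this page. The function name is illustrative, and how Accuracy and Consistency are themselves derived from MegaSAM poses is not covered here.

```python
def nav_score(accuracy, consistency):
    """NavScore = (Acc + Con) / 2, with both inputs on a 0-100 scale."""
    return (accuracy + consistency) / 2


# (model, Accuracy, Consistency, NavScore as displayed on the page)
examples = [
    ("Happy Oyster", 98.5, 91.2, 94.8),
    ("Genie 3", 57.5, 6.9, 32.2),
    ("LongCat-Video", 19.9, 92.6, 56.2),
    ("MatrixGame 3.0", 90.0, 72.7, 81.4),
]

for model, acc, con, shown in examples:
    # Displayed values are rounded to one decimal, so allow a 0.05 gap.
    print(f"{model:15s} computed {nav_score(acc, con):6.2f}  displayed {shown:5.1f}")
```

The LongCat-Video case shows why both terms matter: a trajectory can be very stable (92.6) while rarely following the requested direction (19.9).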

Consistency (8)

Spatial: DreamSim, first vs. last frame after loop
Gated Spatial: Spatial × dynamic degree gate
Perspective: SAM2 mask centroid stability
Segment: TransNetV2 shot-boundary detection; 1 − cut_rate
Geometric: DA3 depth reprojection error
Photometric: DA3 pixel reprojection PSNR
Subject: DINOv2+CLIP masked similarity
Background: CLIP background region similarity
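
For reference, Photometric consistency is reported as a PSNR. The sketch below implements only the standard definition, PSNR = 10 · log10(MAX² / MSE), on two equally sized frames given as flat lists of 8-bit pixel values. The depth-based reprojection step (DA3) and the mapping from decibels to the 0–100 scale are not described on this page and are not modelled; the function name and input layout are assumptions.

```python
import math


def psnr(reference, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(target) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, target)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# A reprojected frame that is uniformly 10 gray levels off the reference.
reference_frame = [100, 100, 100, 100]
reprojected_frame = [110, 110, 110, 110]
print(f"PSNR: {psnr(reference_frame, reprojected_frame):.2f} dB")
```

Higher PSNR means the reprojected pixels agree more closely with the observed frame, which is why the metric reads as photometric consistency.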

Physical (2)

Visual Plausibility: tuned VLM-based plausibility regressor
Causal Fidelity: VLM, physical cause-effect correctness

Metric Comparison

Same case, different models: see how the metrics capture quality differences.

Video Quality (6 sub-metrics)

Prompt: A modern city street in clear daylight. A Shiba Inu with tan-and-cream fur trots forward on a broad asphalt road lined with storefronts.
Interactions: W → right → right → left

Astra

Aesthetic 48.6, Imaging 52.5, HPSv3 28.0, Dynamic 79.6, Smoothness 97.7, Flickering 96.0

LingBot-World

Aesthetic 66.9, Imaging 67.9, HPSv3 81.4, Dynamic 66.2, Smoothness 96.9, Flickering 94.1

Setting Adherence (2 sub-metrics)

Prompt: A realistic basketball court, third-person view locked onto player #12 in red, tracking movement across polished hardwood floor with clean markings. Other players in red and blue uniforms move around. Hoop and backboard at far end under even indoor lighting.
Interactions: W → left → S

MatrixGame 3.0

Scene 30.0, Subject 20.0

Kling 3.0

Scene 100.0, Subject 100.0

Navigation Trajectory (3 sub-metrics)

Prompt: A third-person realistic scene on a dirt path between grapevine rows in a Tuscan vineyard in the afternoon. A man in a white linen shirt walks forward.
Interactions: W → D → S → A

Happy Oyster

NavScore 94.8, Accuracy 98.5, Consistency 91.2

Genie 3

NavScore 32.2, Accuracy 57.5, Consistency 6.9

Navigation Trajectory (3 sub-metrics)

Prompt: First-person view inside a neoclassical museum gallery, facing deeper into the hall. A wide marble staircase descends behind, a broad archway opens into an adjacent exhibition hall.
Interactions: left → left → right → right

LongCat-Video

NavScore 56.2, Accuracy 19.9, Consistency 92.6

MatrixGame 3.0

NavScore 81.4, Accuracy 90.0, Consistency 72.7

Subject Action Adherence

Prompt: An outdoor concrete basketball court, first-person view.
Action (Turn 1): Dribble the basketball with the right hand, bouncing it on the ground several times in place.

YUME 1.5

Score 35.0

Kling 3.0

Score 95.0

Event Edit Adherence

Prompt: A grand magical library interior in CG style. A young wizard in purple hooded robe, seen from behind.
Event (Turn 1): The wizard picks up the crystal staff.

LongCat-Video

Score 8.0

Wan 2.7

Score 96.0

Perspective Switch Adherence

Prompt: Large wooden sailing ship at sea during a storm. A pirate captain in a long dark coat with a tricorn hat.
Switch: TP → FP → TP

LongCat-Video

Score 0.0

Wan 2.7

Score 100.0

Spatial Consistency & Gated Spatial Consistency

Prompt: A retro 1980s arcade with rows of colorful game cabinets. Neon strip lights in pink and blue.
Interactions: A → A → D → D (loop trajectory)

HY-GameCraft

Spatial 8.8, Gated 66.8, Dynamic 100.0

HY-World 1.5

Spatial 87.4, Gated 87.4, Dynamic 100.0

Fantasy-World

Spatial 93.6, Gated 7.0, Dynamic 7.0

Note: Spatial Consistency measures frame similarity after a full loop (returning to the start). A static video can game this metric, so Gated Spatial Consistency weights it by motion magnitude, penalizing low-dynamic videos.
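
The gating idea can be sketched as follows. The exact gate function is not given on this page, so the linear gate used here (dynamic degree divided by 100, clamped to [0, 1]) is purely an assumption for illustration, and it is not guaranteed to reproduce the Gated values displayed above. The function and parameter names are hypothetical.

```python
def gated_spatial(spatial, dynamic, gate=None):
    """Scale a 0-100 spatial consistency score by a motion-dependent gate.

    `gate` maps a 0-100 dynamic degree to a weight; the default linear gate
    is an ASSUMPTION, not the benchmark's published definition.
    """
    if gate is None:
        gate = lambda d: d / 100.0  # assumed linear gate
    weight = min(1.0, max(0.0, gate(dynamic)))
    return spatial * weight


# A nearly static video: high loop similarity, almost no motion.
static_like = gated_spatial(spatial=93.6, dynamic=7.0)
# A video that actually moves and still returns to a similar frame.
moving = gated_spatial(spatial=87.4, dynamic=100.0)

print(f"static-like video: {static_like:.1f}")
print(f"moving video:      {moving:.1f}")
```

Under any gate of this shape, a video that barely moves cannot score well merely because its first and last frames look alike.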

Physical: Collision

Prompt: Outdoor basketball court on a sunny afternoon. A standard orange basketball resting on the concrete surface.
Interactions: A → A → D → D

Kairos 3.0

Causal Fidelity 10.0, Visual Plausibility 60.0

Genie 3

Causal Fidelity 100.0, Visual Plausibility 80.0

Physical: Surface Interaction

Prompt: Antarctic ice sheet under bright overcast sky. An emperor penguin with black back and yellow-orange ear patches waddles side to side.
Interactions: A → A → D → D

LTX 2.3

Causal Fidelity 0.0, Visual Plausibility 60.0

HY-Video 1.5

Causal Fidelity 100.0, Visual Plausibility 80.0

Statistics

289 cases, 4 interaction types, 6 scene categories, 5 subject types.

Perspective: FPP (178), TPP (111)

Turns/Case: 2 (67), 3 (39), 4 (148), 5 (16), 6+ (19)

Interaction: Navigation (601), Subject Action (213), Event Edit (183), PS (61)

SA Type: Locomotion (19), Manipulation (19), Tool Use (19), NPC (19)

EE Type: Environment (21), Appearance (20), Obj.Mech (19), Obj.Phys (16), Obj.Natural (12)

Subject: Human (126), Animal (18), Robot (17), Vehicle (14), Other (114)

Scene: Nature (89), Urban (61), Indoor (47), Work (40), Fantasy (29), Sports (23)

Style: Photorealistic (150), Styled (139)



Citation

If you find our work useful, please consider citing:

@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}