A unified framework addressing the foundational problem of utilizing large-scale human video datasets to bridge visual signals with ontology-independent actions, comprehensively evaluating latent representations on both low-level robotic execution and high-level semantic tasks across 1M+ videos and 151 action categories.
Dujun Nie, Fengjiao Chen*, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Meituan
*Project leader
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). Its comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs; (ii) latent visual space is fundamentally better aligned with physical action space than pixel space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
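To make the evaluation protocol concrete, here is a minimal sketch of frozen-feature probing in PyTorch: a pretrained backbone (e.g., V-JEPA 2 or DINOv3) encodes a (before, after) observation pair, and only a lightweight probe is trained to predict the action category. The backbone stub, feature dimension, image size, and training step are illustrative assumptions, not the benchmark's actual models or hyperparameters.

```python
import torch
import torch.nn as nn

FEAT_DIM, NUM_CLASSES = 768, 151  # 151 action categories, per the benchmark

class FrozenBackbone(nn.Module):
    """Stand-in stub for a pretrained encoder such as V-JEPA 2 or DINOv3."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, FEAT_DIM))
        for p in self.parameters():
            p.requires_grad_(False)  # the representation stays fixed during probing

    def forward(self, frames):  # frames: (B, 3, 64, 64)
        return self.encoder(frames)

class ActionProbe(nn.Module):
    """Linear probe over concatenated features of a (before, after) frame pair."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * FEAT_DIM, NUM_CLASSES)

    def forward(self, f_before, f_after):
        return self.head(torch.cat([f_before, f_after], dim=-1))

backbone, probe = FrozenBackbone(), ActionProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# One illustrative training step on random stand-in data.
before, after = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = nn.functional.cross_entropy(probe(backbone(before), backbone(after)), labels)
loss.backward()
optimizer.step()
```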
| Rank | Model | Paradigm | Params | Avg. Acc ↑ | Composite Acc (Human) | Composite Acc (Robot) |
|---|---|---|---|---|---|---|
| 🥇 | V-JEPA2 | Semantic | 303.89M | 75.39 | 80.35 | 70.43 |
| 🥈 | DINOv3 | Semantic | 303.13M | 72.63 | 76.19 | 69.06 |
| 🥉 | Wan2.2 | Pixel | 704.69M | 66.58 | 67.77 | 65.39 |
| 4 | LAPA-DINOv3 | General LAM | 472.45M | 45.62 | 64.19 | 27.04 |
| 5 | FLUX.2-dev | Pixel | 84.05M | 45.24 | 46.12 | 44.36 |
| 6 | LAPA-DINOv2 | General LAM | 473.69M | 45.03 | 55.86 | 34.19 |
| 7 | LAPA-MAGVIT2 | General LAM | 116.40M | 44.62 | 59.70 | 29.53 |
| 8 | LAPA-SigLIP2 | General LAM | 200.83M | 42.09 | 54.74 | 29.44 |
| 9 | villa-X | Embodied LAM | 238.71M | 23.85 | 17.80 | 29.90 |
| 10 | LAPA | Embodied LAM | 343.80M | 19.13 | 14.61 | 23.64 |
| 11 | UniVLA | Embodied LAM | 287.75M | 18.82 | 19.08 | 18.56 |
General vision foundation models dramatically surpass specialized Embodied LAMs: V-JEPA 2 achieves 76.62% accuracy and DINOv3 reaches 0.19 MSE, while Embodied LAMs remain at 18–21% accuracy and 0.87–0.97 MSE. Large-scale self-supervised visual pre-training, even without explicit motion supervision, already encodes rich action-relevant structure.
Among the general encoders, latent-based models (V-JEPA 2, DINOv3) significantly outperform pixel-based models (Wan2.2, FLUX.2-dev) on regression tasks despite comparable classification performance, suggesting that decoding actions from learned latent representations is a more effective pathway to general robotic control.
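As a companion to the classification probe above, the following is a minimal sketch of the regression side: a small head maps frozen pair features to a continuous motion target (a hypothetical 7-DoF end-effector delta) and is scored with MSE. The dimensions and target format are assumptions for illustration, not the benchmark's actual trajectory parameterization.

```python
import torch
import torch.nn as nn

FEAT_DIM, ACTION_DIM = 768, 7  # ACTION_DIM is a hypothetical 7-DoF delta

reg_head = nn.Sequential(
    nn.Linear(2 * FEAT_DIM, 256),
    nn.GELU(),
    nn.Linear(256, ACTION_DIM),  # predicted continuous action step
)

pair_feats = torch.randn(8, 2 * FEAT_DIM)  # frozen (before, after) features
target = torch.randn(8, ACTION_DIM)        # ground-truth motion deltas
mse = nn.functional.mse_loss(reg_head(pair_feats), target)
mse.backward()
```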
General LAMs, which apply LAPA's self-supervised paradigm on top of general vision backbones, consistently outperform Embodied LAMs. LAPA-DINOv2 achieves 49.36% accuracy vs. UniVLA's 17.99% despite sharing the same backbone; the key differentiator is broader, more diverse training data, not architectural novelty.
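For readers unfamiliar with the paradigm, below is a minimal sketch of a LAPA-style latent action model: an inverse dynamics encoder compresses a frame pair into a discrete latent action via vector quantization, and a forward model must reconstruct the next frame's features from the current frame plus that latent. The layer shapes and the straight-through estimator are illustrative assumptions; only the codebook size of 64 comes from the ablation discussed next.

```python
import torch
import torch.nn as nn

FEAT_DIM, CODEBOOK, LATENT = 768, 64, 256  # codebook size 64 per the ablation

class LatentActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.idm = nn.Linear(2 * FEAT_DIM, LATENT)         # inverse dynamics
        self.codebook = nn.Embedding(CODEBOOK, LATENT)     # discrete latent actions
        self.fdm = nn.Linear(FEAT_DIM + LATENT, FEAT_DIM)  # forward dynamics

    def forward(self, f_t, f_next):
        z = self.idm(torch.cat([f_t, f_next], dim=-1))
        # Nearest-codebook lookup with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codebook.weight)
        code = self.codebook(dists.argmin(dim=-1))
        z_q = z + (code - z).detach()
        pred_next = self.fdm(torch.cat([f_t, z_q], dim=-1))
        return pred_next, z, code

model = LatentActionModel()
f_t, f_next = torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM)
pred, z, code = model(f_t, f_next)
loss = (nn.functional.mse_loss(pred, f_next)         # next-frame reconstruction
        + nn.functional.mse_loss(z, code.detach()))  # commitment term
loss.backward()
```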
Ablation studies reveal two critical design axes. First, self-supervised contrastive priors (e.g., DINOv3) consistently outperform reconstruction-based encoders at spatial-temporal correspondence. Second, the stability of the quantization bottleneck matters: a moderate codebook size (64), sequence length (49), and latent dimension (256) strike the best capacity-stability balance.
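The reported sweet spot, written out as a plain configuration sketch; the key names and the interpretation of each field are hypothetical, and only the values come from the ablation above.

```python
# Hypothetical config keys; values are the ablation's reported optimum.
latent_action_config = {
    "codebook_size": 64,    # moderate codebook keeps quantization stable
    "sequence_length": 49,  # latent tokens per clip (interpretation assumed)
    "latent_dim": 256,      # dimensionality of each latent action code
}
```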
Distribution of action categories across the LARY benchmark, covering atomic kinematic primitives and composite behaviors for both robot and human embodiments.
Figure 1a. Word cloud of action verbs.
Figure 1b. Word cloud of object nouns.
28 kinematic primitives, 25,940 image pairs from LIBERO.
123 classes, 692K clips from egocentric human datasets.
54 classes, 538K clips across 11 robotic embodiments.
Figure 2a. Positive sample – attention heatmap for catch action.
Figure 2b. Positive sample – attention heatmap for compress action.
Figure 2c. Negative sample – attention heatmap for brew action.
Figure 2d. Negative sample – attention heatmap for clench action.
@article{lary2026,
title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
}