LARY
A Latent Action Representation Yielding Benchmark
for Generalizable Vision-to-Action Alignment

A unified framework addressing the foundational problem of leveraging large-scale human video datasets to bridge visual signals and ontology-independent actions, comprehensively assessing latent representations on both low-level robotic execution and high-level semantic tasks across 1M+ video clips and 151 action categories.

📄 Paper 💻 Code Dataset
1M+ Video Clips · 1,000+ Hours · 151 Action Categories · 11 Embodiments

Dujun Nie, Fengjiao Chen*, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Meituan   *Project leader  

// Abstract

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million video clips (more than 1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models (LAMs); (ii) latent-based visual space is fundamentally better aligned with physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
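The two task levels can be summarized as lightweight probes trained on top of frozen representations. Below is a minimal sketch of that protocol, assuming a frozen encoder exposing an `encode(o_t, o_t1)` method; the class names, dimensions, and 7-dim action vector are illustrative placeholders, not the released evaluation code.

```python
# Minimal sketch of the two-level evaluation described above (illustrative, not the released code).
# Assumptions: `encoder` is any frozen visual / latent-action model exposing encode(o_t, o_t1) -> (B, D);
# the classification head covers the 151 action categories, the regression head predicts a
# low-level action vector (e.g., end-effector deltas). All sizes are placeholders.
import torch
import torch.nn as nn

LATENT_DIM = 1024        # depends on the evaluated backbone
NUM_CLASSES = 151        # action categories in LARY
ACTION_DIM = 7           # e.g., 6-DoF end-effector delta + gripper (assumption)

class SemanticProbe(nn.Module):
    """High-level task: classify 'what to do' from a frozen latent action representation."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(LATENT_DIM, NUM_CLASSES)

    def forward(self, z):
        return self.head(z)

class ControlProbe(nn.Module):
    """Low-level task: regress 'how to do it' (continuous robot actions) from the same latent."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.GELU(), nn.Linear(256, ACTION_DIM)
        )

    def forward(self, z):
        return self.head(z)

@torch.no_grad()
def extract_latent(encoder, o_t, o_t1):
    # The encoder stays frozen; only the lightweight probes are trained.
    return encoder.encode(o_t, o_t1)

cls_loss = nn.CrossEntropyLoss()   # reported as classification accuracy
reg_loss = nn.MSELoss()            # reported as regression MSE
```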

LARY overview
// Leaderboard
Model Rankings
Comprehensive evaluation across classification accuracy and regression MSE.
| Rank | Model | Paradigm | Params | Avg. Acc ↑ | Comp. Human | Comp. Robot |
|------|-------|----------|--------|------------|-------------|-------------|
| 🥇 1 | V-JEPA2 | Semantic | 303.89M | 75.39 | 80.35 | 70.43 |
| 🥈 2 | DINOv3 | Semantic | 303.13M | 72.63 | 76.19 | 69.06 |
| 🥉 3 | Wan2.2 | Pixel | 704.69M | 66.58 | 67.77 | 65.39 |
| 4 | LAPA-DINOv3 | General LAM | 472.45M | 45.62 | 64.19 | 27.04 |
| 5 | FLUX.2-dev | Pixel | 84.05M | 45.24 | 46.12 | 44.36 |
| 6 | LAPA-DINOv2 | General LAM | 473.69M | 45.03 | 55.86 | 34.19 |
| 7 | LAPA-MAGVIT2 | General LAM | 116.40M | 44.62 | 59.70 | 29.53 |
| 8 | LAPA-SigLIP2 | General LAM | 200.83M | 42.09 | 54.74 | 29.44 |
| 9 | villa-X | Embodied LAM | 238.71M | 23.85 | 17.80 | 29.90 |
| 10 | LAPA | Embodied LAM | 343.80M | 19.13 | 14.61 | 23.64 |
| 11 | UniVLA | Embodied LAM | 287.75M | 18.82 | 19.08 | 18.56 |
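For readers scanning the table, Avg. Acc appears to be the unweighted mean of the Comp. Human and Comp. Robot columns; a quick check under that assumption:

```python
# Sanity check of the leaderboard's Avg. Acc column, assuming it is the
# unweighted mean of the Comp. Human and Comp. Robot accuracies.
rows = {
    "V-JEPA2": (80.35, 70.43),   # reported Avg. Acc 75.39
    "DINOv3":  (76.19, 69.06),   # reported Avg. Acc 72.63
    "Wan2.2":  (67.77, 65.39),   # reported Avg. Acc 66.58
}
for model, (human, robot) in rows.items():
    print(model, round((human + robot) / 2, 3))
# -> 75.39, 72.625, 66.58, consistent with the reported values up to rounding
```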
// Key Findings
What We Discovered
Four key insights from our comprehensive evaluation.
Finding 01

Robust Latent Actions Emerge from Large-Scale Visual Pre-training

General vision foundation models dramatically surpass specialized Embodied LAMs. V-JEPA 2 achieves 76.62% Acc and DINOv3 reaches 0.19 MSE, while Embodied LAMs remain at 18–21% Acc and 0.87–0.97 MSE. Large-scale self-supervised visual pre-training, even without explicit motion supervision, already encodes rich action-relevant structure.

Finding 02

Latent Feature Space Aligns Better with Robotic Actions than Pixel Space

Within general encoders, latent-based models (V-JEPA 2, DINOv3) significantly outperform pixel-based models (Wan2.2, FLUX.2-dev) on regression tasks, despite comparable classification performance. Decoding actions from learned latent representations offers a more effective pathway to general robotic control.
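A toy contrast may make the finding concrete. This is not the benchmark's actual pixel baseline (Wan2.2 and FLUX.2-dev are generative models); it only illustrates, with hypothetical modules and dimensions, the difference between regressing actions from raw pixel change versus from the change in a frozen encoder's latent features.

```python
# Illustrative contrast between the two pathways compared in Finding 02 (not the paper's exact setup).
# Both feed the same lightweight action regressor; only the representation of the
# observation pair (o_t, o_t1) differs.
import torch
import torch.nn as nn

class PixelPathway(nn.Module):
    """Pixel-based: represent the transition by a downsampled frame difference."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(16)          # crude spatial reduction
        self.proj = nn.Linear(3 * 16 * 16, out_dim)

    def forward(self, o_t, o_t1):
        diff = self.pool(o_t1 - o_t)                  # pixel-space change
        return self.proj(diff.flatten(1))

class LatentPathway(nn.Module):
    """Latent-based: represent the transition by the change in frozen encoder features."""
    def __init__(self, encoder, feat_dim, out_dim=1024):
        super().__init__()
        self.encoder = encoder.eval()                 # frozen (e.g., a DINOv3 / V-JEPA 2 style backbone)
        self.proj = nn.Linear(feat_dim, out_dim)

    @torch.no_grad()
    def _feat(self, x):
        return self.encoder(x)                        # assumed to return pooled features (B, feat_dim)

    def forward(self, o_t, o_t1):
        return self.proj(self._feat(o_t1) - self._feat(o_t))

# The same action regressor is then trained on top of either pathway and scored by MSE.
```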

Finding 03

Diverse Pre-training Data Unlocks LAM Generalization

General LAMs, which apply LAPA's self-supervised paradigm to general vision backbones, consistently outperform Embodied LAMs. LAPA-DINOv2 achieves 49.36% Acc vs. UniVLA's 17.99% despite sharing the same backbone; the key differentiator is broader, more diverse training data, not architectural novelty.

Finding 04

Effective LAMs Require Robust Priors and Stable Quantization

Ablation studies reveal two critical design axes: self-supervised contrastive priors (e.g., DINOv3) consistently outperform reconstruction-based encoders for spatial-temporal correspondence, and a stable quantization bottleneck, with a moderate codebook size (64), sequence length (49), and latent dimension (256), achieves the best capacity-stability balance.
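The following is a compact sketch of the kind of quantization bottleneck this finding refers to, using the ablation's sweet-spot sizes; it is a generic vector-quantization layer with a straight-through estimator, not the exact module used by any of the evaluated LAMs.

```python
# Sketch of the quantization bottleneck discussed in Finding 04 (illustrative; assumptions marked).
# A LAPA-style latent action model discretizes the change between frame features into a short
# sequence of codebook indices; the sizes below mirror the ablation's reported sweet spot.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 64    # moderate codebook, per Finding 04
SEQ_LEN = 49          # e.g., a 7x7 grid of latent action tokens (assumption)
LATENT_DIM = 256

class LatentActionQuantizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, LATENT_DIM)

    def forward(self, z):                       # z: (B, SEQ_LEN, LATENT_DIM) continuous latent actions
        d = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        idx = d.argmin(-1)                      # (B, SEQ_LEN) discrete latent action tokens
        z_q = self.codebook(idx)
        # Straight-through estimator keeps the encoder trainable through the discrete bottleneck.
        z_st = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach())    # commitment term stabilizes codebook usage
        return z_st, idx, commit
```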

// Dataset
Benchmark Overview
Covering atomic kinematic primitives and complex composite behaviors across diverse embodiments.
Action Category Distribution
Atomic kinematic primitives and composite behaviors across robot and human embodiments
Composite Human
652,297 samples · 123 categories
Composite Robot
538,423 samples · 54 categories
Atomic Robot
25,940 samples · 28 categories

Distribution of action categories across the LARY benchmark, covering atomic kinematic primitives and composite behaviors for both robot and human embodiments.

Action Word Cloud

Figure 1a. Word cloud of action verbs.

Object Word Cloud

Figure 1b. Word cloud of object nouns.

Atomic Robot

28 kinematic primitives, 25,940 image pairs from LIBERO.

Composite Human

123 classes, 692K clips from egocentric human datasets.

Composite Robot

54 classes, 538K clips across 11 robotic embodiments.

// Samples
Attention Heatmap Visualizations
Spatial attention maps from latent action models across diverse action categories, highlighting the regions most relevant to action recognition.
Positive Sample
Heatmap - Catch

Figure 2a. Positive sample: attention heatmap for the catch action.

Positive Sample
Heatmap - Compress

Figure 2b. Positive sample: attention heatmap for the compress action.

Negative Sample
Heatmap - Brew

Figure 2c. Negative sample: attention heatmap for the brew action.

Negative Sample
Heatmap - Clench

Figure 2d. Negative sample: attention heatmap for the clench action.
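Heatmaps like the ones above are typically obtained by projecting token-level attention back onto the image grid. Below is a minimal, generic version of that recipe, assuming access to the last block's attention tensor of a ViT-style encoder with a CLS token; the paper's exact visualization pipeline may differ.

```python
# Generic CLS-to-patch attention heatmap (illustrative; not necessarily the paper's exact method).
# Assumption: `attn` is the last block's attention with shape (heads, 1+N, 1+N), where the CLS
# token sits at index 0 and the N patch tokens form an h x w grid.
import torch
import torch.nn.functional as F

def cls_attention_heatmap(attn, grid_hw, image_hw):
    """Return an (H, W) saliency map in [0, 1] from CLS-to-patch attention."""
    cls_to_patches = attn[:, 0, 1:].mean(0)                  # average heads: CLS -> patch attention
    h, w = grid_hw
    heat = cls_to_patches.reshape(1, 1, h, w)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = heat.squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Usage (hypothetical): overlay cls_attention_heatmap(attn, (14, 14), (224, 224)) on the input frame.
```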

// Citation
Cite Our Work
@article{lary2026,
  title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
  author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
  year={2026},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}