A unified framework addressing the foundational problem of utilizing large-scale human video datasets to bridge visual signals with ontology-independent actions, comprehensively evaluating latent representations on both low-level robotic execution and high-level semantic tasks across 1M+ videos and 151 action categories.
Dujun Nie, Fengjiao Chen*, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Meituan
*Project leader
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). Its comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs; (ii) latent visual space is fundamentally better aligned with physical action space than pixel space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
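To make the evaluation protocol concrete, here is a minimal sketch of frozen-feature probing in PyTorch: a pretrained backbone (e.g., V-JEPA 2 or DINOv3) encodes a (before, after) observation pair, and only a lightweight probe is trained to predict the action category. The backbone stub, feature dimension, image size, and training step are illustrative assumptions, not the benchmark's actual models or hyperparameters.

```python
import torch
import torch.nn as nn

FEAT_DIM, NUM_CLASSES = 768, 151  # 151 action categories, per the benchmark

class FrozenBackbone(nn.Module):
    """Stand-in stub for a pretrained encoder such as V-JEPA 2 or DINOv3."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, FEAT_DIM))
        for p in self.parameters():
            p.requires_grad_(False)  # the representation stays fixed during probing

    def forward(self, frames):  # frames: (B, 3, 64, 64)
        return self.encoder(frames)

class ActionProbe(nn.Module):
    """Linear probe over concatenated features of a (before, after) frame pair."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * FEAT_DIM, NUM_CLASSES)

    def forward(self, f_before, f_after):
        return self.head(torch.cat([f_before, f_after], dim=-1))

backbone, probe = FrozenBackbone(), ActionProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# One illustrative training step on random stand-in data.
before, after = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = nn.functional.cross_entropy(probe(backbone(before), backbone(after)), labels)
loss.backward()
optimizer.step()
```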
| Rank | Model | Paradigm | Params | Avg. Acc ↑ | Composite Acc (Human) | Composite Acc (Robot) |
|---|---|---|---|---|---|---|
| 🥇 | V-JEPA2 | Semantic | 303.89M | 75.39 | 80.35 | 70.43 |
| 🥈 | DINOv3 | Semantic | 303.13M | 72.63 | 76.19 | 69.06 |
| 🥉 | Wan2.2 | Pixel | 704.69M | 66.58 | 67.77 | 65.39 |
| 4 | LAPA-DINOv3 | General LAM | 472.45M | 45.62 | 64.19 | 27.04 |
| 5 | FLUX.2-dev | Pixel | 84.05M | 45.24 | 46.12 | 44.36 |
| 6 | LAPA-DINOv2 | General LAM | 473.69M | 45.03 | 55.86 | 34.19 |
| 7 | LAPA-MAGVIT2 | General LAM | 116.40M | 44.62 | 59.70 | 29.53 |
| 8 | LAPA-SigLIP2 | General LAM | 200.83M | 42.09 | 54.74 | 29.44 |
| 9 | villa-X | Embodied LAM | 238.71M | 23.85 | 17.80 | 29.90 |
| 10 | LAPA | Embodied LAM | 343.80M | 19.13 | 14.61 | 23.64 |
| 11 | UniVLA | Embodied LAM | 287.75M | 18.82 | 19.08 | 18.56 |
General vision foundation models dramatically surpass specialized Embodied LAMs: V-JEPA 2 achieves 76.62% accuracy and DINOv3 reaches 0.19 MSE, while Embodied LAMs remain at 18–21% accuracy and 0.87–0.97 MSE. Large-scale self-supervised visual pre-training, even without explicit motion supervision, already encodes rich action-relevant structure.
Among the general encoders, latent-based models (V-JEPA 2, DINOv3) significantly outperform pixel-based models (Wan2.2, FLUX.2-dev) on regression tasks despite comparable classification performance, suggesting that decoding actions from learned latent representations is a more effective pathway to general robotic control.
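As a companion to the classification probe above, the following is a minimal sketch of the regression side: a small head maps frozen pair features to a continuous motion target (a hypothetical 7-DoF end-effector delta) and is scored with MSE. The dimensions and target format are assumptions for illustration, not the benchmark's actual trajectory parameterization.

```python
import torch
import torch.nn as nn

FEAT_DIM, ACTION_DIM = 768, 7  # ACTION_DIM is a hypothetical 7-DoF delta

reg_head = nn.Sequential(
    nn.Linear(2 * FEAT_DIM, 256),
    nn.GELU(),
    nn.Linear(256, ACTION_DIM),  # predicted continuous action step
)

pair_feats = torch.randn(8, 2 * FEAT_DIM)  # frozen (before, after) features
target = torch.randn(8, ACTION_DIM)        # ground-truth motion deltas
mse = nn.functional.mse_loss(reg_head(pair_feats), target)
mse.backward()
```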
General LAMs, which apply LAPA's self-supervised paradigm on top of general vision backbones, consistently outperform Embodied LAMs. LAPA-DINOv2 achieves 49.36% accuracy vs. UniVLA's 17.99% despite sharing the same backbone; the key differentiator is broader, more diverse training data, not architectural novelty.
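For readers unfamiliar with the paradigm, below is a minimal sketch of a LAPA-style latent action model: an inverse dynamics encoder compresses a frame pair into a discrete latent action via vector quantization, and a forward model must reconstruct the next frame's features from the current frame plus that latent. The layer shapes and the straight-through estimator are illustrative assumptions; only the codebook size of 64 comes from the ablation discussed next.

```python
import torch
import torch.nn as nn

FEAT_DIM, CODEBOOK, LATENT = 768, 64, 256  # codebook size 64 per the ablation

class LatentActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.idm = nn.Linear(2 * FEAT_DIM, LATENT)         # inverse dynamics
        self.codebook = nn.Embedding(CODEBOOK, LATENT)     # discrete latent actions
        self.fdm = nn.Linear(FEAT_DIM + LATENT, FEAT_DIM)  # forward dynamics

    def forward(self, f_t, f_next):
        z = self.idm(torch.cat([f_t, f_next], dim=-1))
        # Nearest-codebook lookup with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codebook.weight)
        code = self.codebook(dists.argmin(dim=-1))
        z_q = z + (code - z).detach()
        pred_next = self.fdm(torch.cat([f_t, z_q], dim=-1))
        return pred_next, z, code

model = LatentActionModel()
f_t, f_next = torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM)
pred, z, code = model(f_t, f_next)
loss = (nn.functional.mse_loss(pred, f_next)         # next-frame reconstruction
        + nn.functional.mse_loss(z, code.detach()))  # commitment term
loss.backward()
```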
Ablation studies reveal two critical design axes. First, self-supervised contrastive priors (e.g., DINOv3) consistently outperform reconstruction-based encoders at spatial-temporal correspondence. Second, the stability of the quantization bottleneck matters: a moderate codebook size (64), sequence length (49), and latent dimension (256) strike the best capacity-stability balance.
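The reported sweet spot, written out as a plain configuration sketch; the key names and the interpretation of each field are hypothetical, and only the values come from the ablation above.

```python
# Hypothetical config keys; values are the ablation's reported optimum.
latent_action_config = {
    "codebook_size": 64,    # moderate codebook keeps quantization stable
    "sequence_length": 49,  # latent tokens per clip (interpretation assumed)
    "latent_dim": 256,      # dimensionality of each latent action code
}
```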
Distribution of action categories across the LARY benchmark, covering atomic kinematic primitives and composite behaviors for both robot and human embodiments.
Figure 1a. Word cloud of action verbs.
Figure 1b. Word cloud of object nouns.
28 kinematic primitives, 25,940 image pairs from LIBERO.
123 classes, 692K clips from egocentric human datasets.
54 classes, 538K clips across 11 robotic embodiments.
Figure 2a. Positive sample – attention heatmap for catch action.
Figure 2b. Positive sample – attention heatmap for compress action.
Figure 2c. Negative sample – attention heatmap for brew action.
Figure 2d. Negative sample – attention heatmap for clench action.
@article{lary2026,
title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
}