Why do current test-time scaling methods fall short?
Vision-Language-Action (VLA) models enable robots to follow natural language instructions, but they struggle with robustness at test time. Existing test-time scaling (TTS) approaches try to address this, but they come with significant drawbacks that limit practical deployment.
Limitations of existing TTS methods vs. our approach
Existing TTS methods require sampling multiple action candidates and evaluating them, significantly increasing inference time and computational cost.
Many approaches rely on separate verifier models or value functions to select among candidates, adding complexity and training overhead.
Current methods intervene only at action decoding, keeping visual representations fixed. Yet under perceptual ambiguity, how to perceive matters as much as what to do.
Key Insight: Inspired by Active Inference theory, we propose to jointly modulate both perception and action based on the model's own uncertainty—broadening exploration when uncertain, and focusing on exploitation when confident. This enables adaptive behavior with no additional training, no verifier, and only a single forward pass.
Jointly modulating visual attention and action decoding based on self-uncertainty
SCALE computes a self-uncertainty measure from the model's own predictions, then uses it to adaptively control both visual attention (how the model perceives the scene) and action decoding (how the model selects actions). High uncertainty triggers broader exploration; low uncertainty enables focused exploitation.
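The uncertainty signal can be illustrated with a short sketch. The snippet below (PyTorch) computes a normalized-entropy proxy for self-uncertainty from the action-token logits of a single forward pass; the tensor shapes and function name are illustrative assumptions, and the paper defines SCALE's exact measure.

import torch
import torch.nn.functional as F

def self_uncertainty(action_logits: torch.Tensor) -> torch.Tensor:
    # action_logits: (num_action_tokens, vocab_size) logits from one forward
    # pass of the VLA policy. Returns a scalar in [0, 1]:
    # 0 = fully confident, 1 = maximally uncertain.
    log_probs = F.log_softmax(action_logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)             # per-token entropy
    max_entropy = torch.log(torch.tensor(float(action_logits.shape[-1])))
    return (entropy / max_entropy).mean()                  # average over action tokens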
Overview of SCALE: Self-uncertainty modulates both visual perception and action decoding
Under high uncertainty, SCALE broadens visual attention to take in wider scene context. When confident, it focuses attention on task-relevant regions for precise execution.
High uncertainty leads to more exploratory action sampling (higher temperature). Low uncertainty enables deterministic, confident action selection for reliable execution.
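To make these two behaviors concrete, here is a minimal sketch of how a single uncertainty score u could drive both controls. The temperature ranges, the linear schedule, and the function name are illustrative assumptions, not SCALE's exact parameterization.

import torch
import torch.nn.functional as F

def modulate(visual_attn_logits, action_logits, u,
             attn_temp_range=(1.0, 2.0), act_temp_range=(0.0, 1.0)):
    # u in [0, 1] is the self-uncertainty score from the same forward pass.
    # Higher u -> flatter (broader) visual attention and more exploratory
    # action sampling; u near 0 -> focused attention and greedy decoding.
    attn_temp = attn_temp_range[0] + u * (attn_temp_range[1] - attn_temp_range[0])
    act_temp = act_temp_range[0] + u * (act_temp_range[1] - act_temp_range[0])

    # Broaden or sharpen attention over image patches via the softmax temperature.
    attn = F.softmax(visual_attn_logits / attn_temp, dim=-1)

    if act_temp < 1e-3:                                     # confident: deterministic choice
        action_tokens = action_logits.argmax(dim=-1)
    else:                                                   # uncertain: temperature sampling
        probs = F.softmax(action_logits / act_temp, dim=-1)
        action_tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return attn, action_tokens

Because both controls reuse quantities already produced by the forward pass, this modulation adds no extra sampling rounds, verifier calls, or training.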
SCALE consistently improves state-of-the-art VLA models across both simulated and real-world benchmarks, outperforming existing test-time scaling methods while maintaining single-pass efficiency.
Success Rate (%) on LIBERO with the OpenVLA backbone. Additional experiments with other VLA architectures are reported in the paper.
Success Rate (%) on real-world “Put A on B” pick-and-place tasks (A = object, B = receptacle) under in-distribution (ID) and out-of-distribution (OOD) conditions.
@article{choiALKCC2026scale,
author = {Choi, Hyeonbeom and Ahn, Daechul and Lee, Youhan and Kang, Taewook and Cho, Seongwon and Choi, Jonghyun},
title = {SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models},
journal = {arXiv preprint arXiv:2602.04208},
year = {2026},
}