Why do current test-time scaling methods fall short?
Vision-Language-Action (VLA) models enable robots to follow natural language instructions, but they struggle with robustness at test time. Existing test-time scaling (TTS) approaches try to address this, but they come with significant drawbacks that limit practical deployment.
Limitations of existing TTS methods vs. our approach
Existing TTS methods require sampling multiple action candidates and evaluating them, significantly increasing inference time and computational cost.
Many approaches rely on separate verifier models or value functions to select among candidates, adding complexity and training overhead.
Current methods intervene only at action decoding, keeping visual representations fixed. Yet under perceptual ambiguity, how to perceive matters as much as what to do.
Key Insight: Inspired by Active Inference theory, we propose to jointly modulate both perception and action based on the model's own uncertainty—broadening exploration when uncertain, and focusing on exploitation when confident. This enables adaptive behavior with no additional training, no verifier, and only a single forward pass.
Jointly modulating visual attention and action decoding based on self-uncertainty
SCALE computes a self-uncertainty measure from the model's own predictions, then uses it to adaptively control both visual attention (how the model perceives the scene) and action decoding (how the model selects actions). High uncertainty triggers broader exploration; low uncertainty enables focused exploitation.
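The uncertainty signal can be illustrated with a short sketch. The snippet below (PyTorch) computes a normalized-entropy proxy for self-uncertainty from the action-token logits of a single forward pass; the tensor shapes and function name are illustrative assumptions, and the paper defines SCALE's exact measure.

import torch
import torch.nn.functional as F

def self_uncertainty(action_logits: torch.Tensor) -> torch.Tensor:
    # action_logits: (num_action_tokens, vocab_size) logits from one forward
    # pass of the VLA policy. Returns a scalar in [0, 1]:
    # 0 = fully confident, 1 = maximally uncertain.
    log_probs = F.log_softmax(action_logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)             # per-token entropy
    max_entropy = torch.log(torch.tensor(float(action_logits.shape[-1])))
    return (entropy / max_entropy).mean()                  # average over action tokens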
Overview of SCALE: Self-uncertainty modulates both visual perception and action decoding
Under high uncertainty, SCALE broadens visual attention to take in wider scene context. When confident, it focuses attention on task-relevant regions for precise execution.
High uncertainty leads to more exploratory action sampling (higher temperature). Low uncertainty enables deterministic, confident action selection for reliable execution.
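To make these two behaviors concrete, here is a minimal sketch of how a single uncertainty score u could drive both controls. The temperature ranges, the linear schedule, and the function name are illustrative assumptions, not SCALE's exact parameterization.

import torch
import torch.nn.functional as F

def modulate(visual_attn_logits, action_logits, u,
             attn_temp_range=(1.0, 2.0), act_temp_range=(0.0, 1.0)):
    # u in [0, 1] is the self-uncertainty score from the same forward pass.
    # Higher u -> flatter (broader) visual attention and more exploratory
    # action sampling; u near 0 -> focused attention and greedy decoding.
    attn_temp = attn_temp_range[0] + u * (attn_temp_range[1] - attn_temp_range[0])
    act_temp = act_temp_range[0] + u * (act_temp_range[1] - act_temp_range[0])

    # Broaden or sharpen attention over image patches via the softmax temperature.
    attn = F.softmax(visual_attn_logits / attn_temp, dim=-1)

    if act_temp < 1e-3:                                     # confident: deterministic choice
        action_tokens = action_logits.argmax(dim=-1)
    else:                                                   # uncertain: temperature sampling
        probs = F.softmax(action_logits / act_temp, dim=-1)
        action_tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return attn, action_tokens

Because both controls reuse quantities already produced by the forward pass, this modulation adds no extra sampling rounds, verifier calls, or training.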
SCALE consistently improves state-of-the-art VLA models across both simulated and real-world benchmarks, outperforming existing test-time scaling methods while maintaining single-pass efficiency.
Success Rate (%) on LIBERO with the OpenVLA backbone. Additional experiments with other VLA architectures are reported in the paper.
Success Rate (%) on real-world “Put A on B” pick-and-place tasks (A = object, B = receptacle) under in-distribution (ID) and out-of-distribution (OOD) conditions.
@article{choiALKCC2026scale,
author = {Choi, Hyeonbeom and Ahn, Daechul and Lee, Youhan and Kang, Taewook and Cho, Seongwon and Choi, Jonghyun},
title = {SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models},
journal = {arXiv preprint arXiv:2602.04208},
year = {2026},
}