Zhihao Zhan¹, Jiaying Zhou¹, Likui Zhang¹, Qinhan Lv¹, Hao Liu¹, Jusheng Zhang¹, Weizheng Li¹, Ziliang Chen¹, Tianshui Chen³⁴, Keze Wang¹, Liang Lin¹²³, Guangrun Wang¹²³
¹ Sun Yat-sen University, Guangzhou, China
² Guangdong Key Laboratory of Big Data Analysis and Processing
³ X-Era AI Lab
⁴ Guangdong University of Technology
Vision–Language–Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce $\mathcal{E}_0$, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, $\mathcal{E}_0$ offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control—whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals—and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, $\mathcal{E}_0$ supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions—yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that $\mathcal{E}_0$ achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that $\mathcal{E}_0$ delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
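To make the spherical viewpoint perturbation concrete, below is a minimal NumPy sketch of one plausible implementation (our assumption; the paper's exact procedure may differ): the camera position is jittered in spherical coordinates (radius, elevation, azimuth) around the workspace center, and a look-at rotation is rebuilt so the perturbed camera still faces the scene, yielding augmented viewpoints without collecting new data.

```python
# Minimal sketch of spherical viewpoint perturbation (our assumption, not the
# authors' released code): jitter the camera pose on the viewing sphere around
# the workspace center, then recompute a look-at extrinsic.
import numpy as np

def perturb_viewpoint(cam_pos, center, max_deg=10.0, max_radius=0.05, rng=None):
    """Return a perturbed camera position and world-from-camera rotation."""
    if rng is None:
        rng = np.random.default_rng()
    offset = cam_pos - center
    r = np.linalg.norm(offset)
    elev = np.arcsin(offset[2] / r)            # elevation of the original pose
    azim = np.arctan2(offset[1], offset[0])    # azimuth of the original pose
    # Sample small perturbations on the viewing sphere (ranges are assumptions).
    r += rng.uniform(-max_radius, max_radius)
    elev += np.deg2rad(rng.uniform(-max_deg, max_deg))
    azim += np.deg2rad(rng.uniform(-max_deg, max_deg))
    new_pos = center + r * np.array([np.cos(elev) * np.cos(azim),
                                     np.cos(elev) * np.sin(azim),
                                     np.sin(elev)])
    # Look-at rotation: the forward axis points at the workspace center
    # (assumes the camera is not looking straight down the world z-axis).
    forward = (center - new_pos) / np.linalg.norm(center - new_pos)
    right = np.cross(forward, [0.0, 0.0, 1.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    return new_pos, np.stack([right, up, forward], axis=1)
```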
Fig.1 Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
Fig.2 Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches and recent mask-based discrete diffusion methods, which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion–based policies and AR–diffusion hybrids that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision–language backbones and supporting fine-grained action control.
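To make the continuized discrete diffusion recipe above concrete, here is a minimal training-side sketch in PyTorch. It is our own illustration under assumed details (uniform per-dimension bins, smoothed one-hot targets, a cosine noise schedule), not the released implementation; `denoiser` stands for any network that maps noisy tokens, timestep, and vision–language conditioning to per-bin logits. Training with cross-entropy is what makes the optimal denoiser the Bayes-optimal posterior over discrete actions mentioned in the abstract.

```python
# Minimal training-side sketch of continuized discrete diffusion over action
# tokens (our illustration; details such as the noise schedule are assumptions).
import torch
import torch.nn.functional as F

NUM_BINS = 2048   # per-dimension bin count; Fig. 3(a) reports saturation here
SMOOTH = 0.1      # one-hot smoothing factor (assumed form; cf. Fig. 3(d))

def quantize(actions, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer bin indices."""
    frac = (actions.clamp(low, high) - low) / (high - low)
    return (frac * (NUM_BINS - 1)).round().long()

def smoothed_one_hot(tokens):
    """Smoothed one-hot targets: SMOOTH of the mass is spread over all bins."""
    one_hot = F.one_hot(tokens, NUM_BINS).float()
    return one_hot * (1.0 - SMOOTH) + SMOOTH / NUM_BINS

def corrupt(x0, t):
    """Variance-preserving Gaussian corruption of the continuized tokens."""
    alpha = torch.cos(0.5 * torch.pi * t)          # assumed cosine schedule
    return alpha * x0 + (1.0 - alpha**2).sqrt() * torch.randn_like(x0)

def training_loss(denoiser, actions, cond):
    """Cross-entropy denoising loss; its minimizer is the Bayes-optimal
    posterior over discrete bins given the noisy tokens and conditioning."""
    tokens = quantize(actions)                     # (B, horizon, dim)
    x0 = smoothed_one_hot(tokens)                  # (B, horizon, dim, NUM_BINS)
    t = torch.rand(x0.shape[0], 1, 1, 1)           # random diffusion time
    logits = denoiser(corrupt(x0, t), t, cond)     # predict the clean bins
    return F.cross_entropy(logits.flatten(0, 2), tokens.flatten())
```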
Fig.3 Comprehensive ablation analysis of key hyperparameters in the LIBERO environments. We investigate four crucial design factors influencing our model: (a) Discretization bins—increasing bin granularity enhances precision up to 2048 bins, beyond which gains saturate; (b) Action horizon—a moderate prediction length balances reactivity and temporal consistency; (c) Action dimensions—embedding sizes slightly above the dataset’s action space yield the best expressiveness–robustness trade-off; and (d) One-hot smooth factor—moderate decay values smooth discrete logits, stabilizing diffusion and improving overall success rate.
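For completeness, a matching inference-side sketch (again an illustration under the same assumptions, not the paper's exact sampler): starting from Gaussian noise over the continuized token space, each step uses the denoiser's per-bin posterior as the clean-token estimate and re-noises it to the next timestep; a final argmax recovers bin indices, which are dequantized into an action chunk of length `horizon`, matching the bin-count and action-horizon knobs ablated above.

```python
# Matching inference-side sketch (assumed DDIM-style sampler, reusing the
# quantization constants and `denoiser` interface from the training sketch).
import torch

NUM_BINS = 2048  # must match the bin count used at training time

@torch.no_grad()
def sample_actions(denoiser, cond, horizon, act_dim, steps=10):
    """Iteratively denoise from Gaussian noise to a discrete action chunk."""
    x = torch.randn(1, horizon, act_dim, NUM_BINS)         # start from noise
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1, 1, 1, 1), i / steps)
        probs = denoiser(x, t, cond).softmax(dim=-1)       # posterior over bins
        x0_hat = probs                # expected clean token (smoothing ignored)
        alpha = torch.cos(0.5 * torch.pi * torch.full_like(t, (i - 1) / steps))
        x = alpha * x0_hat + (1 - alpha**2).sqrt() * torch.randn_like(x0_hat)
    tokens = x.argmax(dim=-1)                              # final bin indices
    return tokens.float() / (NUM_BINS - 1) * 2.0 - 1.0     # dequantize to [-1, 1]
```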
Comparison on the LIBERO benchmark. In the task "pick up the milk and place it in the basket", our $\mathcal{E}_0$ successfully grasped the object with the correct posture and completed the task without knocking it over.
Comparison on the ManiSkill benchmark. In the highly dexterous PegInsertionSide task, our $\mathcal{E}_0$ achieved the best performance: it precisely aligned the peg with the hole and inserted it successfully, while other models performed poorly.
Comparison on the VLABench benchmark. In the task "pick up the spade 3", our $\mathcal{E}_0$ correctly identified and precisely grasped the target card, showing superior multimodal reasoning and control.
RoboTwin benchmark overview. The benchmark consists of 27 single-arm and 23 dual-arm manipulation tasks, spanning a diverse range of objects and actions. Several dual-arm tasks additionally require coordinated bimanual control, posing higher challenges for policy learning.
Performance on real-world robotic experiments. Short-horizon tasks (pick block, press button, close door, pull drawer, stack block).
Performance on real-world robotic experiments. Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).
Comparison of keyframes in the real-world pick twice task under unseen scenarios. In most cases, the model successfully completed the task; in a few cases, task quality was slightly compromised but still acceptable, for example placement deviations caused by an oversized green dish, or a change in picking order due to color variations.
Comparison of keyframes in the real-world pick twice task with and without human intervention. We introduce a human perturbation by manually shifting the target cube to disrupt the model's original plan. After the interruption, the model promptly adapts and replans its actions.
Qualitative results on the real-world Pick Vegetables Twice task under randomly arranged tabletop scenes. Each video shows one complete execution sequence consisting of two consecutive pick-and-place operations (carrot and cucumber, potato and eggplant, pepper and corn). Across all settings, the surrounding distractor objects are placed in completely random configurations, demonstrating the robustness of our policy under visually diverse and cluttered real-world environments.
If you have any questions, feel free to send us a message and we'll get back to you as soon as possible!
Email: zhanzhh6@mail2.sysu.edu.cn