Zhihao Zhan¹, Jiaying Zhou¹, Likui Zhang¹, Qinhan Lyu¹, Hao Liu¹, Jusheng Zhang¹, Weizheng Li¹, Ziliang Chen¹, Tianshui Chen³⁴, Ruifeng Zhai¹, Keze Wang¹, Liang Lin¹²³, Guangrun Wang¹²³
¹ Sun Yat-sen University, Guangzhou, China
² Guangdong Key Laboratory of Big Data Analysis and Processing
³ X-Era AI Lab
⁴ Guangdong University of Technology
Vision–Language–Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce $\mathcal{E}_0$, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, $\mathcal{E}_0$ naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that $\mathcal{E}_0$ achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7\% on average.
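To make the quantized-action-token idea concrete, the sketch below shows one common way to map continuous actions to discrete tokens and back: uniform per-dimension binning with bin-center reconstruction. This is an illustrative assumption, not the authors' released code; the bin count follows the ablation in Fig. 3(a), while the action range, function names, and normalization are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): uniform
# per-dimension quantization of normalized continuous actions into
# discrete token ids, and bin-center reconstruction back to actions.
NUM_BINS = 2048  # Fig. 3(a) reports gains saturating around 2048 bins

def quantize_actions(actions, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer token ids."""
    actions = np.clip(actions, low, high)
    # Scale to [0, 1], then to bin indices in [0, num_bins - 1].
    scaled = (actions - low) / (high - low)
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)

def dequantize_tokens(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map token ids back to the continuous bin-center actions."""
    centers = (tokens.astype(np.float64) + 0.5) / num_bins
    return low + centers * (high - low)

a = np.array([-1.0, -0.3, 0.0, 0.7, 1.0])
a_hat = dequantize_tokens(quantize_actions(a))
# Round-trip error is bounded by one bin width.
assert np.all(np.abs(a - a_hat) <= (high := 1.0, 2.0 / NUM_BINS)[1])
```

Under this scheme, each action dimension becomes a symbol from a fixed vocabulary, which is what allows a token-based backbone to denoise actions with the same machinery it uses for language tokens.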
Fig.1 Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
Fig.2 Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches and recent mask-based discrete diffusion methods, which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion–based policies and AR–diffusion hybrids that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision–language backbones and supporting fine-grained action control.

Fig.3 Comprehensive ablation analysis of key hyperparameters in the LIBERO environments. We investigate four crucial design factors influencing our model: (a) Discretization bins—increasing bin granularity enhances precision up to 2048 bins, beyond which gains saturate; (b) Action horizon—a moderate prediction length balances reactivity and temporal consistency; (c) Action dimensions—embedding sizes slightly above the dataset’s action space yield the best expressiveness–robustness trade-off; and (d) One-hot smoothing factor—moderate decay values smooth the discrete logits, stabilizing diffusion and improving the overall success rate.
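One plausible reading of the one-hot smoothing factor in Fig. 3(d) is a target distribution that decays over neighboring quantization bins, so that adjacent bins receive small probability mass instead of a hard one-hot spike. The sketch below illustrates that idea with an exponential decay; the decay form, function name, and parameter values are assumptions for illustration and may differ from the paper's exact scheme.

```python
import numpy as np

# Hypothetical illustration: smooth a one-hot action-token target over
# neighboring bins with an exponential decay, preserving the ordinal
# structure of quantized actions (nearby bins mean nearby actions).
def smoothed_target(token, num_bins, decay=0.5):
    """Return a probability vector peaked at `token`, decaying with
    distance to the target bin."""
    idx = np.arange(num_bins)
    weights = decay ** np.abs(idx - token)  # weight 1 at the target bin
    return weights / weights.sum()

p = smoothed_target(token=3, num_bins=8, decay=0.5)
assert np.argmax(p) == 3            # mass still peaks at the target
assert abs(p.sum() - 1.0) < 1e-9    # valid probability distribution
assert p[2] == p[4]                 # symmetric around the target bin
```

A smoothed target of this kind penalizes a prediction one bin away far less than one across the vocabulary, which matches the caption's observation that moderate decay values stabilize training of the discrete diffusion head.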
In the task "pick up the milk and place it in the basket", our $\mathcal{E}_0$ successfully grasped the object with the correct posture and completed the task without knocking it over.
In the task put_the_bowl_on_the_plate, our $\mathcal{E}_0$ accurately and swiftly grasped the object and completed the task.
In the task PegInsertionSide, our $\mathcal{E}_0$ achieved the best performance on this highly dexterous task: it precisely aligned the peg with the hole and successfully inserted it, while other models performed poorly.
In the task PlugCharger, our $\mathcal{E}_0$ performed the task with high precision and fine-grained control.
In the task StackCube, our $\mathcal{E}_0$ accomplished the task remarkably well.
In the task "pick up the spade 3", our $\mathcal{E}_0$ correctly identified and precisely grasped the target card, showing superior multimodal reasoning and control.
In the task "select mahjong", our $\mathcal{E}_0$ correctly identified the target object specified by the instruction and completed the task without knocking over any other items.
In the task "select painting", our $\mathcal{E}_0$ correctly interpreted the task instruction and completed the task accordingly, rather than mechanically pressing the center button.
RoboTwin benchmark overview. The benchmark consists of 27 single-arm and 23 dual-arm manipulation tasks, spanning a diverse range of objects and actions. Several dual-arm tasks additionally require coordinated bimanual control, posing higher challenges for policy learning.
Performance on real-world robotic experiments. Short-horizon tasks (pick block, press button, close door, pull drawer, stack block).
Performance on real-world robotic experiments. Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).
Comparison of keyframes in the real-world pick twice task under unseen scenarios. During evaluation, we tested the model across various unseen scenarios. In most cases, the model successfully completed the task. In a few cases, task quality was slightly compromised but still acceptable—for example, placement deviations caused by an oversized green dish, or changes in the order of object picking due to color variations.
Comparison of keyframes in the real-world pick twice task with and without human intervention. We introduce human perturbation by manually shifting the target cube to disrupt the model's original plan. After the interruption, the model is able to promptly adapt and replan its actions.
Qualitative results on the real-world Pick Vegetables Twice task under randomly arranged tabletop scenes. Each video shows one complete execution sequence consisting of two consecutive pick-and-place operations (carrot and cucumber, potato and eggplant, pepper and corn). Across all settings, the surrounding distractor objects are placed in completely random configurations, demonstrating the robustness of our policy under visually diverse and cluttered real-world environments.
If you have any questions, feel free to send us a message and we'll get back to you as soon as possible!
Email: zhanzhh6@mail2.sysu.edu.cn