$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models
via Tweedie Discrete Diffusion


Zhihao Zhan¹,   Jiaying Zhou¹,   Likui Zhang¹,   Qinhan Lyu¹,   Hao Liu¹,   Jusheng Zhang¹,   Weizheng Li¹,   Ziliang Chen¹,   Tianshui Chen³⁴,   Ruifeng Zhai¹,   Keze Wang¹,   Liang Lin¹²³,   Guangrun Wang¹²³

¹ Sun Yat-sen University, Guangzhou, China
² Guangdong Key Laboratory of Big Data Analysis and Processing
³ X-Era AI Lab
⁴ Guangdong University of Technology

Abstract


Vision–Language–Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce $\mathcal{E}_0$, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, $\mathcal{E}_0$ naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that $\mathcal{E}_0$ achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
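The quantized action space underlying this formulation can be illustrated with uniform binning of normalized actions. This is a minimal sketch under the assumption that each action dimension is normalized to [-1, 1]; the bin count of 2048 follows the ablation in Fig. 3(a), and all function names here are illustrative, not the paper's API.

```python
# Illustrative sketch of uniform action quantization (assumption: actions are
# normalized to [-1, 1] per dimension; 2048 bins per the ablation study).
import numpy as np

def quantize(actions: np.ndarray, n_bins: int = 2048) -> np.ndarray:
    """Map continuous actions in [-1, 1] to integer token ids in [0, n_bins-1]."""
    clipped = np.clip(actions, -1.0, 1.0)
    ids = np.floor((clipped + 1.0) / 2.0 * n_bins).astype(np.int64)
    return np.minimum(ids, n_bins - 1)  # the boundary value 1.0 lands in the last bin

def dequantize(ids: np.ndarray, n_bins: int = 2048) -> np.ndarray:
    """Map token ids back to bin centers in [-1, 1]."""
    return (ids.astype(np.float64) + 0.5) / n_bins * 2.0 - 1.0

a = np.array([-1.0, 0.0, 0.731])
tokens = quantize(a)
recovered = dequantize(tokens)
# reconstruction error is bounded by half a bin width (1 / n_bins)
```

With 2048 bins the worst-case round-trip error is below 0.0005 in normalized units, which matches the intuition in the abstract that real-world control has an effective finite resolution.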

Overview


Fig.1 Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
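The spherical viewpoint perturbation mentioned in the abstract can be sketched as jittering the camera's spherical coordinates (radius, azimuth, elevation) around the workspace center and rebuilding a look-at pose. The perturbation ranges and function names below are assumptions for illustration, not the paper's exact parameters.

```python
# Hypothetical sketch of spherical viewpoint perturbation: jitter spherical
# camera coordinates around a look-at center (ranges are assumed values).
import numpy as np

def perturb_camera(center, radius, azim, elev, rng,
                   d_azim=0.2, d_elev=0.1, d_radius=0.05):
    """Return a perturbed camera position and a look-at rotation matrix."""
    azim = azim + rng.uniform(-d_azim, d_azim)        # radians
    elev = elev + rng.uniform(-d_elev, d_elev)
    radius = radius * (1.0 + rng.uniform(-d_radius, d_radius))
    # spherical -> Cartesian offset from the look-at center
    pos = center + radius * np.array([
        np.cos(elev) * np.cos(azim),
        np.cos(elev) * np.sin(azim),
        np.sin(elev),
    ])
    # look-at frame: the z column points from the camera toward the center
    z = center - pos
    z /= np.linalg.norm(z)
    x = np.cross(np.array([0.0, 0.0, 1.0]), z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return pos, np.stack([x, y, z], axis=1)

rng = np.random.default_rng(0)
pos, R = perturb_camera(np.zeros(3), 1.0, 0.0, 0.6, rng)
```

Because only the camera pose changes while the scene and actions stay fixed, such perturbations augment viewpoint diversity without collecting additional data.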


Fig.2 Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches and recent mask-based discrete diffusion methods, which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion–based policies and AR–diffusion hybrids that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision–language backbones and supporting fine-grained action control.


Results


Ablation Experiments


Fig.3 Comprehensive ablation analysis of key hyperparameters in the LIBERO environments. We investigate four crucial design factors influencing our model: (a) Discretization bins—increasing bin granularity enhances precision up to 2048 bins, beyond which gains saturate; (b) Action horizon—a moderate prediction length balances reactivity and temporal consistency; (c) Action dimensions—embedding sizes slightly above the dataset’s action space yield the best expressiveness–robustness trade-off; and (d) One-hot smoothing factor—moderate decay values smooth discrete logits, stabilizing diffusion and improving overall success rate.
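The one-hot smoothing factor in Fig. 3(d) can be pictured as replacing a hard one-hot target over action bins with a soft target that spreads mass to neighboring bins, so near-miss bins are penalized less harshly. The exponential-decay parameterization below is an assumption for illustration; the paper's exact smoothing scheme may differ.

```python
# Illustrative sketch of one-hot smoothing over action bins (the exponential
# decay form is an assumed parameterization, not necessarily the paper's).
import numpy as np

def smooth_one_hot(token: int, n_bins: int, decay: float = 0.5) -> np.ndarray:
    """Exponentially decayed soft target centered on `token`."""
    idx = np.arange(n_bins)
    weights = decay ** np.abs(idx - token)  # weight 1 at the target, decaying outward
    return weights / weights.sum()          # normalize to a distribution

target = smooth_one_hot(4, n_bins=9, decay=0.5)
# mass peaks at bin 4 and decays symmetrically toward neighboring bins
```

A decay of 0 recovers the hard one-hot target, while values near 1 flatten the target toward uniform; the ablation suggests moderate values work best.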

LIBERO Compare Results


$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task "pick up the milk and place it in the basket", our $\mathcal{E}_0$ successfully grasped the object with the correct posture and completed the task without knocking it over.

$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task put_the_bowl_on_the_plate, our $\mathcal{E}_0$ accurately and swiftly grasped the object and completed the task.

ManiSkill Compare Results


$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task PegInsertionSide, our $\mathcal{E}_0$ achieved the best performance on this highly dexterous task: it precisely aligned the peg with the hole and successfully inserted it, while other models performed poorly.

$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task PlugCharger, our $\mathcal{E}_0$ performed the task with high precision and fine-grained control.

$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task StackCube, our $\mathcal{E}_0$ accomplished the task remarkably well.

VLABench Compare Results


$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task "pick up the spade 3", our $\mathcal{E}_0$ correctly identified and precisely grasped the target card, showing superior multimodal reasoning and control.

$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task "select mahjong", our $\mathcal{E}_0$ correctly identified the target object specified by the instruction and completed the task without knocking over any other items.

$\pi_0$ Failure
$\pi_0$-FAST Failure
$\pi_{0.5}$ Failure
$\mathcal{E}_0$ Success

In the task "select painting", our $\mathcal{E}_0$ correctly interpreted the task instruction and completed the task accordingly, rather than mechanically pressing the center button.

RoboTwin Results


adjust bottle
beat block hammer
click alarmclock
click bell
dump bin bigbin
move can pot
move pillbottle pad
move playingcard away
move stapler pad
open laptop
open microwave
place a2b left
place a2b right
place container plate
place empty cup
place fan
place mouse pad
place object scale
place object stand
place phone stand
place shoe
press stapler
rotate qrcode
shake bottle
shake bottle horizontally
stamp seal
turn switch
blocks ranking rgb
blocks ranking size
grab roller
handover block
handover mic
hanging mug
lift pot
pick diverse bottles
pick dual bottles
place bread basket
place bread skillet
place burger fries
place can basket
place cans plasticbox
place dual shoes
place object basket
put bottles dustbin
put object cabinet
scan object
stack blocks three
stack blocks two
stack bowls three
stack bowls two

RoboTwin benchmark overview. The benchmark consists of 27 single-arm and 23 dual-arm manipulation tasks, spanning a diverse range of objects and actions. Several dual-arm tasks additionally require coordinated bimanual control, posing greater challenges for policy learning.

Real-World Experiments


Short-Horizon Tasks

Performance on real-world robotic experiments. Short-horizon tasks (pick block, press button, close door, pull drawer, stack block).

Long-Horizon Tasks

Performance on real-world robotic experiments. Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).

Unseen Scenario

Comparison of keyframes in the real-world pick twice task under unseen scenarios. During evaluation, we tested the model across various unseen scenarios. In most cases, the model successfully completed the task. In a few cases, task quality was slightly compromised but still acceptable—for example, placement deviations caused by an oversized green dish, or changes in the order of object picking due to color variations.

Human Intervention

Comparison of keyframes in the real-world pick twice task with and without human intervention. We introduce human perturbation by manually shifting the target cube to disrupt the model's original plan. After the interruption, the model is able to promptly adapt and replan its actions.

Pick Vegetables Twice Tasks

Qualitative results on the real-world Pick Vegetables Twice task under randomly arranged tabletop scenes. Each video shows one complete execution sequence consisting of two consecutive pick-and-place operations (carrot and cucumber, potato and eggplant, or pepper and corn). Across all settings, the surrounding distractor objects are placed in completely random configurations, demonstrating the robustness of our policy under visually diverse and cluttered real-world environments.

Let's Get In Touch!


If you have any questions, feel free to send us a message and we'll get back to you as soon as possible!
Email: zhanzhh6@mail2.sysu.edu.cn