Zhihao Zhan¹, Jiaying Zhou¹, Likui Zhang¹, Qinhan Lyu¹, Hao Liu¹, Jusheng Zhang¹, Weizheng Li¹, Ziliang Chen¹, Tianshui Chen³⁴, Ruifeng Zhai¹, Keze Wang¹, Liang Lin¹²³, Guangrun Wang¹²³
¹ Sun Yat-sen University, Guangzhou, China
² Guangdong Key Laboratory of Big Data Analysis and Processing
³ X-Era AI Lab
⁴ Guangdong University of Technology
Vision–Language–Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce $\mathcal{E}_0$, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, $\mathcal{E}_0$ naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that $\mathcal{E}_0$ achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7\% on average.
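To make the quantized-action-token idea concrete, the sketch below shows one common way to map continuous actions to discrete tokens and back: uniform per-dimension binning with bin-center reconstruction. This is an illustrative assumption, not the authors' released code; the bin count follows the ablation in Fig. 3(a), while the action range, function names, and normalization are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): uniform
# per-dimension quantization of normalized continuous actions into
# discrete token ids, and bin-center reconstruction back to actions.
NUM_BINS = 2048  # Fig. 3(a) reports gains saturating around 2048 bins

def quantize_actions(actions, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer token ids."""
    actions = np.clip(actions, low, high)
    # Scale to [0, 1], then to bin indices in [0, num_bins - 1].
    scaled = (actions - low) / (high - low)
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)

def dequantize_tokens(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map token ids back to the continuous bin-center actions."""
    centers = (tokens.astype(np.float64) + 0.5) / num_bins
    return low + centers * (high - low)

a = np.array([-1.0, -0.3, 0.0, 0.7, 1.0])
a_hat = dequantize_tokens(quantize_actions(a))
# Round-trip error is bounded by one bin width.
assert np.all(np.abs(a - a_hat) <= (high := 1.0, 2.0 / NUM_BINS)[1])
```

Under this scheme, each action dimension becomes a symbol from a fixed vocabulary, which is what allows a token-based backbone to denoise actions with the same machinery it uses for language tokens.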
Fig.1 Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
Fig.2 Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches and recent mask-based discrete diffusion methods, which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion–based policies and AR–diffusion hybrids that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision–language backbones and supporting fine-grained action control.

Fig.3 Comprehensive ablation analysis of key hyperparameters in the LIBERO environments. We investigate four crucial design factors influencing our model: (a) Discretization bins—increasing bin granularity enhances precision up to 2048 bins, beyond which gains saturate; (b) Action horizon—a moderate prediction length balances reactivity and temporal consistency; (c) Action dimensions—embedding sizes slightly above the dataset’s action space yield the best expressiveness–robustness trade-off; and (d) One-hot smoothing factor—moderate decay values smooth the discrete logits, stabilizing diffusion and improving the overall success rate.
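One plausible reading of the one-hot smoothing factor in Fig. 3(d) is a target distribution that decays over neighboring quantization bins, so that adjacent bins receive small probability mass instead of a hard one-hot spike. The sketch below illustrates that idea with an exponential decay; the decay form, function name, and parameter values are assumptions for illustration and may differ from the paper's exact scheme.

```python
import numpy as np

# Hypothetical illustration: smooth a one-hot action-token target over
# neighboring bins with an exponential decay, preserving the ordinal
# structure of quantized actions (nearby bins mean nearby actions).
def smoothed_target(token, num_bins, decay=0.5):
    """Return a probability vector peaked at `token`, decaying with
    distance to the target bin."""
    idx = np.arange(num_bins)
    weights = decay ** np.abs(idx - token)  # weight 1 at the target bin
    return weights / weights.sum()

p = smoothed_target(token=3, num_bins=8, decay=0.5)
assert np.argmax(p) == 3            # mass still peaks at the target
assert abs(p.sum() - 1.0) < 1e-9    # valid probability distribution
assert p[2] == p[4]                 # symmetric around the target bin
```

A smoothed target of this kind penalizes a prediction one bin away far less than one across the vocabulary, which matches the caption's observation that moderate decay values stabilize training of the discrete diffusion head.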
In the task "pick up the milk and place it in the basket", our $\mathcal{E}_0$ successfully grasped the object with the correct posture and completed the task without knocking it over.
In the task put_the_bowl_on_the_plate, our $\mathcal{E}_0$ accurately and swiftly grasped the object and completed the task.
In the task PegInsertionSide, our $\mathcal{E}_0$ achieved the best performance on this highly dexterous task: it precisely aligned the peg with the hole and successfully inserted it, while other models performed poorly.
In the task PlugCharger, our $\mathcal{E}_0$ performed the task with high precision and fine-grained control.
In the task StackCube, our $\mathcal{E}_0$ accomplished the task remarkably well.
In the task "pick up the spade 3", our $\mathcal{E}_0$ correctly identified and precisely grasped the target card, showing superior multimodal reasoning and control.
In the task "select mahjong", our $\mathcal{E}_0$ correctly identified the target object specified by the instruction and completed the task without knocking over any other items.
In the task "select painting", our $\mathcal{E}_0$ correctly interpreted the task instruction and completed the task accordingly, rather than mechanically pressing the center button.
RoboTwin benchmark overview. The benchmark consists of 27 single-arm and 23 dual-arm manipulation tasks, spanning a diverse range of objects and actions. Several dual-arm tasks additionally require coordinated bimanual control, posing higher challenges for policy learning.
Performance on real-world robotic experiments. Short-horizon tasks (pick block, press button, close door, pull drawer, stack block).
Performance on real-world robotic experiments. Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).
Comparison of keyframes in the real-world pick twice task under unseen scenarios. During evaluation, we tested the model across various unseen scenarios. In most cases, the model successfully completed the task. In a few cases, task quality was slightly compromised but still acceptable—for example, placement deviations caused by an oversized green dish, or changes in the order of object picking due to color variations.
Comparison of keyframes in the real-world pick twice task with and without human intervention. We introduce human perturbation by manually shifting the target cube to disrupt the model's original plan. After the interruption, the model is able to promptly adapt and replan its actions.
Qualitative results on the real-world Pick Vegetables Twice task under randomly arranged tabletop scenes. Each video shows one complete execution sequence consisting of two consecutive pick-and-place operations (carrot and cucumber, potato and eggplant, pepper and corn). Across all settings, the surrounding distractor objects are placed in completely random configurations, demonstrating the robustness of our policy under visually diverse and cluttered real-world environments.
If you have any questions, feel free to send us a message and we'll get back to you as soon as possible!
Email: zhanzhh6@mail2.sysu.edu.cn