Zhihao Zhan¹, Jiaying Zhou¹, Likui Zhang¹, Qinhan Lv¹, Hao Liu¹, Jusheng Zhang¹, Weizheng Li¹, Ziliang Chen¹, Tianshui Chen³⁴, Keze Wang¹, Liang Lin¹²³, Guangrun Wang¹²³
¹ Sun Yat-sen University, Guangzhou, China
² Guangdong Key Laboratory of Big Data Analysis and Processing
³ X-Era AI Lab
⁴ Guangdong University of Technology
Vision–Language–Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce $\mathcal{E}_0$, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, $\mathcal{E}_0$ offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control—whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals—and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, $\mathcal{E}_0$ supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions—yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that $\mathcal{E}_0$ achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that $\mathcal{E}_0$ delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
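To make the spherical viewpoint perturbation concrete, below is a minimal NumPy sketch of one plausible implementation (our assumption; the paper's exact procedure may differ): the camera position is jittered in spherical coordinates (radius, elevation, azimuth) around the workspace center, and a look-at rotation is rebuilt so the perturbed camera still faces the scene, yielding augmented viewpoints without collecting new data.

```python
# Minimal sketch of spherical viewpoint perturbation (our assumption, not the
# authors' released code): jitter the camera pose on the viewing sphere around
# the workspace center, then recompute a look-at extrinsic.
import numpy as np

def perturb_viewpoint(cam_pos, center, max_deg=10.0, max_radius=0.05, rng=None):
    """Return a perturbed camera position and world-from-camera rotation."""
    if rng is None:
        rng = np.random.default_rng()
    offset = cam_pos - center
    r = np.linalg.norm(offset)
    elev = np.arcsin(offset[2] / r)            # elevation of the original pose
    azim = np.arctan2(offset[1], offset[0])    # azimuth of the original pose
    # Sample small perturbations on the viewing sphere (ranges are assumptions).
    r += rng.uniform(-max_radius, max_radius)
    elev += np.deg2rad(rng.uniform(-max_deg, max_deg))
    azim += np.deg2rad(rng.uniform(-max_deg, max_deg))
    new_pos = center + r * np.array([np.cos(elev) * np.cos(azim),
                                     np.cos(elev) * np.sin(azim),
                                     np.sin(elev)])
    # Look-at rotation: the forward axis points at the workspace center
    # (assumes the camera is not looking straight down the world z-axis).
    forward = (center - new_pos) / np.linalg.norm(center - new_pos)
    right = np.cross(forward, [0.0, 0.0, 1.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    return new_pos, np.stack([right, up, forward], axis=1)
```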
Fig.1 Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
Fig.2 Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches and recent mask-based discrete diffusion methods, which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion–based policies and AR–diffusion hybrids that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision–language backbones and supporting fine-grained action control.
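To make the continuized discrete diffusion recipe above concrete, here is a minimal training-side sketch in PyTorch. It is our own illustration under assumed details (uniform per-dimension bins, smoothed one-hot targets, a cosine noise schedule), not the released implementation; `denoiser` stands for any network that maps noisy tokens, timestep, and vision–language conditioning to per-bin logits. Training with cross-entropy is what makes the optimal denoiser the Bayes-optimal posterior over discrete actions mentioned in the abstract.

```python
# Minimal training-side sketch of continuized discrete diffusion over action
# tokens (our illustration; details such as the noise schedule are assumptions).
import torch
import torch.nn.functional as F

NUM_BINS = 2048   # per-dimension bin count; Fig. 3(a) reports saturation here
SMOOTH = 0.1      # one-hot smoothing factor (assumed form; cf. Fig. 3(d))

def quantize(actions, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer bin indices."""
    frac = (actions.clamp(low, high) - low) / (high - low)
    return (frac * (NUM_BINS - 1)).round().long()

def smoothed_one_hot(tokens):
    """Smoothed one-hot targets: SMOOTH of the mass is spread over all bins."""
    one_hot = F.one_hot(tokens, NUM_BINS).float()
    return one_hot * (1.0 - SMOOTH) + SMOOTH / NUM_BINS

def corrupt(x0, t):
    """Variance-preserving Gaussian corruption of the continuized tokens."""
    alpha = torch.cos(0.5 * torch.pi * t)          # assumed cosine schedule
    return alpha * x0 + (1.0 - alpha**2).sqrt() * torch.randn_like(x0)

def training_loss(denoiser, actions, cond):
    """Cross-entropy denoising loss; its minimizer is the Bayes-optimal
    posterior over discrete bins given the noisy tokens and conditioning."""
    tokens = quantize(actions)                     # (B, horizon, dim)
    x0 = smoothed_one_hot(tokens)                  # (B, horizon, dim, NUM_BINS)
    t = torch.rand(x0.shape[0], 1, 1, 1)           # random diffusion time
    logits = denoiser(corrupt(x0, t), t, cond)     # predict the clean bins
    return F.cross_entropy(logits.flatten(0, 2), tokens.flatten())
```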
Fig.3 Comprehensive ablation analysis of key hyperparameters in the LIBERO environments. We investigate four crucial design factors influencing our model: (a) Discretization bins—increasing bin granularity enhances precision up to 2048 bins, beyond which gains saturate; (b) Action horizon—a moderate prediction length balances reactivity and temporal consistency; (c) Action dimensions—embedding sizes slightly above the dataset’s action space yield the best expressiveness–robustness trade-off; and (d) One-hot smooth factor—moderate decay values smooth discrete logits, stabilizing diffusion and improving overall success rate.
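For completeness, a matching inference-side sketch (again an illustration under the same assumptions, not the paper's exact sampler): starting from Gaussian noise over the continuized token space, each step uses the denoiser's per-bin posterior as the clean-token estimate and re-noises it to the next timestep; a final argmax recovers bin indices, which are dequantized into an action chunk of length `horizon`, matching the bin-count and action-horizon knobs ablated above.

```python
# Matching inference-side sketch (assumed DDIM-style sampler, reusing the
# quantization constants and `denoiser` interface from the training sketch).
import torch

NUM_BINS = 2048  # must match the bin count used at training time

@torch.no_grad()
def sample_actions(denoiser, cond, horizon, act_dim, steps=10):
    """Iteratively denoise from Gaussian noise to a discrete action chunk."""
    x = torch.randn(1, horizon, act_dim, NUM_BINS)         # start from noise
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1, 1, 1, 1), i / steps)
        probs = denoiser(x, t, cond).softmax(dim=-1)       # posterior over bins
        x0_hat = probs                # expected clean token (smoothing ignored)
        alpha = torch.cos(0.5 * torch.pi * torch.full_like(t, (i - 1) / steps))
        x = alpha * x0_hat + (1 - alpha**2).sqrt() * torch.randn_like(x0_hat)
    tokens = x.argmax(dim=-1)                              # final bin indices
    return tokens.float() / (NUM_BINS - 1) * 2.0 - 1.0     # dequantize to [-1, 1]
```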
Comparison on the LIBERO benchmark. In the task "pick up the milk and place it in the basket", our $\mathcal{E}_0$ successfully grasped the object with the correct posture and completed the task without knocking it over.
Comparison on the ManiSkill benchmark. In the highly dexterous PegInsertionSide task, our $\mathcal{E}_0$ achieved the best performance: it precisely aligned the peg with the hole and inserted it successfully, while other models performed poorly.
Comparison on the VLABench benchmark. In the task "pick up the spade 3", our $\mathcal{E}_0$ correctly identified and precisely grasped the target card, showing superior multimodal reasoning and control.
RoboTwin benchmark overview. The benchmark consists of 27 single-arm and 23 dual-arm manipulation tasks, spanning a diverse range of objects and actions. Several dual-arm tasks additionally require coordinated bimanual control, posing higher challenges for policy learning.
Performance on real-world robotic experiments. Short-horizon tasks (pick block, press button, close door, pull drawer, stack block).
Performance on real-world robotic experiments. Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).
Comparison of keyframes in the real-world pick twice task under unseen scenarios. In most cases, the model successfully completed the task; in a few cases, task quality was slightly compromised but still acceptable, for example placement deviations caused by an oversized green dish, or a change in picking order due to color variations.
Comparison of keyframes in the real-world pick twice task with and without human intervention. We introduce a human perturbation by manually shifting the target cube to disrupt the model's original plan. After the interruption, the model promptly adapts and replans its actions.
Qualitative results on the real-world Pick Vegetables Twice task under randomly arranged tabletop scenes. Each video shows one complete execution sequence consisting of two consecutive pick-and-place operations (carrot and cucumber, potato and eggplant, pepper and corn). Across all settings, the surrounding distractor objects are placed in completely random configurations, demonstrating the robustness of our policy under visually diverse and cluttered real-world environments.
If you have any questions, feel free to send us a message and we'll get back to you as soon as possible!
Email: zhanzhh6@mail2.sysu.edu.cn