Flash-WAM: Modality-Aware Distillation for World Action Models

¹Northeastern University, ²University of Georgia, ³EmbodyX Inc.
[Figure: Flash-WAM per-chunk inference latency comparison]

Flash-WAM compresses 25 video and 50 action denoising steps to just 1 each, achieving a 23× speedup and enabling real-time robot control on a single NVIDIA L40S GPU.

Abstract

World-action models (WAMs) jointly generate future video and robot actions via iterative diffusion, but require tens of denoising steps per chunk, precluding real-time control. Off-the-shelf step distillation fails here because the video and action streams use asymmetric noise schedules, causing the gradient signal for actions to vanish under standard consistency distillation.

We introduce Flash-WAM, a modality-aware distillation framework that assigns each stream its own consistency function matched to its noise regime. Instantiated on LingBot-VA, Flash-WAM compresses inference from 25 video and 50 action steps down to 1 each, a 23× speedup (8.1 s → 348 ms) that enables real-time inference on a single NVIDIA L40S. Flash-WAM achieves 85.5% on RoboTwin 2.0 and 95.7% on LIBERO, while naive joint distillation drops to 24% on RoboTwin 2.0.
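The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only the step counts and latencies quoted above; the variable names are illustrative and not taken from any released code.

```python
# Step counts and per-chunk latencies quoted in the abstract.
TEACHER_STEPS = {"video": 25, "action": 50}
STUDENT_STEPS = {"video": 1, "action": 1}
TEACHER_LATENCY_S = 8.1
STUDENT_LATENCY_S = 0.348

# How many times fewer denoising steps the student runs per chunk.
step_reduction = sum(TEACHER_STEPS.values()) / sum(STUDENT_STEPS.values())

# Wall-clock speedup per chunk.
latency_speedup = TEACHER_LATENCY_S / STUDENT_LATENCY_S

# Chunks the student can produce per second at the quoted latency.
chunks_per_second = 1.0 / STUDENT_LATENCY_S

print(f"step reduction:    {step_reduction:.1f}x")
print(f"latency speedup:   {latency_speedup:.1f}x")
print(f"chunks per second: {chunks_per_second:.2f}")
```

The step count falls by more than the latency does, which would be consistent with part of the per-chunk cost not scaling with the number of denoising steps. That reading is an inference from the quoted figures, not a claim made in the text.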

Method Overview

Left: Naive consistency distillation fails on joint video-action models because the two modalities operate in structurally different noise regimes. The gradient signal for actions vanishes quadratically under standard LCM parametrization.

Middle: Flash-WAM assigns each modality its own consistency function: a variance-preserving (Karras) parametrization for the high-noise video stream and a linear-gradient-scaling parametrization for the low-noise action stream.

Right: The distilled student autoregressively generates video and actions with a single denoising step per modality.

[Figure: Flash-WAM method overview]
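To make the "vanishes quadratically" versus "linear scaling" contrast in the caption concrete, here is a minimal sketch of a consistency function written as f(x, t) = c_skip(t)·x + c_out(t)·F(x, t), where F is the network output. The two coefficient schedules below (`quadratic_coeffs`, `linear_coeffs`) and the constants `sigma_data` and `t_max` are hypothetical choices made for illustration only. They are not the paper's Karras or action parametrizations, which are not given on this page.

```python
def quadratic_coeffs(t, sigma_data=0.5):
    """Hypothetical schedule: the network weight c_out shrinks like t**2 near t = 0."""
    c_out = t**2 / (t**2 + sigma_data**2)
    return 1.0 - c_out, c_out


def linear_coeffs(t, t_max=1.0):
    """Hypothetical schedule: the network weight c_out shrinks like t near t = 0."""
    c_out = t / t_max
    return 1.0 - c_out, c_out


def consistency_fn(x_t, t, network_out, coeffs):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t); returns x unchanged at t = 0."""
    c_skip, c_out = coeffs(t)
    return c_skip * x_t + c_out * network_out


# The gradient of f with respect to the network parameters is scaled by c_out(t),
# so c_out indicates how much training signal survives at a given noise level.
for t in (0.1, 0.01, 0.001):
    _, c_out_quadratic = quadratic_coeffs(t)
    _, c_out_linear = linear_coeffs(t)
    print(f"t={t}: quadratic c_out={c_out_quadratic:.2e}, linear c_out={c_out_linear:.2e}")
```

Both schedules satisfy the boundary condition f(x, 0) = x. Under the quadratic schedule, shrinking t tenfold shrinks the network's weight roughly a hundredfold, while under the linear schedule it shrinks only tenfold. This is one way to read why a low-noise stream could benefit from a different parametrization than a high-noise one.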

Results

RoboTwin 2.0

Success rates on RoboTwin 2.0 simulation (Clean and Randomized splits, 50 tasks) and speedup over the LingBot-VA teacher.

Method                          Nv   Na   Clean   Rand.   Average   Speedup
π0                              -    -    65.92   58.40   62.2
π0.5                            -    -    82.74   76.76   79.8
X-VLA                           -    -    72.9    72.8    72.8
Motus                           -    -    88.66   87.02   87.8
LingBot-VA*                     25   50   91.64   90.86   91.25     1.0×
LingBot-VA + DMD2               1    2    85.08   72.36   78.74     19.0×
LingBot-VA + Video-only LCM     1    2    80.66   76.92   78.79
LingBot-VA + Naive Joint LCM    1    2    25.88   22.07   23.97
Ours                            1    2    88.42   82.66   85.54
LingBot-VA + DMD2               1    1    52.66   48.46   50.56     23.3×
LingBot-VA + Video-only LCM     1    1    77.90   69.46   73.68
LingBot-VA + Naive Joint LCM    1    1    39.68   32.96   36.32
Ours                            1    1    82.56   80.26   81.41
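One way to read the table is to ask how much of the teacher's average success rate each Flash-WAM configuration retains. The sketch below copies the Clean and Rand. values for the teacher and the two "Ours" rows. The `retention` helper is an illustrative metric, not one reported by the authors.

```python
# (Clean, Rand.) success rates copied from the RoboTwin 2.0 table.
TEACHER = (91.64, 90.86)  # LingBot-VA, Nv=25, Na=50
STUDENTS = {
    "Ours (Nv=1, Na=2)": (88.42, 82.66),
    "Ours (Nv=1, Na=1)": (82.56, 80.26),
}


def average(clean, rand):
    """Mean of the Clean and Randomized success rates."""
    return (clean + rand) / 2


def retention(student, teacher):
    """Fraction of the teacher's average success rate kept by the student."""
    return average(*student) / average(*teacher)


for name, rates in STUDENTS.items():
    print(
        f"{name}: average={average(*rates):.2f}, "
        f"retains {retention(rates, TEACHER):.1%} of the teacher"
    )
```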

Real-World Experiments

We evaluate Flash-WAM on a Unitree G1 humanoid robot equipped with Dex1-1 grippers across three manipulation tasks. Each task is tested 10 times per configuration. The baseline uses the original LingBot-VA setting with 3 video and 10 action denoising steps. Reducing steps without distillation causes significant degradation, while Flash-WAM recovers most of the baseline performance at drastically reduced inference cost.

Method                         Nv / Na   T1    T2    T3    Average
LingBot-VA                     3 / 10    50%   70%   80%   66.7%
LingBot-VA (reduced NFE)       1 / 2     30%   30%   60%   40.0%
LingBot-VA + Video-only LCM    1 / 2     30%   50%   50%   43.3%
Flash-WAM                      1 / 2     50%   60%   70%   60.0%
LingBot-VA (reduced NFE)       1 / 1     10%   30%   30%   23.3%
LingBot-VA + Video-only LCM    1 / 1     20%   40%   40%   33.3%
Flash-WAM                      1 / 1     40%   50%   60%   50.0%
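Because each task is run 10 times per configuration, every percentage in the table corresponds to a whole number of successful trials. The sketch below converts three of the rows back to trial counts and recomputes the Average column. The dictionary keys and helper names are illustrative.

```python
TRIALS_PER_TASK = 10  # stated in the text: each task is tested 10 times

# Per-task success rates (%) for T1, T2, T3, copied from the table.
RATES = {
    "LingBot-VA 3/10": (50, 70, 80),
    "LingBot-VA (reduced NFE) 1/1": (10, 30, 30),
    "Flash-WAM 1/1": (40, 50, 60),
}


def successes(rates, trials=TRIALS_PER_TASK):
    """Total successful trials across tasks, given per-task rates in percent."""
    return sum(rate * trials // 100 for rate in rates)


def average_rate(rates):
    """Unweighted mean of the per-task success rates, in percent."""
    return sum(rates) / len(rates)


for name, rates in RATES.items():
    total_trials = TRIALS_PER_TASK * len(rates)
    print(
        f"{name}: {successes(rates)}/{total_trials} trials, "
        f"average {average_rate(rates):.1f}%"
    )
```

With 30 trials per configuration, a single trial moves a task's rate by 10 percentage points, so small gaps between rows correspond to only a few rollouts.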

Real-World Demos

End-to-end rollouts on the Unitree G1. The baseline (3v/10a) succeeds but is too slow for real-time control. Reducing steps to 1v/1a without distillation causes failure. Flash-WAM (1v/1a) recovers successful behavior at real-time speed.

LingBot-VA (3v/10a)

No Distillation (1v/1a)

Flash-WAM (1v/1a)

Qualitative Comparison

Video predictions from an open-loop autoregressive rollout on RoboTwin "pick diverse bottles". Both off-the-shelf distillation baselines (naive joint LCM and DMD2) visibly degrade: objects disappear or become blurred. Flash-WAM preserves scene structure and object identity throughout the rollout.

[Figure: Qualitative comparison on RoboTwin "pick diverse bottles" task]

BibTeX

@article{akbari2025flashwam,
  title     = {Flash-WAM: Modality-Aware Distillation for World Action Models},
  author    = {Akbari, Arman and Zhang, Ci and Akbari, Arash and Zhao, Lin and Chen, Yixiao and Chen, Weiwei and Zhang, Xuan and Yuan, Geng and Wang, Yanzhi},
  journal   = {arXiv preprint arXiv:2507.00780},
  year      = {2025},
}