On February 12, 2026, Xiaomi officially released its first open-source robotic VLA (Vision-Language-Action) model, named “Xiaomi-Robotics-0”. With 4.7 billion parameters, the model combines visual-language understanding with high-performance real-time execution, setting new state-of-the-art (SOTA) results across multiple benchmarks.
Here are the key technical features and capabilities of the model:
Architecture: Brain and Cerebellum Collaboration
To balance general understanding with precise control, Xiaomi-Robotics-0 utilizes a “Mixture-of-Transformers” (MoT) architecture.
- Visual-Language Brain (VLM): Built on a multimodal VLM base, it is responsible for understanding vague human commands (e.g., “please fold the towel”) and capturing spatial relationships from high-definition visual inputs.
- Action Execution Cerebellum (Action Expert): To generate high-frequency, smooth movements, a multi-layer Diffusion Transformer (DiT) is embedded. Instead of outputting a single action, it generates an “action chunk” and uses flow matching to keep the trajectory precise (a minimal sketch of this wiring follows the list).
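The release does not spell out the exact layer configuration here, so the following is only a minimal sketch of how a VLM backbone and a flow-matching DiT action expert could be wired together. All module names, dimensions, and the chunk length (`ToyVLABackbone`, `ActionExpert`, 16-step chunks, 14-dimensional actions) are illustrative assumptions, not Xiaomi's actual implementation.

```python
# Illustrative sketch only: a toy VLM "brain" feeding a DiT-style "cerebellum"
# that denoises an action chunk via flow matching. Not Xiaomi's code.
import torch
import torch.nn as nn

class ToyVLABackbone(nn.Module):
    """Stand-in for the multimodal VLM: a plain transformer encoder over
    already-embedded vision/language tokens. A real system would use a
    pretrained VLM here."""
    def __init__(self, d_model=512, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens):             # tokens: (B, S, d_model)
        return self.encoder(tokens)

class ActionExpert(nn.Module):
    """DiT-style head: predicts the flow-matching velocity for a noisy action
    chunk, cross-attending to the VLM's features for conditioning."""
    def __init__(self, d_model=512, chunk_len=16, action_dim=14, n_layers=4):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.action_in = nn.Linear(action_dim, d_model)
        self.time_in = nn.Linear(1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, noisy_chunk, t, vlm_features):
        # noisy_chunk: (B, chunk_len, action_dim); t: (B, 1); vlm_features: (B, S, d_model)
        x = self.action_in(noisy_chunk) + self.time_in(t).unsqueeze(1)
        x = self.decoder(tgt=x, memory=vlm_features)    # condition on the "brain"
        return self.action_out(x)                       # predicted velocity field

if __name__ == "__main__":
    vlm, expert = ToyVLABackbone(), ActionExpert()
    feats = vlm(torch.randn(2, 64, 512))                # fused vision+language tokens
    # A few Euler steps of flow-matching sampling: start from noise and follow
    # the predicted velocity toward a clean 16-step action chunk.
    chunk = torch.randn(2, expert.chunk_len, expert.action_dim)
    for step in range(10):
        t = torch.full((2, 1), step / 10.0)
        chunk = chunk + expert(chunk, t, feats) * (1.0 / 10.0)
    print(chunk.shape)                                  # torch.Size([2, 16, 14])
```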
Training Strategy: Preventing “Dumbing Down”
Many VLA models tend to lose their general understanding capabilities while learning actions. Xiaomi addresses this with a hybrid training method combining multimodal data with action data:
- VLM Synergistic Training: An “Action Proposal” mechanism forces the VLM to predict action distributions while understanding images, aligning the VLM’s feature space with the action space.
- DiT Specialized Training: The VLM is frozen, and the DiT is trained to recover precise action sequences from noise, relying entirely on the VLM’s KV features for conditioning (a training-loop sketch follows this list).
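As a rough illustration of that second stage, here is a hedged sketch assuming the toy `ActionExpert` from the earlier snippet and a simple linear-interpolation flow-matching objective; the actual loss, weighting, and noise schedule are not specified in the summary above.

```python
# Hedged sketch of the DiT-specialized stage: the VLM is frozen and only the
# action expert learns to recover clean action chunks from noise.
import torch
import torch.nn.functional as F

def flow_matching_loss(expert, vlm_features, clean_chunk):
    """Linear-path flow matching: interpolate noise -> data and regress the
    constant target velocity (data - noise)."""
    noise = torch.randn_like(clean_chunk)                # x0 ~ N(0, I)
    t = torch.rand(clean_chunk.size(0), 1)               # random time in [0, 1]
    x_t = (1 - t).unsqueeze(-1) * noise + t.unsqueeze(-1) * clean_chunk
    target_v = clean_chunk - noise
    pred_v = expert(x_t, t, vlm_features)
    return F.mse_loss(pred_v, target_v)

# Stage 2: freeze the VLM so its features only serve as conditioning.
# for p in vlm.parameters():
#     p.requires_grad_(False)
# loss = flow_matching_loss(expert, vlm(tokens).detach(), gt_chunk)
# loss.backward(); optimizer.step()
```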
Real-Time and Fluid Movements
To solve the “action stuttering” caused by inference latency on real robots, the team introduced several techniques:
- Asynchronous Inference: Decouples the model’s reasoning from the robot’s execution loop, so planning and acting run in parallel for smoother operation.
- Clean Action Prefix: Feeds previously predicted actions back in as a prefix for the next prediction, ensuring trajectory continuity and reducing jitter (a rough sketch of these two mechanisms follows the list).
- Λ-shape Attention Mask: A special attention mask forces the model to focus on current visual feedback rather than historical inertia, making the robot highly responsive to sudden environmental changes (an illustrative mask construction is also shown below).
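Below is a rough, hedged sketch of the first two ideas (asynchronous inference plus a clean action prefix). `predict_chunk` and `read_camera` are hypothetical placeholders for the model call and the robot's sensor API, and the buffer size, prefix length, and control rate are arbitrary choices for illustration.

```python
# Illustrative sketch (not Xiaomi's code): a planner thread keeps refilling a
# small buffer of action chunks while the control loop executes at a fixed
# rate, and each new prediction is seeded with the tail of already-committed
# actions (the "clean action prefix") so the trajectory stays continuous.
import queue
import threading
import time

action_queue = queue.Queue(maxsize=2)    # pending action chunks
last_actions = []                        # committed actions, reused as the clean prefix

def predict_chunk(observation, prefix):  # hypothetical stand-in for the VLA forward pass
    return [[0.0] * 14 for _ in range(16)]

def read_camera():                       # hypothetical stand-in for the robot's sensor API
    return None

def planner():
    """Runs model inference in the background, independent of the control rate."""
    while True:
        obs = read_camera()
        prefix = last_actions[-4:]                    # clean action prefix
        action_queue.put(predict_chunk(obs, prefix))  # blocks while the buffer is full

def control_loop(rate_hz=50):
    """Consumes actions at a fixed rate; never waits for inference mid-chunk."""
    chunk, i = action_queue.get(), 0
    while True:
        if i == len(chunk):                           # chunk exhausted: swap in the next one
            chunk, i = action_queue.get(), 0
        last_actions.append(chunk[i])                 # here you would send chunk[i] to the robot
        i += 1
        time.sleep(1.0 / rate_hz)

threading.Thread(target=planner, daemon=True).start()
# control_loop()   # would run the real-time executor on the main thread
```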
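The exact Λ-shape mask used in the release is not reproduced here; the snippet below shows one common way such a mask is built (a small block of always-visible tokens, which one could take to be the current observation tokens, plus a causal local window), purely to illustrate the shape.

```python
# One common construction of a "Λ-shaped" boolean attention mask, shown only
# as an illustration: a small block of tokens at the front of the sequence
# stays visible to every query, plus a causal local window, while older
# history is masked out. The release's actual mask may differ.
import torch

def lambda_shape_mask(seq_len, n_global=8, window=32):
    """Returns a (seq_len, seq_len) boolean mask where True = attention allowed."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, :n_global] = True                    # always-visible block (the "Λ" stem)
    for q in range(seq_len):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True                 # causal local window (the "Λ" diagonal)
    return mask

m = lambda_shape_mask(seq_len=128)
print(m.shape, m.float().mean().item())          # fraction of query-key pairs left visible
```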
Performance & Availability
- Benchmark Leader: The model achieved top results among 30 models in simulation benchmarks like LIBERO, CALVIN, and SimplerEnv.
- Real-World Challenges: In tests with dual-arm robots, it demonstrated superior hand-eye coordination in long-horizon tasks like dismantling blocks and folding soft towels.
- Hardware Compatibility: It supports real-time inference on consumer-grade graphics cards.
Xiaomi has made the project page, source code, and model weights available to the public:
- Project Page: https://xiaomi-robotics-0.github.io
- Code: https://github.com/XiaomiRobotics/Xiaomi-Robotics-0
- Model Weights: https://huggingface.co/XiaomiRobotics

Emir Bardakçı



