Want a Xiaomi robot? Build one yourself with the open-source Xiaomi-Robotics-0

On February 12, 2026, Xiaomi officially released its first open-source robotics VLA (Vision-Language-Action) model, named “Xiaomi-Robotics-0”. With 4.7 billion parameters, the model combines vision-language understanding with high-performance, real-time execution, setting new SOTA (state-of-the-art) records across multiple benchmarks.

Here are the key technical features and capabilities of the model:

Architecture: Brain and Cerebellum Collaboration

To balance general understanding with precise control, Xiaomi-Robotics-0 utilizes a “Mixture-of-Transformers” (MoT) architecture.

  • Vision-Language Brain (VLM): Built on a multimodal VLM base, it is responsible for interpreting vague human commands (e.g., “please fold the towel”) and capturing spatial relationships from high-definition visual inputs.

  • Action Execution Cerebellum (Action Expert): To generate high-frequency, smooth movements, a multi-layer Diffusion Transformer (DiT) is embedded. Instead of outputting a single action, it generates an “Action Chunk” (a short sequence of future actions) and ensures precision using flow matching (see the sketch after this list).
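To make the flow-matching idea concrete, below is a minimal PyTorch sketch of an action head that integrates a learned velocity field from pure noise to a full action chunk. The class name, layer sizes, and the simple Euler sampler are illustrative assumptions, not the released architecture.

```python
# Minimal flow-matching action head (illustrative sketch, not Xiaomi's code).
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    def __init__(self, action_dim=14, chunk_size=50, cond_dim=1024, hidden=512):
        super().__init__()
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        # Stand-in for the multi-layer DiT: a small MLP predicting the
        # flow-matching velocity field v(x_t, t | condition).
        self.net = nn.Sequential(
            nn.Linear(chunk_size * action_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, chunk_size * action_dim),
        )

    def velocity(self, x_t, t, cond):
        # x_t: (B, chunk, dim), t: (B, 1), cond: (B, cond_dim)
        inp = torch.cat([x_t.flatten(1), cond, t], dim=-1)
        return self.net(inp)

    @torch.no_grad()
    def sample_chunk(self, cond, steps=10):
        # Integrate dx/dt = v(x, t) from pure noise (t=0) to a clean
        # action chunk (t=1) with simple Euler steps.
        b = cond.shape[0]
        x = torch.randn(b, self.chunk_size, self.action_dim)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((b, 1), i * dt)
            x = x + dt * self.velocity(x, t, cond).view_as(x)
        return x

# expert = ActionExpert(); chunk = expert.sample_chunk(torch.randn(2, 1024))  # (2, 50, 14)
```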

Training Strategy: Preventing “Dumbing Down”

Many VLA models tend to lose their general understanding capabilities while learning actions. Xiaomi addresses this with a hybrid training method combining multimodal data with action data:

  • VLM Synergistic Training: An “Action Proposal” mechanism forces the VLM to predict action distributions while understanding images, aligning the VLM’s feature space with the action space.

  • DiT Specialized Training: The VLM is frozen, and the DiT is trained to recover precise action sequences from noise, conditioned entirely on the VLM’s KV features (both stages are sketched below).
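A condensed sketch of what these two stages could look like in PyTorch follows; `vlm`, `action_expert`, the output attributes, and the loss weighting are all placeholders, with Xiaomi’s released code being the authoritative reference.

```python
# Sketch of the two-stage recipe described above (illustrative only).
import torch
import torch.nn.functional as F

def stage1_vlm_step(vlm, batch):
    # Hybrid training: the usual next-token loss on multimodal data, plus
    # an "action proposal" head that predicts action distributions, pulling
    # the VLM's feature space toward the action space.
    out = vlm(batch["images"], batch["text"])
    lm_loss = F.cross_entropy(out.logits.flatten(0, 1), batch["labels"].flatten())
    proposal_loss = F.mse_loss(out.action_proposal, batch["actions"])
    return lm_loss + 0.5 * proposal_loss  # hypothetical weighting

def stage2_dit_step(vlm, action_expert, batch):
    # The VLM is frozen; only the DiT learns, conditioned on the VLM's
    # KV features, to recover clean actions via flow matching.
    with torch.no_grad():
        cond = vlm(batch["images"], batch["text"]).kv_features
    actions = batch["actions"]                  # (B, chunk, dim)
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)
    x_t = (1 - t) * noise + t * actions         # linear interpolation path
    target_v = actions - noise                  # flow-matching velocity target
    pred_v = action_expert(x_t, t, cond)
    return F.mse_loss(pred_v, target_v)
```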

Real-Time and Fluid Movements

To solve the “action stuttering” caused by inference latency on real robots, the team introduced three techniques:

  • Asynchronous Inference: Decouples the model’s reasoning from the robot’s execution so the two run in parallel, keeping motion smooth even while a new inference step is still in flight (sketched after this list).

  • Clean Action Prefix: Feeds the previously predicted actions back in as a prefix for the next prediction, ensuring trajectory continuity and reducing jitter.

  • Λ-shaped Attention Mask: A special attention mask forces the model to focus on current visual feedback rather than historical inertia, making the robot highly responsive to sudden environmental changes (see the second sketch below).
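To show how the first two ideas fit together, here is a minimal Python sketch of an asynchronous inference loop that conditions each new chunk on a clean prefix of the previous one. The `policy`, `camera`, and `robot` interfaces, the prefix length, and the control rate are illustrative assumptions rather than Xiaomi’s released code.

```python
# Sketch: asynchronous inference + clean action prefix.
# `policy.predict_chunk`, `camera.latest_frame`, and `robot.apply`
# are hypothetical interfaces used for illustration only.
import threading
import time
import queue

action_queue = queue.Queue(maxsize=100)

def inference_loop(policy, camera):
    prefix = None  # tail of the last chunk, reused as a "clean action prefix"
    while True:
        obs = camera.latest_frame()
        # Conditioning on previous actions keeps consecutive chunks
        # continuous instead of jumping between predictions.
        chunk = policy.predict_chunk(obs, action_prefix=prefix)
        prefix = chunk[-5:]
        for action in chunk:
            action_queue.put(action)

def control_loop(robot, hz=50):
    # The robot drains the queue at a fixed rate, decoupled from how
    # long each model inference takes, so execution never stutters.
    period = 1.0 / hz
    while True:
        robot.apply(action_queue.get())
        time.sleep(period)

# threading.Thread(target=inference_loop, args=(policy, camera), daemon=True).start()
# control_loop(robot)
```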
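And here is one plausible construction of a Λ-shaped mask: every action token always attends to the current observation tokens at the front of the sequence (the vertical stroke of the Λ) plus a short causal window of recent tokens (the diagonal stroke), cutting off older history. The token counts and window size are assumptions for illustration.

```python
# Sketch of a Λ-shaped attention mask (sizes are illustrative).
import torch

def lambda_mask(seq_len, obs_len, window):
    # True = attention allowed.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for q in range(seq_len):
        mask[q, :obs_len] = True              # always see fresh visual feedback
        lo = max(obs_len, q - window + 1)
        mask[q, lo:q + 1] = True              # short causal window, no old history
    return mask

# Example: 8 observation tokens, 24 action tokens, window of 4.
m = lambda_mask(seq_len=32, obs_len=8, window=4)
```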

Performance & Availability

  • Benchmark Leader: The model achieved top results among 30 models in simulation benchmarks like LIBERO, CALVIN, and SimplerEnv.

  • Real-World Challenges: In tests with dual-arm robots, it demonstrated superior hand-eye coordination in long-horizon tasks like dismantling blocks and folding soft towels.

  • Hardware Compatibility: It supports real-time inference on consumer-grade graphics cards.

Xiaomi has made the project page, source code, and model weights available to the public.
