Xiaomi Founder Lei Jun has officially announced a significant milestone for the company’s AI laboratory: multiple research papers from the Xiaomi team have been accepted to ICLR 2026 (International Conference on Learning Representations), one of the world’s premier artificial intelligence conferences.
The selected works cover critical frontiers in modern AI, including multimodal reasoning, reinforcement learning (RL), GUI agents, audio generation, and, perhaps most notably, end-to-end autonomous driving.
Spotlight Research: DIPOLE (Dichotomous Diffusion Policy Optimization)
Among the accepted works, the paper titled “Dichotomous Diffusion Policy Optimization” stands out for its direct application to autonomous driving systems and large-scale decision-making models.
The Challenge: Stability vs. Complexity
Diffusion-based policies are widely used for generative tasks because of their high expressive power and controllability. However, applying them to reinforcement learning (RL) for decision-making creates a bottleneck:
- Direct optimization often leads to training instability.
- Gaussian approximations are computationally expensive because they require a large number of denoising steps, making them impractical for real-time applications like autonomous driving.
The Solution: The DIPOLE Algorithm
The Xiaomi research team (led by co-first authors Liang Ruiming, Zheng Yinan, et al.) proposes DIPOLE (Dichotomous Diffusion Policy Improvement), the algorithm introduced in the paper.
- Core Logic: The algorithm re-examines the KL-regularized RL objective. Rather than optimizing the diffusion policy directly, it introduces a “greedy policy regularization.”
- Binary Decomposition: It decomposes the optimal policy into a pair of opposing (“dichotomous”) policies: one that maximizes reward and one that minimizes it.
- Inference Control: During deployment, the system generates actions by linearly combining the scores of these two opposing policies (in diffusion models, the score is the gradient of the log-density that the network learns to estimate). Adjusting the combination weight lets the system tune how “greedy” (aggressive) or conservative its decisions should be.
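To make the inference-control idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the two policies are replaced by hypothetical 1-D Gaussians whose scores are known in closed form, and the combination rule (1 + w) * s_pos - w * s_neg is an assumption borrowed from the form of classifier-free guidance, since the article does not give the exact coefficients. The names gaussian_score, combined_score, find_mode, and the weight w are all illustrative.

```python
def gaussian_score(a, mean, std=1.0):
    """Score of a 1-D Gaussian: the derivative of its log-density w.r.t. a."""
    return -(a - mean) / (std ** 2)


def combined_score(a, w, mean_pos=1.0, mean_neg=-1.0):
    """Linearly combine the scores of two opposing toy policies.

    Assumed rule (not taken from the paper): (1 + w) * s_pos - w * s_neg.
    w = 0 recovers the reward-maximizing policy alone; a larger w pushes
    actions further away from the reward-minimizing policy.
    """
    s_pos = gaussian_score(a, mean_pos)  # stand-in for the reward-maximizing policy
    s_neg = gaussian_score(a, mean_neg)  # stand-in for the reward-minimizing policy
    return (1.0 + w) * s_pos - w * s_neg


def find_mode(w, steps=500, step_size=0.1, start=0.0):
    """Follow the combined score uphill to locate the most likely action."""
    a = start
    for _ in range(steps):
        a += step_size * combined_score(a, w)
    return a


if __name__ == "__main__":
    for w in (0.0, 0.5, 1.0):
        print(f"w={w}: preferred action ~ {find_mode(w):.4f}")
```

In this toy setup the preferred action moves from about 1.0 (w = 0) to about 3.0 (w = 1), that is, further from the reward-minimizing policy's mean at -1.0. This mirrors the "greediness" knob described above, though a real diffusion policy would apply the combined score inside its denoising loop rather than in plain gradient ascent.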
Validation & Impact
The DIPOLE algorithm isn’t just theoretical. The paper validates its performance in three areas:
- General RL: Significant improvements on standard benchmarks like ExORL and OGBench.
- Scalability: Successfully validated on VLA (Vision-Language-Action) models with parameter scales reaching 1 billion, indicating that it works on large-scale foundation models.
- Autonomous Driving: The algorithm demonstrated superior performance in NAVSIM, a real-world autonomous driving benchmark, which suggests potential improvements for Xiaomi’s future Pilot systems.
Source: Lei Jun Weibo

Emir Bardakçı