Xiaomi MiMo hits 1,200 tokens per second on commodity GPUs

iEXExchanger

Xiaomi and TileRT have hit 1,000-plus tokens per second on a trillion-parameter model using standard commodity GPUs, not custom chips. The technique: FP4 quantization and DFlash block decoding. Weights are open-sourced.

Xiaomi and inference partner TileRT have broken the 1,000 tokens-per-second barrier on a trillion-parameter model, using eight standard commodity GPUs and no custom silicon. Peak demonstrations showed roughly 1,200 tokens per second. For comparison, GPT-5.5 delivers around 68 tokens per second and Claude Opus 4.6 about 71. The gap is roughly 15 to 17 times.
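The size of that gap can be checked directly from the figures quoted above. The short sketch below is only arithmetic on the article's own numbers; the function and variable names are illustrative.

```python
# Throughput figures quoted in the article, in tokens per second.
mimo_low, mimo_peak = 1000, 1200
baselines = {"GPT-5.5": 68, "Claude Opus 4.6": 71}


def speedup(fast: float, slow: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


# For each baseline: (ratio at 1,000 tok/s, ratio at 1,200 tok/s).
ratios = {
    name: (speedup(mimo_low, tps), speedup(mimo_peak, tps))
    for name, tps in baselines.items()
}

for name, (low, peak) in ratios.items():
    print(f"{name}: {low:.1f}x at 1,000 tok/s, {peak:.1f}x at 1,200 tok/s")
```

Depending on which pair of numbers is compared, the ratio lands between about 14 and just under 18, so the 15 to 17 range is a fair rounding of the middle of that spread.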

Two techniques stack to produce the result. First: FP4 quantization of the model's expert layers, which compresses the weights in the mixture-of-experts components to 4-bit precision while the rest of the model stays at full precision. The reported result is near-zero quality loss and a dramatic reduction in memory bandwidth pressure. Second: DFlash speculative decoding, which fills an entire block of token positions in a single forward pass instead of generating one token at a time, then verifies the whole block at once. In coding tasks, an average of 6.3 out of every 8 proposed tokens are accepted. The TileRT inference engine keeps the entire compute graph resident on the GPU, eliminating per-operator launch overhead.
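To make the first technique concrete, here is a minimal sketch of 4-bit floating-point quantization in plain Python. It rests on assumptions the article does not confirm: that the format is E2M1 (one sign bit, two exponent bits, one mantissa bit) and that each small block of weights shares one scale factor. The function names are hypothetical, and this is not Xiaomi's or TileRT's implementation.

```python
# Magnitudes representable in the E2M1 4-bit float format.
# Assumption: the article does not say which FP4 variant or scaling
# scheme the released checkpoint actually uses.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights: list[float]) -> tuple[list[float], float]:
    """Snap one block of weights to the FP4 grid using a shared scale.

    Returns the grid values (each storable in 4 bits) and the scale
    needed to reconstruct approximate weights.
    """
    absmax = max(abs(w) for w in weights)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = absmax / E2M1_MAGNITUDES[-1] if absmax > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if w >= 0 else -nearest)
    return codes, scale


def dequantize_block(codes: list[float], scale: float) -> list[float]:
    """Reconstruct approximate weights from grid values and the scale."""
    return [c * scale for c in codes]


block = [0.12, -0.03, 0.60, -0.31]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes)
print(restored)
```

For this toy block the weights snap to the grid values 1.0, -0.5, 6.0 and -3.0, so each one needs only four bits plus a share of the block's scale. The rounding error visible in `restored` is the quality cost that the reported results describe as near zero.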

In practice, that means dozens of simultaneous inferences without perceptible delay, real-time code generation, and practical use wherever latency costs money: trading, clinical decision support, agentic pipelines. A limited API trial runs through June 23 at three times standard MiMo pricing; the underlying MiMo-V2.5-Pro-FP4-DFlash checkpoint is already open-sourced on Hugging Face.

The more interesting point here isn't the number itself but the hardware behind it. Groq built a custom chip. Cerebras etched a wafer-scale processor. Xiaomi hit 1,200 tokens per second with hardware you can rack today. If algorithmic improvements on commodity silicon can close the gap this fast, the premise that you need bespoke accelerators to win the inference race looks shakier than it did last week.

Questions and answers

Frequently asked questions about this article

What is Xiaomi MiMo-V2.5-Pro-UltraSpeed?

It's a speed-optimized version of Xiaomi's trillion-parameter MiMo language model, achieving over 1,000 tokens per second on a standard eight-GPU commodity node.

How does FP4 quantization work?

FP4 quantization compresses the expert-layer weights to 4-bit precision, sharply reducing memory bandwidth demand, while other model components stay at full precision. The reported quality loss is near zero.
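A rough back-of-the-envelope calculation shows why this matters for memory. The sketch below assumes that "full precision" means 16 bits per weight and that 95% of the parameters sit in expert layers; neither figure comes from the article, and both are placeholders chosen only for illustration. It also ignores the small overhead of per-block scale factors.

```python
def weight_bytes(total_params: float, expert_fraction: float,
                 expert_bits: int, other_bits: int) -> float:
    """Total weight storage in bytes for a model with two precision tiers."""
    expert = total_params * expert_fraction * expert_bits / 8
    other = total_params * (1 - expert_fraction) * other_bits / 8
    return expert + other


TOTAL_PARAMS = 1e12      # "trillion-parameter", per the article
EXPERT_FRACTION = 0.95   # hypothetical: share of weights in expert layers
OTHER_BITS = 16          # assumption: "full precision" means 16-bit

baseline = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, 16, OTHER_BITS)
mixed = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, 4, OTHER_BITS)

print(f"16-bit everywhere: {baseline / 1e12:.2f} TB")
print(f"FP4 expert layers: {mixed / 1e12:.2f} TB")
```

Under these assumed numbers the weights shrink from about 2 TB to a little under 0.6 TB. Fewer bytes per weight also means fewer bytes to move from GPU memory for every generated token, which is where the bandwidth saving comes from.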

What is DFlash speculative decoding?

Instead of generating tokens one at a time, DFlash proposes a full block of tokens in a single forward pass and verifies the whole block at once, accepting 6.3 out of 8 proposed tokens on average in coding tasks.
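The propose-then-verify loop can be illustrated with a toy example. This is a generic greedy speculative-decoding sketch, not DFlash itself: the "models" are hypothetical stand-in functions over integers, the drafter here builds its block step by step rather than in one pass, and accepting the longest matching prefix is an assumption about the verification rule.

```python
from typing import Callable


def accept_prefix(draft: list[int], target: list[int]) -> int:
    """Count leading draft tokens that match the target model's own picks."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def generate(prompt: list[int],
             propose: Callable[[list[int], int], list[int]],
             verify: Callable[[list[int], list[int]], list[int]],
             block_size: int, max_new: int) -> tuple[list[int], int]:
    """Greedy block-speculative decoding.

    `propose(context, k)` returns k draft tokens.
    `verify(context, draft)` returns the target model's token at each of
    the len(draft) + 1 positions, as one batched pass would.
    Returns the new tokens and the number of verification passes used.
    """
    out: list[int] = []
    passes = 0
    while len(out) < max_new:
        context = prompt + out
        draft = propose(context, block_size)
        target = verify(context, draft)
        passes += 1
        n = accept_prefix(draft, target)
        # Keep the agreed prefix, then the target's own token at the first
        # disagreement (or one extra token if the whole block matched).
        out.extend(draft[:n])
        out.append(target[n])
    return out[:max_new], passes


# Hypothetical toy "models": the target counts upward modulo 10, and the
# drafter follows the same rule but is deliberately wrong at 0, 4 and 8.
def target_next(context: list[int]) -> int:
    return (context[-1] + 1) % 10


def draft_next(context: list[int]) -> int:
    guess = (context[-1] + 1) % 10
    return guess if guess % 4 else (guess + 1) % 10


def propose(context: list[int], k: int) -> list[int]:
    ctx = list(context)
    for _ in range(k):
        ctx.append(draft_next(ctx))
    return ctx[len(context):]


def verify(context: list[int], draft: list[int]) -> list[int]:
    return [target_next(context + draft[:i]) for i in range(len(draft) + 1)]


tokens, passes = generate([0], propose, verify, block_size=8, max_new=20)
print(tokens, passes)
```

In this toy run, 20 tokens are produced with 6 verification passes instead of 20, and the output is identical to what the target would have generated one token at a time. The better the drafter, the longer the accepted prefix: the article's figure of 6.3 accepted out of 8 proposed corresponds to an acceptance rate of about 79%.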

Can I use the model right now?

The model weights are already open-sourced on Hugging Face. A limited API trial runs June 9–23, 2026, by application at 3× standard MiMo pricing.

Why does inference speed matter this much?

High inference speed lets dozens of agentic processes run in parallel without perceptible delay. That matters in trading, healthcare, and complex agent pipelines, where every second of latency has a real cost.