Xiaomi and inference partner TileRT have broken the 1,000 tokens-per-second barrier on a trillion-parameter model using eight commodity GPUs, with no custom silicon required. Peak demonstrations showed roughly 1,200 tokens per second. For comparison, GPT-5.5 delivers around 68 tokens per second and Claude Opus 4.6 about 71. That puts the gap at roughly 14 to 18 times, depending on which figures are compared.
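The size of that gap can be checked directly from the throughput figures quoted above. The snippet below is a small arithmetic sketch, not a benchmark: the labels `threshold` and `peak` and the helper names `speedup` and `ms_per_token` are illustrative choices, and the inputs are just the numbers reported in this article.

```python
# Throughput figures quoted in the article, in tokens per second.
XIAOMI_TPS = {"threshold": 1000, "peak": 1200}
BASELINE_TPS = {"GPT-5.5": 68, "Claude Opus 4.6": 71}


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster one throughput is than another."""
    return fast_tps / slow_tps


def ms_per_token(tps: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tps


for label, fast in XIAOMI_TPS.items():
    for name, slow in BASELINE_TPS.items():
        print(f"{label} ({fast} tok/s) vs {name} ({slow} tok/s): "
              f"{speedup(fast, slow):.1f}x")

# Output:
# threshold (1000 tok/s) vs GPT-5.5 (68 tok/s): 14.7x
# threshold (1000 tok/s) vs Claude Opus 4.6 (71 tok/s): 14.1x
# peak (1200 tok/s) vs GPT-5.5 (68 tok/s): 17.6x
# peak (1200 tok/s) vs Claude Opus 4.6 (71 tok/s): 16.9x
```

In per-token terms, 1,000 tokens per second is about 1 ms per token, against roughly 14 to 15 ms at the 68 to 71 tokens-per-second baselines.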
Two techniques stack to produce the result. The first is FP4 quantization of the model's expert layers: weights in the mixture-of-experts components are compressed to 4-bit precision while the rest of the model stays at full precision. The reported result is near-zero quality loss and a dramatic reduction in memory bandwidth pressure. The second is DFlash speculative decoding, which fills an entire block of token positions in a single forward pass instead of generating one token at a time, then verifies the whole block at once. In coding tasks, an average of 6.3 out of every 8 proposed tokens are accepted. Underneath both, the TileRT inference engine keeps the entire compute graph resident on the GPU, eliminating per-operator launch overhead.
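To make the draft-then-verify idea concrete, here is a minimal sketch of generic greedy block speculative decoding. It is not DFlash's or TileRT's implementation. Everything in it is an assumption made for illustration: `target_next` is a toy stand-in for the large model, `draft_block` is a hypothetical drafter that deliberately disagrees with the target at some positions, and the verification loop stands in for what a real system would compute in one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the large model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], block_size: int) -> List[int]:
    """Hypothetical drafter: proposes block_size tokens at once.

    It copies the target but is deliberately wrong whenever the context
    length is 4 mod 5, so that some proposals get rejected.
    """
    ctx, out = list(prefix), []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % VOCAB  # injected drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def verify_block(prefix: List[int], draft: List[int]) -> Tuple[int, List[int]]:
    """Accept the longest draft prefix the target agrees with.

    Returns (number_accepted, tokens_to_append). The appended tokens are
    the accepted drafts plus one token from the target itself: either the
    correction at the first mismatch or a bonus token if all matched.
    """
    ctx, accepted = list(prefix), []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            return len(accepted), accepted + [expected]
        accepted.append(tok)
        ctx.append(tok)
    return len(accepted), accepted + [target_next(ctx)]


def speculative_generate(prompt: List[int], n_tokens: int,
                         block_size: int = 8) -> Tuple[List[int], int, int]:
    """Generate n_tokens; return (tokens, verify_passes, drafts_accepted)."""
    seq, passes, accepted_total = list(prompt), 0, 0
    while len(seq) - len(prompt) < n_tokens:
        draft = draft_block(seq, block_size)
        n_acc, new_tokens = verify_block(seq, draft)
        seq.extend(new_tokens)
        passes += 1
        accepted_total += n_acc
    return seq[len(prompt):len(prompt) + n_tokens], passes, accepted_total


def greedy_generate(prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


tokens, passes, accepted = speculative_generate([1, 2, 3], n_tokens=32)
print(f"{len(tokens)} tokens in {passes} verify passes, "
      f"{accepted} drafted tokens accepted")
```

The property worth noticing is that the speculative output matches plain greedy decoding token for token, while the number of verification passes is far smaller than the number of tokens produced. The higher the acceptance rate, such as the 6.3 out of 8 reported for coding tasks, the more tokens each pass contributes.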
In practice, that means dozens of parallel inferences without perceptible delay, real-time code generation, and genuine applicability where latency costs money: trading, clinical decision support, agentic pipelines. A limited API trial runs through June 23 at three times standard MiMo pricing; the underlying MiMo-V2.5-Pro-FP4-DFlash checkpoint is already open-sourced on Hugging Face.
The more interesting point here isn't the number itself but the hardware behind it. Groq built a custom chip. Cerebras etched a wafer-scale processor. Xiaomi hit 1,200 tokens per second with hardware you can rack today. If algorithmic improvements on commodity silicon can close the gap this fast, the premise that you need bespoke accelerators to win the inference race looks shakier than it did last week.



