Google's Gemma 4 open AI models use "speculative decoding" to run up to 3x faster

Google's Gemma 4 deployment of speculative decoding marks a meaningful efficiency gain in open-weight model inference. The technique uses a smaller draft model to generate candidate tokens, which the full model then verifies in parallel, yielding up to 3x throughput without quality degradation. This matters because inference speed directly drives cost and user experience at scale. For practitioners, it signals that open models can now compete with proprietary systems on latency without sacrificing accuracy, potentially shifting deployment economics across edge and cloud environments.
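For readers unfamiliar with the mechanics, here is a minimal sketch of the greedy variant of speculative decoding. The `draft_next` and `target_next` callables are hypothetical placeholders standing in for a small and a large language model sharing a tokenizer; production implementations verify all draft positions in a single batched forward pass and use rejection sampling to preserve the target model's exact output distribution.

```python
# Minimal sketch of greedy speculative decoding. The model callables here are
# hypothetical placeholders for illustration, not Google's implementation.
from typing import Callable, List

Token = int

def speculative_decode(
    prompt: List[Token],
    draft_next: Callable[[List[Token]], Token],   # cheap model: next greedy token
    target_next: Callable[[List[Token]], Token],  # full model: next greedy token
    k: int = 4,                                   # draft tokens proposed per round
    max_new_tokens: int = 64,
) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k candidate tokens autoregressively (cheap).
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target model checks each position. A real system scores all k
        #    positions in ONE batched forward pass; we loop for clarity.
        accepted, correction = 0, None
        for i in range(k):
            expected = target_next(out + draft[:i])
            if expected == draft[i]:
                accepted += 1          # draft token matches: accepted for free
            else:
                correction = expected  # mismatch: substitute the target's token
                break
        out.extend(draft[:accepted])
        if correction is not None:
            out.append(correction)
    return out[len(prompt):]

if __name__ == "__main__":
    # Toy demo: the "target" counts up mod 10; the "draft" agrees most rounds.
    target = lambda seq: (seq[-1] + 1) % 10
    draft = lambda seq: (seq[-1] + 1) % 10 if len(seq) % 4 else 0
    print(speculative_decode([1, 2, 3], draft, target, k=4, max_new_tokens=12))
```

Every round makes progress: either the target accepts draft tokens (the speedup) or it supplies one token of its own, so the final output is identical to what the target model would have generated alone.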
Modelwire context
Analyst take
Speculative decoding is not a new technique; it has existed in the research literature for years. What's actually new is Google shipping it as a default inference optimization in an open-weight release, which means the efficiency gain is available to any practitioner pulling the model, not just to teams with the engineering resources to implement it themselves.
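As a concrete illustration of what "available to any practitioner pulling the model" means, the sketch below uses Hugging Face transformers' assisted generation, where passing an `assistant_model` to `generate()` enables draft-model speculation. The Gemma 4 checkpoint IDs are hypothetical placeholders, not confirmed names.

```python
# Hedged sketch: enabling draft-model speculation via transformers' assisted
# generation. The checkpoint IDs below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-4-large"  # hypothetical: the full model
DRAFT_ID = "google/gemma-4-small"   # hypothetical: the smaller draft model

tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

# assistant_model drafts candidate tokens; the target model verifies them,
# so the output matches what the target model alone would produce.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

The design point for practitioners: when the optimization ships bundled with the weights, the speedup is a one-argument change rather than a custom inference-engine project.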
This connects directly to the pattern Modelwire flagged in early May when covering Xiaomi's MiMo-V2.5-Pro: the open-weight competition is shifting from raw capability benchmarks toward operational economics, specifically cost-per-inference and latency at deployment. Gemma 4's throughput gains reinforce that thesis. Where Xiaomi attacked the token efficiency angle, Google is attacking the inference speed angle. Both moves pressure closed-weight providers on the same axis: the total cost of running a capable model in production. The two stories together suggest an unplanned but converging market squeeze on proprietary API pricing.
Watch whether Mistral or Meta follow Gemma 4's lead by shipping speculative decoding as a bundled default in their next open-weight releases within the next two quarters. If they do, inference speed stops being a differentiator and becomes table stakes, which forces the competition back onto quality and context length.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Google · Gemma 4 · speculative decoding
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on arstechnica.com. If you’re a publisher and want a different summarization policy for your work, see our takedown page.