
Google says multi-token prediction approach speeds up Gemma 4 inference

The update targets developer workstations, mobile devices and cloud environments

by Defused News Writer

Google has announced a performance-focused update to Gemma 4, its open-weight artificial intelligence model family, using multi-token prediction drafters to speed up inference times across a range of hardware.

The technique, which allows a model to predict several output tokens simultaneously rather than one at a time, is designed to reduce the latency users experience when running Gemma 4 on developer workstations, mobile devices and cloud infrastructure.

Google framed the update as a follow-up to the Gemma 4 launch rather than a standalone product release.

"Just a few weeks ago, we introduced Gemma 4, our most capable open models to date," the company said in a blog post announcing the change.

The company pointed to rapid early adoption as evidence of demand, claiming more than 60 million downloads in the first few weeks following Gemma 4's initial release.

"Gemma 4 is delivering unprecedented intelligence-per-parameter to developer workstations, mobile devices and the cloud," it said.

Multi-token prediction has attracted growing interest across the AI industry as developers look for ways to make large language models faster and cheaper to run without sacrificing output quality.

The approach works by training a smaller "drafter" model to propose multiple tokens at once, which the main model then verifies, cutting the number of sequential generation steps required and reducing overall response times.
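The draft-and-verify loop described above can be sketched in a few lines. This is an illustrative toy, not Google's implementation: the "models" here are simple deterministic rules (each predicts "last token + 1"), and all function names are hypothetical. The key idea it demonstrates is that one verification pass of the expensive target model can accept several drafted tokens at once, so the number of sequential target-model steps drops well below the number of tokens generated.

```python
def drafter(ctx, k):
    # Cheap drafter: proposes the next k tokens in one go.
    # Toy rule (matches the target here, so drafts are accepted).
    out, last = [], ctx[-1]
    for _ in range(k):
        last += 1
        out.append(last)
    return out

def target_verify(ctx, draft):
    # Stands in for ONE parallel forward pass of the large model:
    # it scores every drafted position at once, accepts the longest
    # agreeing prefix, and emits its own token at the first mismatch.
    accepted, last = [], ctx[-1]
    for tok in draft:
        expected = last + 1          # toy target rule: next = last + 1
        accepted.append(expected)
        last = expected
        if tok != expected:          # first disagreement ends the draft
            break
    return accepted

def speculative_decode(ctx, n_new, k=4):
    # Generate n_new tokens, counting sequential target-model passes.
    ctx = list(ctx)
    passes = 0
    while n_new > 0:
        draft = drafter(ctx, k)
        accepted = target_verify(ctx, draft)[:n_new]
        passes += 1
        ctx.extend(accepted)
        n_new -= len(accepted)
    return ctx, passes

seq, passes = speculative_decode([0], 12, k=4)
# 12 tokens generated in 3 verification passes instead of 12 sequential steps
```

Because the drafter in this toy always agrees with the target, every 4-token draft is fully accepted; with a real drafter, mismatches truncate drafts and the speedup depends on the acceptance rate.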

For developers running models locally on laptops or deploying them to mobile applications, where computational resources are more constrained than in data centre environments, such speed gains can make a material difference to usability.

Google did not publish specific benchmark figures alongside the announcement, nor did it provide a detailed rollout timeline for the update.

The absence of hard performance numbers means it is difficult to assess the scale of the improvement or how it compares with competing inference optimisation techniques used by rival model providers.

The update arrives as competition intensifies among AI companies to make open-weight models more practical for real-world deployment, with Meta's Llama family and Mistral's models among those vying for developer attention.

Speed and efficiency have become increasingly important differentiators as the initial race to build the largest and most capable models gives way to a focus on making existing architectures run well on everyday hardware.

Google's decision to emphasise breadth of deployment, spanning workstations, phones and cloud servers, signals an intent to position Gemma as a generalist open model rather than one optimised for a single use case.

Whether the multi-token prediction drafters deliver meaningful gains in practice will depend on benchmarks and developer feedback that have yet to materialise publicly.

The recap

  • Introduces multi-token prediction drafters to accelerate model inference.
  • Gemma 4 reached over 60 million downloads in weeks.
  • Targets developer workstations, mobile devices and the cloud.
