Morning Overview

Google researchers say a new compression trick could make large language models run up to 3x faster with no loss in accuracy

A team of Google researchers has published a technique that could let developers squeeze roughly three times more throughput out of large language models on the same hardware, with what the authors describe as zero accuracy loss. The method, called TurboQuant, targets one of the biggest memory bottlenecks in AI text generation: the key-value (KV) cache, a data structure that stores the results of prior computations so a model doesn’t have to redo them for every new word it produces.
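
To make the mechanism concrete, here is a minimal sketch (plain NumPy, a single attention head, made-up dimensions, not code from any of the papers) of how a decoder's KV cache accumulates: each generated token appends one key and one value, and every subsequent step re-reads the whole cache.

```python
import numpy as np

# Hypothetical single-head attention with a growing KV cache.
# The dimensions are illustrative, not taken from any real model.
head_dim = 128
k_cache, v_cache = [], []   # grows by one entry per generated token

def decode_step(query, new_key, new_value):
    """One generation step: append to the cache, then attend over all of it."""
    k_cache.append(new_key)
    v_cache.append(new_value)
    K = np.stack(k_cache)                      # (seq_len, head_dim)
    V = np.stack(v_cache)                      # (seq_len, head_dim)
    scores = K @ query / np.sqrt(head_dim)     # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                         # attention output for this token

for _ in range(5):                             # generate a few tokens
    out = decode_step(np.random.randn(head_dim),
                      np.random.randn(head_dim),
                      np.random.randn(head_dim))
```

Because the cache is re-read at every step and grows with every token, both its memory footprint and the bandwidth needed to stream it scale with context length, which is the bottleneck TurboQuant targets.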

The paper, posted to arXiv in April 2025, arrives as companies and open-source developers race to run powerful models on smaller, cheaper hardware. If the results hold up under independent testing, TurboQuant could meaningfully reduce the cost of serving chatbots, coding assistants, and other AI tools that depend on long-context generation.

How TurboQuant works

At its core, TurboQuant is a two-stage compression pipeline. The first stage applies a mean-squared-error quantizer to the KV cache, shrinking each stored value from the standard 16-bit floating-point format down to a handful of bits. This step builds on a prior Google technique called PolarQuant (published in February 2025), which converts cache entries into polar coordinates so their angular distribution can be quantized cleanly. PolarQuant alone reported more than 4.2x KV-cache compression with strong scores on long-context benchmarks.
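
The polar-coordinate idea can be illustrated with a toy example (a simplified sketch, not PolarQuant's actual algorithm; pairing up dimensions, using a 4-bit angle code, and keeping the radius in full precision are choices made here purely to keep the snippet short):

```python
import numpy as np

def polar_quantize(vec, angle_bits=4):
    """Toy polar quantizer: pair up dimensions, keep each pair's radius in
    full precision, and quantize its angle to 2**angle_bits levels."""
    pairs = vec.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])            # in (-pi, pi]
    levels = 2 ** angle_bits
    code = np.round((angle + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.uint8)
    return radius, code

def polar_dequantize(radius, code, angle_bits=4):
    levels = 2 ** angle_bits
    angle = code / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return pairs.reshape(-1)

key = np.random.randn(128).astype(np.float32)
r, c = polar_quantize(key)
approx = polar_dequantize(r, c)
print("relative error:", np.linalg.norm(key - approx) / np.linalg.norm(key))
```

The toy leaves the radius uncompressed, so it saves far less memory than the reported 4.2x; it is only meant to show how an angular representation lends itself to coarse quantization.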

The second stage adds a 1-bit correction layer based on the Quantized Johnson-Lindenstrauss (QJL) transform, a mathematical technique that uses random projections to preserve the relationships between data points even after aggressive compression. This residual step compensates for the small biases introduced by the first stage’s coarse quantization, without requiring the full-precision offsets that would eat into memory savings. The QJL paper, published in June 2024, demonstrated more than 5x KV-cache memory reduction at 3 bits and released a working CUDA implementation for NVIDIA GPUs.
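
The 1-bit idea behind QJL can also be sketched in a few lines (a simplified illustration using a standard sign-of-random-projection estimator, not the released CUDA kernel, and with arbitrary dimensions): each key is projected through a shared random matrix and only the signs are stored, along with the key's norm, yet attention scores against a query can still be estimated up to a known constant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512                    # key dimension and projection size (illustrative)
S = rng.standard_normal((m, d))    # shared random projection, stored once

def encode_1bit(key):
    """Store only the sign bits of the projected key, plus the key's norm."""
    return np.sign(S @ key).astype(np.int8), np.linalg.norm(key)

def estimate_score(query, signs, key_norm):
    """Estimate <query, key> from the 1-bit code (unbiased up to the constant)."""
    return np.sqrt(np.pi / 2) * key_norm * np.mean((S @ query) * signs)

key, query = rng.standard_normal(d), rng.standard_normal(d)
signs, key_norm = encode_1bit(key)
print("true:", query @ key, " estimated:", estimate_score(query, signs, key_norm))
```

The appeal is that m sign bits plus one scalar per key is extremely cheap to store, and the estimate is unbiased, so errors tend to wash out across a long cache.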

Combined, TurboQuant achieves what the authors call “absolute quality neutrality” at roughly 3.5 bits per channel. In plain terms: the compressed model’s outputs are statistically indistinguishable from the uncompressed version’s, at least across the benchmarks tested. Since the KV cache often dominates GPU memory during long text generation, shrinking it by roughly 4.5x means developers can process longer prompts, serve more users simultaneously, or run models on hardware that would otherwise be too small.
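
The arithmetic behind that figure is simple: 16 bits divided by roughly 3.5 bits is about 4.5. A back-of-envelope calculation (treating each stored value as one channel and using hypothetical dimensions for a 7-billion-parameter-class model, not numbers from the paper) shows the practical stakes:

```python
# Hypothetical 7B-class decoder: 32 layers, 32 heads, head dimension 128.
layers, heads, head_dim = 32, 32, 128
seq_len = 32_000                      # a long-context prompt
bits_fp16, bits_turbo = 16, 3.5       # bits per stored channel

# Keys and values are both cached, hence the factor of 2.
channels = 2 * layers * heads * head_dim * seq_len

gib = lambda bits: bits / 8 / 2**30
print(f"fp16 KV cache:   {gib(channels * bits_fp16):.1f} GiB")
print(f"~3.5-bit cache:  {gib(channels * bits_turbo):.1f} GiB")
```

In this made-up example the cache drops from roughly 16 GiB to under 4 GiB, the difference between a workload that spills out of a single consumer GPU once the weights are counted and one that fits.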

What “3x faster” actually means

The speed claim deserves careful unpacking. TurboQuant does not make the model’s core math run three times faster. Instead, it dramatically shrinks the KV cache, which relieves a bottleneck that often limits how quickly a GPU can generate text. In workloads where memory bandwidth or capacity is the constraining factor, reading fewer bytes per token and fitting more sequences into memory can translate into roughly 3x higher throughput, meaning the system handles three times as many tokens per second.
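
A rough way to see the mechanics (a back-of-envelope model with invented numbers, not a benchmark): in the memory-bound regime, every decode step re-reads the model weights once for the whole batch and each sequence's KV cache separately, so throughput is approximately bandwidth divided by bytes moved per token.

```python
# Invented numbers for illustration: ~1 TB/s of GPU memory bandwidth,
# 14 GB of weights, and a 2 GB fp16 KV cache per sequence in the batch.
bandwidth = 1e12          # bytes/s
weights = 14e9            # read once per decode step, shared by the batch
kv_per_seq = 2e9          # read separately for every sequence, every step
batch = 32

def tokens_per_sec(kv_bytes):
    bytes_per_step = weights + batch * kv_bytes
    return batch * bandwidth / bytes_per_step

print(f"fp16 cache:       ~{tokens_per_sec(kv_per_seq):.0f} tokens/s")
print(f"~4.5x compressed: ~{tokens_per_sec(kv_per_seq / 4.5):.0f} tokens/s")
```

With these invented numbers the cache dominates the memory traffic and the speedup approaches 3x; shrink the batch or the context and the advantage fades, which is exactly the caveat the next paragraph spells out.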

But end-to-end speed depends on many variables: model size, attention implementation, the serving framework, and whether the workload is actually memory-bound in the first place. The underlying papers emphasize compression ratios (4.2x, 5x) rather than comprehensive wall-clock latency benchmarks on production systems. For a startup running a 7-billion-parameter model on a single consumer GPU, the gains could be transformative. For a large cloud deployment already optimized with custom kernels and high-bandwidth interconnects, the improvement may be more modest.

Where the evidence stands as of June 2025

The strongest claims in the TurboQuant paper have not yet been replicated by independent labs. The 3.5-bit quality-neutrality result comes entirely from the authors’ own benchmarks, which cover a specific set of models, datasets, and evaluation metrics. No third-party evaluation has confirmed that the same compression level holds across different architectures, instruction-tuned variants, or real-world workloads such as retrieval-augmented generation and multi-document summarization, which can produce token distributions quite different from those in standard benchmarks.

Google has not publicly tied TurboQuant to any specific product, service, or model release. There is no indication it has been integrated into widely used inference runtimes or official toolchains. For now, TurboQuant should be treated as a research contribution, not a production-ready feature.

It is also worth noting what TurboQuant does not address. It compresses the KV cache, but the KV cache is only one component of a model’s total memory footprint. Weight quantization, a separate and more mature field with established tools like GPTQ and AWQ, targets the model’s parameters themselves. TurboQuant and weight quantization solve different problems, and combining them is an open engineering challenge.

A related but distinct technique, speculative decoding, attacks inference speed from the computation side rather than the memory side. In speculative decoding, a smaller “draft” model proposes several tokens at once, and the full-size model verifies or rejects them in a single pass. Because the verification preserves the original model’s output distribution, the final text is identical to what the larger model would have produced on its own, just generated with fewer forward passes. The foundational version of this approach was described by Yaniv Leviathan, Matan Kalman, and Yossi Matias of Google Research in their 2023 paper “Fast Inference from Transformers via Speculative Decoding,” published at the International Conference on Machine Learning. In principle, pairing speculative decoding with TurboQuant-style cache compression could compound the gains, but no published experiments have tested the two together, and the engineering complexity of layering them is nontrivial.
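
The mechanics are easy to sketch (a simplified greedy variant; the published algorithm accepts or rejects drafts probabilistically against full token distributions, and the model interfaces here are hypothetical):

```python
def speculative_decode_step(draft_model, target_model, context, k=4):
    """Toy greedy variant of speculative decoding.

    draft_model and target_model are assumed to be callables that take a
    list of tokens and return the single most likely next token. These are
    hypothetical interfaces for illustration; real implementations operate
    on probability distributions and batch the verification.
    """
    # 1. The cheap draft model proposes k tokens, one at a time.
    proposed = []
    for _ in range(k):
        proposed.append(draft_model(context + proposed))

    # 2. The expensive target model checks every drafted position
    #    (in a real system this is a single batched forward pass).
    accepted = []
    for i, token in enumerate(proposed):
        expected = target_model(context + proposed[:i])
        if token == expected:
            accepted.append(token)       # draft agreed with the big model
        else:
            accepted.append(expected)    # take the big model's token and stop
            break
    else:
        # Every draft was accepted, so the verification pass yields one
        # extra token "for free" beyond the drafted run.
        accepted.append(target_model(context + proposed))

    return accepted  # matches what the target model alone would have produced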

What developers can do right now

For practitioners who want to experiment, the most concrete starting point is the QJL paper’s open-source CUDA kernel. Teams running memory-constrained inference on NVIDIA hardware can integrate it into their attention implementations, measure the effects on throughput and output quality, and compare against uncompressed baselines. From there, pairing a PolarQuant-style first-stage transformation with a QJL-based residual can approximate TurboQuant’s full pipeline, though developers will need to tune hyperparameters for their specific models and context lengths.
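
Under those caveats, a toy version of the combined pipeline might look like the sketch below (a plain uniform quantizer stands in for the polar-coordinate first stage purely to keep the code self-contained, and the bit width and projection dimension are arbitrary; this approximates the two-stage structure described above, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 256                      # key dimension and projection size (arbitrary)
S = rng.standard_normal((m, d))      # shared random projection for the residual

def two_stage_compress(key, bits=3):
    """Toy two-stage compressor: a coarse uniform quantizer (stage 1) plus a
    1-bit random-projection code of the leftover error (stage 2)."""
    scale = np.abs(key).max() / (2 ** (bits - 1) - 1)
    coarse = np.round(key / scale).astype(np.int8)            # stage 1 codes
    residual = key - coarse * scale
    signs = np.sign(S @ residual).astype(np.int8)             # stage 2: 1 bit each
    return coarse, scale, signs, np.linalg.norm(residual)

def approx_score(query, coarse, scale, signs, res_norm):
    """Estimate <query, key> as <query, coarse reconstruction> plus a
    sign-projection estimate of <query, residual>."""
    main = query @ (coarse * scale)
    correction = np.sqrt(np.pi / 2) * res_norm * np.mean((S @ query) * signs)
    return main + correction

key, query = rng.standard_normal(d), rng.standard_normal(d)
print("true score:     ", query @ key)
print("two-stage score:", approx_score(query, *two_stage_compress(key)))
```

In a real integration the coarse codes and sign bits would be packed tightly and the dot products fused into the attention kernel; the sketch is only meant to show the division of labor between a coarse quantizer and a cheap 1-bit residual correction.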

As of June 2025, no public developer commentary or independent benchmark results for TurboQuant have surfaced on major forums such as Hugging Face, Reddit’s r/LocalLLaMA, or MLCommons. The absence of community reaction is itself notable: it suggests the technique has not yet reached the hands-on experimentation phase where practitioners typically share results, file issues, or propose integrations. Until that feedback loop begins, the paper’s claims rest solely on its authors’ reported experiments.

The broader takeaway is that KV-cache quantization is maturing rapidly as a practical lever for inference efficiency. TurboQuant’s reported results suggest that substantial memory savings are achievable without obvious quality degradation, at least under controlled conditions. But until independent replications, broader open-source implementations, and production-scale benchmarks appear, the technique is best understood as a promising direction rather than a proven solution. Developers who explore it should validate both accuracy and performance in their own environments and be prepared to dial back compression if subtle regressions surface in their specific use cases.

*This article was researched with the help of AI, with human editors creating the final content.