Research Tools & Code·arXiv cs.LG·1d ago

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

Researchers propose Gain-Shape K-means, a refinement to residual quantization that addresses a geometric flaw in standard clustering for KV cache compression. The core insight targets centroid shrinkage in high dimensions, which degrades directional fidelity during vector quantization. This work directly tackles a bottleneck in deploying LLMs with long context windows by enabling sub-1-bit KV cache storage, reducing memory overhead that currently scales linearly with sequence length. For production systems running extended-context inference, even marginal improvements in quantization efficiency compound across billions of tokens.

Modelwire context

Explainer

The 'sub-1-bit' framing deserves unpacking: this is achieved through residual quantization stacking multiple codebook lookups, where each successive stage compresses the error left by the previous one, so the effective bits-per-element across the chain falls below one. The geometric fix here targets a specific failure mode where high-dimensional centroids collapse toward the origin, distorting the directional information that attention mechanisms actually depend on.

This connects directly to the RAG context-budget work covered in 'What Survives Into Context,' which framed token budgets as hard production constraints rather than soft targets. KV cache compression is the supply-side answer to that same constraint: if you can store more context history in the same memory envelope, the retrieval and packing problems become less acute. Both papers are essentially attacking the same long-context bottleneck from opposite ends. The RLHF staleness-scaling coverage is less directly relevant here, though both touch on the compounding costs of running large models at throughput.

Watch whether implementations of GSRQ appear in inference frameworks like vLLM or SGLang within the next two quarters. Adoption there would confirm the technique is robust enough for production memory profiles, not just controlled benchmark conditions.

Coverage we drew on

What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

MentionsGain-Shape K-means · Residual Quantization · KV Cache · Vector Quantization

Read full story at arXiv cs.LG →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Research