Research Hardware & Infra · arXiv cs.LG · May 3

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

SplitZip targets an infrastructure bottleneck in disaggregated LLM serving: transferring the KV cache from prefill workers to decode workers on separate machines. As production deployments scale to longer contexts and agentic workloads, this transfer latency directly affects end-to-end serving performance. The paper proposes a lossless compression codec designed for online use, filling the gap left by existing offline-oriented schemes that either run on the CPU or use variable-length encoding ill-suited to real-time inference. The work matters to anyone operating multi-machine LLM clusters where transfer bandwidth, not compute, is the constraint.

Modelwire context

Explainer

The key detail the summary gestures at but doesn't unpack: existing compression schemes fail here not because they're slow, but because variable-length output breaks the fixed-memory assumptions that GPU-side decode workers depend on. SplitZip's contribution is specifically a codec that both produces fixed-length output and runs on the GPU, a narrow pair of constraints that is genuinely hard to satisfy at once.
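To make the fixed-memory point concrete, here is a minimal Python sketch of the buffer-layout problem. It is not SplitZip's algorithm, which this summary does not describe. It uses the standard-library `zlib` as a stand-in for any variable-length codec, and `BLOCK_BYTES`, `SLOT_BYTES`, and `make_block` are hypothetical names and sizes chosen only for illustration.

```python
import random
import zlib

BLOCK_BYTES = 4096  # hypothetical size of one uncompressed KV block
SLOT_BYTES = 3072   # hypothetical fixed size of one compressed slot


def make_block(seed: int, distinct_values: int) -> bytes:
    """Build a synthetic block; fewer distinct byte values means lower entropy."""
    rng = random.Random(seed)
    return bytes(rng.randrange(distinct_values) for _ in range(BLOCK_BYTES))


def prefix_offsets(sizes):
    """Offset of each block when blocks are packed back to back."""
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


# Three blocks of identical uncompressed size but different content.
blocks = [make_block(0, 4), make_block(1, 64), make_block(2, 256)]

# Variable-length codec: the output size depends on the data, so the
# receiver cannot know where block i starts until blocks 0..i-1 are encoded.
compressed = [zlib.compress(b) for b in blocks]
variable_sizes = [len(c) for c in compressed]
variable_offsets = prefix_offsets(variable_sizes)

# Fixed-length layout: every slot has the same size, so the offset of
# block i is known before any data is encoded and can be preallocated.
fixed_offsets = [i * SLOT_BYTES for i in range(len(blocks))]

print("variable sizes:  ", variable_sizes)
print("variable offsets:", variable_offsets)
print("fixed offsets:   ", fixed_offsets)
```

With variable-length output, each destination offset depends on a running sum over earlier blocks, which serializes the layout and prevents exact preallocation. With fixed-length slots the offsets are pure arithmetic on the block index, which is the property the explainer attributes to SplitZip's design.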

This sits in a cluster of recent work attacking KV cache from different angles. The LightKV paper from May 1st ('Make Your LVLM KV Cache More Lightweight') targets the same resource class, GPU memory pressure during inference, but approaches it through token-level redundancy elimination rather than transfer compression. Together they suggest the field is converging on KV cache as the primary optimization surface for production inference, with different papers carving out different points in the pipeline: what you store, how you compress it, and now how you move it between machines.

The practical test is whether SplitZip's compression ratios hold on long-context workloads above 32k tokens, where KV cache transfer costs are highest and where the fixed-length constraint is hardest to satisfy without padding overhead eating the gains. If a follow-up benchmark from a production operator (Anyscale, Together, or similar) reproduces the throughput numbers on real traffic distributions, the approach is credible at scale.
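For a sense of scale at that context length, the following back-of-the-envelope Python sketch estimates KV cache size and transfer time. Every number in it is an assumption for illustration, not a figure from the paper: a hypothetical model with 32 layers, 8 KV heads, a head dimension of 128, and 16-bit values, and a hypothetical 25 GB/s link. The `compression_ratio` parameter is left at 1.0 because the summary reports no ratio, and codec run time is ignored.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size for standard attention: keys plus values for every
    layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes, link_bytes_per_s, compression_ratio=1.0):
    """Time on the wire only; ignores encode and decode cost."""
    return num_bytes / compression_ratio / link_bytes_per_s


# Hypothetical model shape and context length (assumptions, not from the paper).
size = kv_cache_bytes(
    tokens=32_768, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2
)
LINK_BYTES_PER_S = 25e9  # hypothetical 25 GB/s interconnect

print(f"KV cache: {size / 2**30:.1f} GiB")  # KV cache: 4.0 GiB
print(f"uncompressed transfer: {transfer_seconds(size, LINK_BYTES_PER_S):.3f} s")
```

Under these assumptions a single 32k-token request carries 4 GiB of KV cache, and wire time scales inversely with whatever compression ratio a codec achieves, which is why the ratio on long contexts is the number to watch.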

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.


