Research Tools & Code · arXiv cs.LG · 4d ago

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

XFP inverts the usual LLM weight-quantization workflow: instead of engineers choosing bit-widths and calibration strategies upfront, the system accepts a quality target per layer and automatically determines codebook size, outlier budget, and compression ratio. It separates sparse high-magnitude weights as fp16 residuals and packs the remainder into learned per-group codebooks, which removes manual tuning and Hessian computation while achieving competitive decode throughput on 122B-parameter models. This shift toward specification-driven quantization could change how practitioners approach inference optimization, particularly for mixture-of-experts architectures, where layer heterogeneity demands adaptive strategies.
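To make the decomposition concrete, here is a minimal NumPy sketch of the two ingredients named above: the largest-magnitude weights in a group are kept as fp16 values, and the rest are mapped to a small learned codebook. It is an illustration under stated assumptions, not XFP's implementation. The function names, the group size of 256, the 16-entry codebook, and the use of 1-D Lloyd (k-means) iteration to learn the codebook are all hypothetical choices made for this example.

```python
import numpy as np

def lloyd_codebook(x, n_codes, n_iter=25):
    """1-D Lloyd (k-means) iteration with quantile initialisation (assumed learner)."""
    codes = np.quantile(x, (np.arange(n_codes) + 0.5) / n_codes)
    for _ in range(n_iter):
        assign = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
        for c in range(n_codes):
            members = x[assign == c]
            if members.size:              # leave empty clusters where they are
                codes[c] = members.mean()
    return codes

def quantize_group(w, n_codes=16, outlier_frac=0.0):
    """Quantize one weight group; returns (reconstruction, number of outliers)."""
    w = np.asarray(w, dtype=np.float32)
    k = int(round(outlier_frac * w.size))
    outlier = np.zeros(w.size, dtype=bool)
    if k > 0:
        # Mark the k largest-magnitude weights as outliers.
        outlier[np.argpartition(np.abs(w), w.size - k)[-k:]] = True
    # Learn the codebook on the remaining (inlier) weights only.
    codes = lloyd_codebook(w[~outlier], n_codes)
    idx = np.abs(w[:, None] - codes[None, :]).argmin(axis=1)
    recon = codes[idx].astype(np.float32)
    # Outliers bypass the codebook and are stored at fp16 precision.
    recon[outlier] = w[outlier].astype(np.float16).astype(np.float32)
    return recon, k

# Synthetic group: Gaussian weights plus four hand-placed outliers.
rng = np.random.default_rng(0)
group = rng.normal(0.0, 1.0, 256).astype(np.float32)
group[:4] = [20.0, -18.0, 22.0, -25.0]

plain, _ = quantize_group(group, n_codes=16, outlier_frac=0.0)
split, k = quantize_group(group, n_codes=16, outlier_frac=4 / 256)
mse_plain = float(np.mean((group - plain) ** 2))
mse_split = float(np.mean((group - split) ** 2))
print(f"outliers kept in fp16: {k}")
print(f"MSE without separation: {mse_plain:.5f}")
print(f"MSE with separation:    {mse_split:.5f}")
```

On this synthetic group, separating the outliers leaves all 16 codebook entries for the bulk of the distribution, so the reconstruction error should drop compared with forcing the codebook to cover the extremes as well.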

Modelwire context

Explainer

The underappreciated detail here is that XFP skips Hessian computation entirely. That is the expensive calibration step that makes methods like GPTQ slow to apply at scale, so the omission carries much of the weight behind the paper's efficiency claims and deserves more scrutiny than the throughput numbers alone.

The related coverage this week is largely disconnected from XFP's concerns. The trajectory-forecasting paper on NBA movement and the trust-region optimizer for distributed training both touch neural network methodology, but neither addresses inference-time compression or the specific pressures of deploying hundred-billion-parameter models. XFP belongs to a separate and active thread in the field: making large models cheap enough to serve without retraining, a problem that has grown more acute as mixture-of-experts architectures like the Qwen3.5-122B-A10B referenced here become more common. The heterogeneity of MoE layers is precisely what makes fixed-bit quantization brittle, and that is the practical gap XFP is targeting.
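The brittleness of a fixed bit-width can be illustrated with a toy search that takes a quality target and returns the smallest codebook meeting it. This is a hedged sketch, not the paper's procedure: the relative-MSE metric, the target value of 0.01, the candidate sizes, the function names, and the two synthetic "layers" are all assumptions chosen for illustration, and the outlier budget is left out for brevity.

```python
import numpy as np

def lloyd_rel_mse(w, n_codes, n_iter=25):
    """Fit a 1-D Lloyd codebook and return MSE divided by the variance of w."""
    codes = np.quantile(w, (np.arange(n_codes) + 0.5) / n_codes)
    for _ in range(n_iter):
        assign = np.abs(w[:, None] - codes[None, :]).argmin(axis=1)
        counts = np.bincount(assign, minlength=n_codes)
        sums = np.bincount(assign, weights=w, minlength=n_codes)
        filled = counts > 0               # skip empty clusters
        codes[filled] = sums[filled] / counts[filled]
    assign = np.abs(w[:, None] - codes[None, :]).argmin(axis=1)
    return float(np.mean((w - codes[assign]) ** 2) / np.var(w))

def smallest_codebook(w, target, candidates=(2, 4, 8, 16, 32, 64, 128, 256)):
    """Return (n_codes, rel_mse) for the first candidate meeting the target."""
    for n_codes in candidates:
        err = lloyd_rel_mse(w, n_codes)
        if err <= target:
            return n_codes, err
    return n_codes, err   # target not reached: report the largest candidate tried

# Two synthetic "layers" with different weight distributions (assumed shapes).
rng = np.random.default_rng(0)
layers = {
    "uniform_like": rng.uniform(-1.0, 1.0, 2048),
    "heavy_tailed": rng.laplace(0.0, 1.0, 2048),
}
target = 0.01
chosen = {name: smallest_codebook(w, target) for name, w in layers.items()}
for name, (n_codes, err) in chosen.items():
    print(f"{name}: {n_codes} codes, relative MSE {err:.4f}")
```

The point of the sketch is that the same target can translate into different codebook sizes for differently shaped layers, which a single global bit-width cannot express.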

Watch whether independent reproducers can match the reported throughput on non-Qwen MoE architectures within the next few months. If the quality-target specification holds across architectures with different sparsity profiles, the Hessian-free claim becomes much more credible.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: XFP · Qwen3.5-122B-A10B · Lloyd · MoE

