
Compute Where it Counts: Self Optimizing Language Models
Researchers propose Self-Optimizing Language Models, a technique that allocates compute dynamically across decoding steps instead of applying a uniform compression budget. A lightweight policy network learns to adjust token-level attention sparsity and MLP pruning according to the difficulty signaled by the hidden state. This addresses a fundamental inefficiency in current inference optimization: easy tokens waste compute while hard ones are starved of it. The approach shifts inference optimization from static compression toward adaptive, learned allocation, potentially unlocking significant speedups without retraining the frozen base model.
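
To make the idea of per-step allocation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not describe the policy architecture, so the linear probe with a sigmoid (`difficulty_score`), the linear budget rule (`allocate`), the top-k masking (`top_k_mask`), and every name and default value below are hypothetical choices made only for illustration.

```python
import math


def difficulty_score(hidden, weights, bias):
    """Hypothetical policy: a linear probe plus sigmoid on the hidden state.

    Returns a difficulty estimate in (0, 1).
    """
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def allocate(difficulty, min_keep=0.1, max_keep=1.0):
    """Map difficulty to the fraction of attention entries / MLP units to keep."""
    return min_keep + (max_keep - min_keep) * difficulty


def top_k_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the ceil(keep_fraction * n) largest scores."""
    n = len(scores)
    k = min(n, max(1, math.ceil(keep_fraction * n)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)]


def plan_step(hidden, attn_scores, mlp_saliency, weights, bias):
    """Decide, for one decoding step, which attention entries and MLP units run."""
    d = difficulty_score(hidden, weights, bias)
    keep = allocate(d)
    return {
        "difficulty": d,
        "attn_mask": top_k_mask(attn_scores, keep),
        "mlp_mask": top_k_mask(mlp_saliency, keep),
    }


if __name__ == "__main__":
    probe = [1.0, 1.0, 1.0, 1.0]  # assumed, untrained probe weights
    attn = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
    mlp = [0.5, 0.1, 0.9, 0.3]
    for name, hidden in [("easy", [-2.0] * 4), ("hard", [2.0] * 4)]:
        plan = plan_step(hidden, attn, mlp, probe, 0.0)
        print(name, sum(plan["attn_mask"]), sum(plan["mlp_mask"]))
```

In this sketch a step judged easy keeps only a handful of attention entries and MLP units, while a step judged hard keeps nearly all of them, and the base model's weights are never modified. A learned policy would replace the fixed probe weights, and the budget rule could be swapped for any monotone mapping.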
