When are likely answers right? On Sequence Probability and Correctness in LLMs

A new study quantifies the relationship between sequence probability and correctness across decoding methods, models, and benchmarks, showing when an LLM's own likelihood estimates actually predict accurate outputs. The research tests this alignment at several granularities: across decoding strategies, across hyperparameter tuning, for individual prompt-answer pairs, and over repeated generations. The work matters because most modern sampling and beam-search techniques assume that higher probability correlates with better answers, yet that assumption remains largely unvalidated at scale. Knowing where it breaks down could change how practitioners select decoding methods and could inform better confidence calibration for production systems.
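To make the quantity under discussion concrete, here is a minimal sketch of how sequence probability is usually scored and compared against correctness. It is not the study's method: the toy token log-probabilities and correctness labels are invented for illustration, and the helper names (`sequence_logprob`, `pearson`) are hypothetical.

```python
import math

def sequence_logprob(token_logprobs):
    """Log-probability of a whole sequence: the sum of its per-token log-probs."""
    return sum(token_logprobs)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: (per-token log-probs, 1 if the answer was correct else 0).
toy_generations = [
    ([-0.1, -0.2, -0.1], 1),
    ([-0.3, -0.4], 1),
    ([-1.2, -0.9, -1.5], 0),
    ([-0.2, -0.1, -0.3, -0.2], 0),  # fairly likely, yet wrong
]

scores = [sequence_logprob(logprobs) for logprobs, _ in toy_generations]
labels = [correct for _, correct in toy_generations]

# With 0/1 labels this is the point-biserial correlation between
# sequence log-probability and correctness.
correlation = pearson(scores, labels)
print(f"correlation between log-probability and correctness: {correlation:.3f}")
```

A positive value on real data would support the assumption that likelier answers tend to be right; the last toy entry shows the kind of high-probability error that weakens it.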

Modelwire context

Explainer

The buried issue here is calibration, not just decoding strategy. Most production systems treat high sequence probability as a proxy for reliability, and this paper is one of the first to stress-test that proxy systematically across benchmarks rather than on a single task or model family.
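One common way to put a number on calibration is expected calibration error (ECE), sketched below. This is an illustration only and not something the paper is stated to use: it assumes each answer comes with a confidence in [0, 1] (for example the exponentiated sequence log-probability) and a 0/1 correctness label, and the function name and bin count are arbitrary choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(is_correct for _, is_correct in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece

# Hypothetical confidences and correctness labels, for illustration only.
toy_confidences = [0.95, 0.90, 0.85, 0.30, 0.20]
toy_correct = [1, 0, 1, 0, 0]
print(f"ECE on toy data: {expected_calibration_error(toy_confidences, toy_correct):.3f}")
```

A score near zero means stated confidence tracks observed accuracy; a large score means that treating probability as a proxy for reliability would mislead.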

This connects most directly to the RiVER paper covered the same day ('Reinforcement Learning without Ground-Truth Solutions can Improve LLMs'), which sidesteps the probability-correctness problem entirely by using execution feedback as reward rather than relying on model confidence. That framing is telling: if RL practitioners are already building around the assumption that internal likelihood is an unreliable training signal, this paper provides the empirical scaffolding that explains why. The two pieces together suggest a quiet consensus forming around the limits of sequence probability as a meaningful guide, whether for training or inference.

Watch whether any of the major inference frameworks (vLLM, TGI) incorporate calibration-aware decoding options within the next two release cycles. If they do, this line of research is being treated as actionable; if not, it stays academic.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models · Decoding Methods · Sequence Probability · Beam Search · Token-level Sampling

Read full story at arXiv cs.LG (arxiv.org)
