Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Thinking Machines Lab formalizes why LLMs produce different outputs even at temperature zero, introducing the concept of background temperature to quantify implementation-level nondeterminism from batch sizes, kernel variance, and floating-point arithmetic. The work proposes an empirical protocol to measure this hidden randomness across inference environments.

Modelwire context

Explainer

The practical implication buried in this work is that two deployments of the same model, at the same nominal temperature setting, can produce meaningfully different outputs depending on hardware, batch configuration, and floating-point implementation choices. Reproducibility claims in published benchmarks may therefore be quietly overstated.
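To make the floating-point part of that claim concrete, here is a minimal sketch of how the grouping and order of additions can change a result. It is a generic illustration of non-associative floating-point arithmetic, not code from the paper; the helper `sum_in_order` and the sample values are assumptions chosen for the demonstration.

```python
import math

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 1e20) - 1e20   # 0.1 is absorbed by 1e20 before the subtraction
right = 0.1 + (1e20 - 1e20)  # the large terms cancel first, so 0.1 survives


def sum_in_order(values, order):
    """Accumulate `values` one at a time, visiting indices in `order`.

    Hypothetical helper: it stands in for a reduction whose accumulation
    order depends on how the work was scheduled.
    """
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16, 1.0]
forward = sum_in_order(values, [0, 1, 2, 3])    # 1.0: the first 1.0 is rounded away
reordered = sum_in_order(values, [0, 2, 1, 3])  # 2.0: large terms cancel first
exact = math.fsum(values)                       # 2.0: correctly rounded reference sum

print(left, right)                # 0.0 0.1
print(forward, reordered, exact)  # 1.0 2.0 2.0
```

The same four numbers give 1.0 or 2.0 depending only on the order of accumulation. In a serving stack, a reduction order that changes with batch size or kernel choice could nudge two near-tied logits past each other in the same way, which is enough to change a greedy token choice even at temperature zero.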

This connects most directly to the watermarking paper covered the same day ('SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking'), which identified how token probability distributions interact with inference-time behavior in ways the field has underestimated. If background temperature varies across inference environments, watermark detection schemes that assume stable logit distributions face an additional confound that the SSG authors did not account for. More broadly, the hidden nondeterminism documented here is relevant to any evaluation pipeline that treats temperature-zero outputs as deterministic ground truth, which is a common assumption across nearly all the benchmark-reliant work in this archive.
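One simple way an evaluation pipeline could check the determinism assumption is to repeat the same prompt and count distinct outputs. The sketch below is an assumption-laden illustration, not the measurement protocol proposed in the paper: `repeat_agreement`, `stable_generate`, and `make_flaky_generate` are hypothetical names, and the stand-in generators would be replaced by a real temperature-zero call to the serving stack under test.

```python
from collections import Counter
from typing import Callable


def repeat_agreement(generate: Callable[[str], str], prompt: str, n_runs: int = 20) -> dict:
    """Call `generate` n_runs times on one prompt and summarise output variation.

    `generate` is a hypothetical callable wrapping a temperature-zero request.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    counts = Counter(generate(prompt) for _ in range(n_runs))
    modal_output, modal_count = counts.most_common(1)[0]
    return {
        "n_runs": n_runs,
        "distinct_outputs": len(counts),       # 1 means every run agreed
        "modal_output": modal_output,          # the most frequent completion
        "modal_share": modal_count / n_runs,   # fraction of runs matching it
    }


def stable_generate(prompt: str) -> str:
    """Stand-in for a fully deterministic backend."""
    return prompt.upper()


def make_flaky_generate(period: int = 4) -> Callable[[str], str]:
    """Stand-in for a backend whose output differs on every `period`-th call."""
    calls = 0

    def flaky(prompt: str) -> str:
        nonlocal calls
        calls += 1
        return prompt.upper() + ("!" if calls % period == 0 else "")

    return flaky


stable_report = repeat_agreement(stable_generate, "hello", n_runs=8)
flaky_report = repeat_agreement(make_flaky_generate(4), "hello", n_runs=8)
print(stable_report["distinct_outputs"], stable_report["modal_share"])  # 1 1.0
print(flaky_report["distinct_outputs"], flaky_report["modal_share"])    # 2 0.75
```

A report with more than one distinct output at temperature zero means the outputs cannot be treated as a fixed reference without further qualification. Exact string matching is a deliberately strict choice here; a real pipeline might compare token sequences or task-level scores instead.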

Watch whether major inference providers (Hugging Face, Together AI, Replicate) respond by publishing background temperature measurements for their own serving stacks within the next few months. If they do not, that silence itself signals how uncomfortable this finding is for reproducibility guarantees they have already made to enterprise customers.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Thinking Machines Lab

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.


Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.