
'Background Temperature' Concept Reveals Hidden Randomness in Large Language Models

💡 Researchers introduce the concept of 'background temperature' to formally quantify why large language models still produce inconsistent outputs at temperature T=0. The hidden randomness comes from implementation factors such as floating-point arithmetic and batch size, and the concept opens new directions for model reproducibility research.

Introduction: Temperature at Zero, Yet Outputs Remain Uncertain?

When using large language models (LLMs), many developers assume that setting the decoding temperature to T=0 makes the model produce identical outputs for identical inputs. In practice, even under this most "deterministic" setting, outputs can still diverge between runs. A paper newly published on arXiv (arXiv:2604.22411v1) places this long-overlooked issue in a formal framework. The researchers introduce the concept of "Background Temperature" (T_bg) to quantify and characterize the hidden sources of randomness in large language models.

Core Concept: What Is 'Background Temperature'?

Background temperature is a formalized metric proposed by researchers to describe the equivalent randomness that actually exists in LLMs under ostensibly deterministic decoding strategies. In short, even when users explicitly set the sampling temperature to zero, an effective non-zero "temperature" T_bg still exists within the model system, originating from multiple non-deterministic factors in the underlying computational implementation.

The paper cites recent research from Thinking Machines Lab, summarizing three key sources of this hidden randomness:

  • Batch-size Variation: Different inference batch sizes lead to divergent computation paths on GPUs, thereby affecting final numerical results. Even with completely identical inputs, changing the batch size can produce different outputs.

  • Kernel Non-invariance: The underlying CUDA compute kernels on GPUs may select different execution strategies under different hardware configurations, driver versions, or even different invocation timings, causing subtle deviations in computation results.

  • Floating-point Non-associativity: This is the most fundamental cause, at the level of the arithmetic itself. Floating-point addition is not associative: (a+b)+c does not necessarily equal a+(b+c). In large-scale parallel computing, different summation orders accumulate rounding errors differently, which can flip the ranking of nearly tied candidates at the top of the softmax distribution.
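
The rank flip described in the last bullet can be reproduced in a few lines of plain Python. This is a minimal sketch, not code from the paper: the token names and per-term contributions are made-up values chosen so that the two logits are nearly tied, and `accumulate` and `greedy_token` are hypothetical helpers. Reversing the summation order stands in for a different parallel reduction schedule.

```python
def accumulate(terms):
    """Sum terms strictly left to right, as one reduction order would."""
    total = 0.0
    for term in terms:
        total += term
    return total


def greedy_token(contributions, reverse=False):
    """Return the argmax token after summing each token's contributions.

    `reverse` flips the summation order, standing in for a different
    parallel reduction schedule on the hardware.
    """
    logits = {}
    for token, terms in contributions.items():
        ordered = list(reversed(terms)) if reverse else list(terms)
        logits[token] = accumulate(ordered)
    return max(logits, key=logits.get), logits


# Hypothetical, nearly tied contributions (illustrative values only).
contributions = {
    "A": [0.1, 0.2, 0.3],
    "B": [0.3, 0.2, 0.1],
}

# Floating-point addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

forward_token, forward_logits = greedy_token(contributions)
reverse_token, reverse_logits = greedy_token(contributions, reverse=True)
print(forward_token, forward_logits)
print(reverse_token, reverse_logits)
```

Both tokens sum to 0.6 in exact arithmetic, yet the left-to-right order selects A and the reversed order selects B. In greedy decoding, one such flip changes the chosen token, and every later token is then conditioned on a different prefix.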

Background temperature T_bg is an aggregate measure of all of the non-deterministic factors above. When T_bg approaches zero, the system behaves almost deterministically; when T_bg is high, hidden randomness is significant and output reproducibility drops substantially.
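
As a rough way to see whether a deployment sits near the deterministic end of this scale, repeated T=0 outputs for one prompt can be compared directly. The sketch below is an assumption for illustration, not the paper's definition of T_bg: `divergence_rate` and `output_entropy_bits` are hypothetical helper names, and the sample completions are invented.

```python
import math
from collections import Counter


def divergence_rate(outputs):
    """Fraction of runs whose output differs from the most common output."""
    counts = Counter(outputs)
    top_count = counts.most_common(1)[0][1]
    return (len(outputs) - top_count) / len(outputs)


def output_entropy_bits(outputs):
    """Shannon entropy (in bits) of the empirical output distribution."""
    counts = Counter(outputs)
    total = len(outputs)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical completions from 10 repeated T=0 calls with the same prompt.
runs = ["Paris"] * 9 + ["Paris."]

print(divergence_rate(runs))
print(output_entropy_bits(runs))
print(divergence_rate(["Paris"] * 10))
```

Both quantities are zero when every run agrees and grow as outputs spread out, which makes them a convenient first check before applying any formal estimator.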

In-Depth Analysis: Why Is This Concept So Important?

Direct Impact on Engineering Practice

In production environments, output reproducibility is a foundational requirement for many critical applications. Scenarios such as medical diagnostic assistance, legal document generation, and financial risk analysis all demand that models deliver consistent responses for identical inputs. However, the existence of background temperature means that merely setting T=0 does not guarantee deterministic output. Engineering teams need to recognize that true reproducibility requires control at deeper levels of the computing environment, including pinning GPU models, locking CUDA versions, fixing batch sizes, and enabling deterministic computation modes (such as PyTorch's torch.use_deterministic_algorithms setting).
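
The deterministic-mode settings mentioned above might be applied roughly as follows. This is a hedged sketch rather than a complete recipe: `enable_determinism` is a hypothetical helper, the seed value is arbitrary, and the PyTorch import is guarded so the snippet still runs where PyTorch is not installed.

```python
import os
import random


def enable_determinism(seed=0):
    """Apply common reproducibility settings.

    Returns True if PyTorch was found and configured, False otherwise.
    """
    # cuBLAS needs this set before CUDA work starts for deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; only the stdlib settings were applied.
        return False
    torch.manual_seed(seed)
    # Raise an error on ops that lack a deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Stop cuDNN from autotuning, which can pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    return True


torch_configured = enable_determinism(seed=1234)
print("PyTorch configured:", torch_configured)
```

Settings like these target run-to-run variation on a single machine and software stack. They should not be expected, on their own, to remove differences that arise from changing batch sizes or hardware.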

Far-Reaching Impact on Model Evaluation

In academic research, model evaluation typically relies on greedy decoding at T=0 to obtain "standard answers." But if background temperature is non-negligible, then the same benchmark question may produce different answers across different runtime environments, causing fluctuations in evaluation scores. This poses a serious challenge to the fairness and comparability of leaderboards. Researchers may need to explicitly report hardware environments in evaluation protocols, or even perform multiple sampling runs and use statistical results to obtain truly reliable performance assessments.
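
Reporting a spread instead of a single number is straightforward once repeated runs are available. The following is a minimal sketch under assumed data: the accuracy values are invented, and `summarize_runs` is a hypothetical helper, not part of any evaluation harness.

```python
import statistics


def summarize_runs(scores):
    """Return mean, sample standard deviation, and range of repeated scores."""
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical accuracies from repeated T=0 runs of one benchmark.
accuracies = [0.712, 0.708, 0.715, 0.710, 0.709]

summary = summarize_runs(accuracies)
print(f"{summary['mean']:.4f} ± {summary['stdev']:.4f} over {summary['runs']} runs")
```

Publishing the run count, the spread, and the hardware environment alongside the mean lets readers judge whether a gap between two models exceeds the run-to-run noise.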

Complementing Theoretical Understanding

From a theoretical perspective, the background temperature concept elegantly elevates an implementation-level engineering problem into a mathematically formalizable object. This allows researchers to combine the explicitly set sampling temperature T with the implicitly existing background temperature T_bg within a unified temperature parameter framework, building more precise output distribution models. For example, the model's actual effective temperature can be expressed as T_eff = T + T_bg, providing a more complete theoretical tool for understanding and predicting LLM behavior.
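
To make the additive form concrete, the sketch below evaluates a softmax at T_eff = T + T_bg. It is an illustration under stated assumptions: the logits and the T_bg value of 0.01 are invented, `softmax_with_background` is a hypothetical helper, and the paper's own estimator is not reproduced here.

```python
import math


def softmax_with_background(logits, temperature, background_temperature):
    """Softmax over logits at T_eff = T + T_bg (the additive form above)."""
    effective = temperature + background_temperature
    if effective <= 0:
        # Truly deterministic limit: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / effective for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical logits with two nearly tied candidates.
logits = [2.00, 1.99, 0.50]

print(softmax_with_background(logits, 0.0, 0.0))
print(softmax_with_background(logits, 0.0, 0.01))
```

With T=0 and T_bg=0, all probability sits on the first token. With the same T=0 and a small assumed T_bg, the runner-up receives a noticeable share, which matches the intuition that near-ties are where hidden randomness shows up.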

Industry Response and Existing Countermeasures

In fact, the industry has been paying attention to LLM non-determinism for some time. OpenAI acknowledges in its API documentation that even temperature=0 cannot fully guarantee output consistency, and it introduced the seed parameter to improve reproducibility on a best-effort basis. NVIDIA also provides deterministic mode options in its TensorRT-LLM inference framework, though often at the cost of inference speed.

The contribution of this paper lies not in simply cataloging these engineering phenomena but in providing a unified theoretical perspective. Through the abstract concept of background temperature, non-determinism from different sources is brought into the same analytical framework, facilitating systematic assessment and mitigation.

Outlook: Toward Truly Controllable AI Systems

The background temperature concept marks a shift in how the AI community understands large-model reproducibility, from "known unknowns" to "quantifiable knowns." Going forward, developments can be expected in several directions:

First, the establishment of standardized measurement methods. The research community may develop standardized T_bg measurement tools and benchmarks to help developers quickly assess background temperature levels in specific deployment environments.

Second, optimization at the hardware and framework level. Chip manufacturers and deep learning framework developers may place greater emphasis on supporting deterministic computation, reducing background temperature while maintaining high performance.

Finally, adaptive strategies at the application level. In scenarios where background temperature is non-negligible, application layers can employ strategies such as multiple sampling with voting and consistency checks to hedge against the effects of hidden randomness, improving output reliability at the cost of additional inference calls.
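
The voting and consistency-check strategy mentioned above could look like the following. This is a minimal sketch: `vote` and its `min_agreement` threshold are hypothetical names and values, and the sample outputs are invented.

```python
from collections import Counter


def vote(samples, min_agreement=0.6):
    """Majority vote over repeated outputs with a simple consistency check.

    Returns (answer, agreement); answer is None when the most common
    output falls below `min_agreement`.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / len(samples)
    if agreement < min_agreement:
        return None, agreement
    return answer, agreement


# Hypothetical outputs from five calls with the same prompt.
print(vote(["approve", "approve", "approve", "reject", "approve"]))
print(vote(["approve", "reject", "escalate", "reject", "approve"]))
```

Returning None when agreement is low lets the application route inconsistent cases to a fallback, such as a retry or human review, rather than silently accepting a coin-flip answer.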

Although the paper is relatively brief, the conceptual framework it proposes has broad implications. As large models move deeper into critical decision-making domains, understanding and quantifying this "invisible randomness" is an essential step toward building truly trustworthy AI systems.