Even with temperature set to zero, LLMs may not produce identical answers, because non-determinism arises from the inference process itself rather than from sampling randomness. The usual explanation blames GPU concurrency combined with floating-point non-associativity: rounding depends on the order of operations. Research from Thinking Machines Lab identifies a deeper cause, a lack of batch invariance. Inference servers group concurrent requests into batches, and kernels switch execution strategies depending on the size and composition of each batch. These strategy changes subtly alter numerical results, and the small differences compound token by token into divergent outputs. The effect is most visible in matrix multiplication, attention, and RMSNorm. The fix is to enforce batch-invariant kernels, which makes results reproducible at some cost in throughput.
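To make the floating-point side of this concrete, here is a minimal sketch (assuming PyTorch is installed; the matrix sizes and bfloat16 dtype are illustrative, and the batch effect is typically only visible on a GPU, where different batch sizes can take different kernel paths). It shows that addition order changes rounded results, and that the same row can come out differently when computed alone versus inside a larger batch:

```python
import torch

# Floating-point addition is not associative, so the order in which a
# kernel accumulates partial sums changes the rounded result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0

# Batch-size sensitivity: the same row, multiplied alone vs. as part of
# a larger batch, may be computed with different reduction strategies.
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2048, 2048, dtype=torch.bfloat16, device=device)
B = torch.randn(2048, 2048, dtype=torch.bfloat16, device=device)

row_in_batch = (A @ B)[:1]  # computed as one row of a 2048-row batch
row_alone = A[:1] @ B       # computed as a batch of one
print((row_in_batch - row_alone).abs().max())
# Often nonzero on a GPU; with batch-invariant kernels it would be 0.
```

Batch-invariant kernels close exactly this gap: each row's reduction order is fixed regardless of how many other requests happen to share its batch.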
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/