The temperature
parameter controls the randomness of the generated text. Adjusting the temperature changes how the model selects the next word in a sequence, influencing the creativity and predictability of the output. Low temperatures render outputs that are predictable and repetitive. Conversely, high temperatures encourage LLMs to produce more random, creative responses.
Temperature is one of the most important controls for developers. Some AI-powered applications need predictable outputs; others needs varying, chaotic results to deliver on a creativity-oriented product. Using temperature, developers can fine-tune LLMs to produce the ideal results.
Temperature is a widely-supported and popular feature to tweak outputs. It is distinct from other controls like Top P and Top K. Those controls filter which tokens should be considered based on their raw likelihoods. Temperature instead determines how an LLM will leverage those likelihoods.
LLMs work in piecemeal fashion, iteratively generating tokens. Tokens are the smallest unit of LLMs. They either represent short words (e.g. the
) or parts of words (e.g. bal
and loon
comprises balloon
). For every subsequent token, the possible options are assigned likelihoods.
Before generating each token, an LLM will consider different options. These options have weighted likelihoods. For example, imagine an output: I went to the zoo and saw
.
For the next token, an LLM might consider likely options like lions (0.2)
, hipp (0.13)
(starting token for hippo), and croc (0.11)
(crocodile). It’ll also consider less likely options like meer (0.004)
(meerkat), mice (0.003)
, and squid (0.003)
.
Temperature regulates how an LLM weights these likelihoods. A moderate or default temperature will treat them at face value. A low temperature will optimize for higher likelihoods—increasing the net probability of lion
or hipp
. A high temperature will do the opposite, boosting the odds of meer
and mice
.
This control happens through a process called SoftMax, which helps the model decide which word is the best fit based on probability.
When the model is deciding what words (or tokens) to pick next, it looks at raw scores called **logits.
**The SoftMax function function then converts this set of raw scores (logits) into probabilities. It basically takes a vector of numbers and converts them into a probability distribution, where the sum of all probabilities is equal to 1.
This allows the model to make decisions based on the relative likelihood of different token options.
Let’s look at an example:
Imagine an LLM is trying to predict the next word in a sentence. The model might assign the following logits to three possible words:
“cat” = 2.0
“dog” = 1.5
“fish” = 0.5
The SoftMax function converts these into probabilities:
“cat” = 0.57
“dog” = 0.31
“fish” = 0.12
The model is most likely to choose “cat” because it has the highest probability.
So with the temperature parameter this translates to:
Low temperature: Makes the SoftMax distribution sharper, favoring the highest probability words more strongly (less randomness).
High temperature: Flattens the distribution, making less likely words more probable (more randomness).
It is fairly easy to set temperature on modern LLM APIs.
You can set temperature on OpenAI’s Chat Completions API with the optional temperature
parameter. temperature
ranges from 0.0
to 2.0
, and defaults to 1.0
.
You can set temperature on Anthropic’s Messages API with the optional temperature
parameter. temperature
ranges from 0.0
to 1.0
, and defaults to 1.0
. Notably, Anthropic doesn’t allow temperature to pass 1.0
.
You can set temperature on Gemini’s API with the optional temperature
parameter. temperature
ranges from 0.0
to 2.0
, and defaults to 1.0
.
0
result in non-determinism?A common case of confusion is if a temperature of 0
generates non-deterministic replies. In theory, yes. In practice, no.
As noted by this OpenAI Forum Thread, achieving non-determinism is impossible. A temperature of 0
does force the SoftMax function to choose the most likely response—which is the definition of greedy sampling and is non-deterministic. However, LLMs are not run in a vacuum; race conditions of multi-threaded code impacts the established likelihoods of tokens. Consequently, while temperature reduces randomness to a minimum, it doesn’t eliminate them.
However, the randomness is minimized to the extent that developers can expect near non-determinism. For most queries that specify the structure of the expected output, this reduction in randomness is sufficient.
There are three primary types of temperature settings.
1.0
)On most LLMs, a low temperature is a temperature below 1.0
. This will result in more robotic text with significantly less variance. This is ideal for applications that prioritize predictability.
1.0
)A temperature of 1.0
serves as a benchmark of average randomness and creativity. This is the default setting on most LLMs, and many applications will consequently use a temperature of 1.0
.
1.0
)A temperature above 1.0
increases the “creativity” of a model by adding more randomness to the outputs. There are always significantly more low likelihood tokens than high likelihood tokens; therefore, by tippin
Experimenting with temperature is relatively straightforward. First, understand the default temperature (typically a temperature of 1.0
). Then, determine if you want to introduce more consistency or chaos.
If you chose a temperature close to 0.0
, you’ll get very predictable responses. Conversely, if you choose a temperature closer to 2.0
, you’ll see the LLM ramble like Shakespeare.
An extremely high temperature is rarely useful in production, unless the product is trying to deliver on esoteric responses. However, slightly higher temperatures than 1.0
can lead to more elevated or complex thinking, which is sometimes necessary. Conversely, for applications that need consistent responses for the same prompts, a temperature at or close to 0.0
is entirely appropriate.
Temperature is one of the most common tweaked settings when programmatically interfacing with an LLM.
There are some scenarios where temperature is particularly helpful:
There are some other settings that provide similar, but distinct features to temperature. These could serve as good alternatives in niche scenarios.
Top P, also known as nuclear sampling, filters the tokens that should be considered for each iteration. Top P defines the probabilistic sum that the chosen token’s likelihoods should add up to. Notably, Top P is not a percentage of tokens; a Top P of 10% defines the minimum quantity of tokens needed to sum to 10% of the net likelihoods.
Top P can be mixed with temperature, but that is typically not advisable. While temperature is typically used over Top P, Top P is useful when token options aren’t as long-tailed.
Top K is similar to Top P, but instead defines a quantity of the most probable tokens that should be considered. For example, a Top K of 3
would instruct the LLM to only consider the three most likely tokens. A Top K of 1
would force the LLM to only consider the most likely token.
A low Top K is similar to a low temperature. However, Top K is a more crude metric because it doesn’t account for the relative probabilities between the options. It’s also not as well-supported, notably missing from OpenAI’s API.