The `max_tokens` parameter specifies the maximum number of tokens that can be generated in a chat completion. Keep in mind that the token count of your prompt plus `max_tokens` cannot exceed the model's context length.
Every model has a fixed context length: GPT-4o's is 128,000 tokens and Claude 3.5 Sonnet's is 200,000 tokens. The prompt and the response together can never exceed this number.
Each of these models also has a maximum output size, roughly between 4,096 and 16,384 tokens depending on the model, so a single response can't be longer than that cap regardless of how much context remains.
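To see how the two limits interact, here is a rough sketch of the arithmetic in Python. The numbers are illustrative (GPT-4o's published limits), and `prompt_tokens` is just a placeholder value:

```python
# Illustrative token budget for a single request (GPT-4o's published limits).
CONTEXT_LENGTH = 128_000     # total context window: prompt + output
MAX_OUTPUT_TOKENS = 16_384   # hard cap on how long a single response can be

prompt_tokens = 1_500        # however many tokens your prompt happens to use

# The response is bounded by both limits, whichever is smaller.
output_budget = min(MAX_OUTPUT_TOKENS, CONTEXT_LENGTH - prompt_tokens)
print(output_budget)         # 16384 here, since the prompt is small
```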
However, you can use the `max_tokens` parameter to cap the output below the model's default maximum output size. To set it, simply pass the maximum number of tokens you'd like to see in the model's response.
For chat completions you can skip this parameter entirely, and the model will automatically use whatever is left of the context length after your prompt, up to its maximum output size.
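As a minimal sketch, here is how the parameter looks in a chat completion request with the OpenAI Python SDK, assuming `gpt-4o` and an API key in your environment. Drop the `max_tokens` line and you get the default behavior described above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain what a context window is."}],
    max_tokens=200,  # cap the reply at 200 tokens; omit this line to use the default
)

print(response.choices[0].message.content)
```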
However, there are times when you'll want to use `max_tokens` to limit the length of the output. In those cases it's important to have a reliable way to measure how many tokens your input prompt will use, so the response doesn't get cut off mid-sentence (see the sketch after the list below). A few common scenarios:
- You can set a lower token count when you want your chatbot to answer in a shorter, conversational manner.
- You can set a lower token count to prevent the model from continuing its output endlessly, especially if you're working with high temperature settings that encourage creativity but can lead to verbose responses.
- You can limit the size of the output to reduce latency for real-time features in your app, since shorter responses finish generating sooner.
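One way to measure the prompt side of the budget is to count tokens locally before sending the request. Here is a sketch using `tiktoken`, assuming a version recent enough to know GPT-4o's encoding; note that chat message formatting adds a few extra tokens per message on top of this count:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Return an approximate token count for `text` under the given model's encoding."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

prompt = "Explain what a context window is."
prompt_tokens = count_tokens(prompt)

# Budget the output so prompt + max_tokens stays inside the context window
# and under the model's output cap.
CONTEXT_LENGTH = 128_000
MAX_OUTPUT_TOKENS = 16_384
max_tokens = min(MAX_OUTPUT_TOKENS, CONTEXT_LENGTH - prompt_tokens)
```

With the prompt measured this way, you can pass the resulting `max_tokens` value into the chat completion call shown earlier and be confident the response won't be truncated by the context limit.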