The JSON mode parameter ensures that supported models will always return valid JSON as a response to your prompt.
LLMs are great at handling complex language tasks, but their responses are often unstructured, which can be frustrating for developers who prefer structured data. To extract information from these unstructured outputs you need regular expressions or extra prompt engineering, which slows down development.
So, if you enable JSON mode for supported models, such as OpenAI's GPT models and Google's Gemini, the model will consistently return its output as a structured JSON object.
To turn on JSON mode with the Chat Completions or Assistants API, set the response_format parameter to { "type": "json_object" }. If you are using function calling, JSON mode is always turned on.
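As a minimal sketch with the official OpenAI Python SDK (the model name and prompt below are placeholders, not requirements):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any JSON-mode-capable model works here
    response_format={"type": "json_object"},
    messages=[
        # The prompt should mention JSON explicitly, or the API may reject the request.
        {"role": "system", "content": "You extract contact details and reply in JSON."},
        {"role": "user", "content": "Jane Doe can be reached at jane@example.com."},
    ],
)

print(response.choices[0].message.content)
# e.g. {"name": "Jane Doe", "email": "jane@example.com"}
```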
Important notes:
To enable your Gemini models to output valid JSON responses, you can supply a schema to the model:
Read more here.
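Here is a sketch of what that can look like with the google-generativeai Python package, assuming a recent SDK version that supports response schemas; the model name and the Recipe schema are illustrative assumptions:

```python
import typing
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder key

# A hypothetical schema: we ask the model to return a list of recipe objects.
class Recipe(typing.TypedDict):
    recipe_name: str
    ingredients: list[str]

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "List two popular cookie recipes.",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # ask for JSON output
        response_schema=list[Recipe],           # constrain it to this schema
    ),
)

print(response.text)  # a JSON string matching the schema
```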
For chat completions you can skip setting the max_tokens parameter, and the model will automatically use whatever is left of the context length for its output.
However, there are times when you'll want to limit the length of the output. In those cases, it's important to have a reliable way to measure how long the input prompt is, so you can prevent the output from getting cut off (see the sketch after the list below). There are two common scenarios where you'd use the max_tokens parameter to control the length of the output:
You can set a lower token count when you want your chatbot to answer in a shorter, more conversational manner.
You can set a lower token count in cases where you want to prevent the model from continuing its output endlessly, especially if you’re working with high temperature settings that encourage creativity but can lead to verbose responses.
You can also limit the size of the output to speed up responses for real-time features in your app.
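As a rough sketch of the measuring step, you could count the prompt's tokens with the tiktoken library and derive a safe max_tokens from the model's context window; the model name, context size, and reply budget below are assumptions, not fixed values:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()

MODEL = "gpt-4o-mini"     # placeholder model
CONTEXT_WINDOW = 128_000  # assumed context length for this model
REPLY_BUDGET = 300        # cap replies at ~300 tokens for a conversational tone

prompt = "Summarize the plot of Hamlet in two sentences."

# Count roughly how many tokens the prompt will consume.
try:
    encoding = tiktoken.encoding_for_model(MODEL)
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")  # reasonable fallback
prompt_tokens = len(encoding.encode(prompt))

# Never ask for more output than the context window can still hold.
max_tokens = min(REPLY_BUDGET, CONTEXT_WINDOW - prompt_tokens)

response = client.chat.completions.create(
    model=MODEL,
    max_tokens=max_tokens,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The exact count is slightly higher in practice because chat messages add a few tokens of formatting overhead, but this approach is enough to keep the output from getting cut off.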