Guides
September 3, 2024

Understanding Logprobs: What They Are and How to Use Them

Guest Post
Anita Kirkovska

LLMs are like smart text predictors. For every word or phrase they generate, they consider several possible next words and decide how likely each one is.

For example, if the model is trying to complete the sentence: “The best movie of all time is…” it might consider options like “The Godfather” or “Citizen Kane.” However, a choice like “Cats” would likely get a very low probability, close to 0%—not to judge, but the visual effects in that one were pretty rough!

When working with model outputs, particularly in machine learning and natural language processing, we often deal with probabilities, which indicate how likely an event (like predicting a word or a label) is to happen.

However, instead of using the actual probability percentages directly (like 10%), we use the logarithm of these probabilities. This is called the “log probability” or “logprob.”

For example, using base-10 logarithms, a logprob of “-1” corresponds to a probability of 10%, and a logprob of “-3” to just 0.1%; the more negative the logprob, the lower the probability. In practice, OpenAI’s API returns natural-log probabilities, where a logprob of -1 corresponds to roughly 37% and -3 to about 5%. Either way, logprobs are easier to work with in calculations than raw percentages.
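To convert a logprob back into a regular probability, you just exponentiate it. Here is a minimal sketch, assuming natural-log probabilities like the ones OpenAI’s API returns:

import math

# Convert natural-log probabilities back into regular probabilities
print(math.exp(-1))  # ~0.37, i.e. about 37%
print(math.exp(-3))  # ~0.05, i.e. about 5%
print(math.exp(0))   # 1.0, i.e. 100%: the model is certain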

GPT-3 Playground with the “Show probabilities” setting set to “Full spectrum”.


Why Use Logprobs in Machine Learning?

We use logprobs because they are faster and easier for computers to work with than raw probabilities, for two main reasons:

  1. It’s cheaper for computers to do addition than multiplication. Scoring a sequence of tokens is easier when you add the log probabilities of each token instead of multiplying their raw probabilities, and the sum doesn’t underflow to zero the way a long product of small probabilities can (see the short sketch after this list).
  2. Optimizing the log probability (logprob) is more effective than optimizing the probability itself: the gradient (the direction and rate of change) of the logprob tends to be smoother and more stable, which makes training easier.
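Here’s a small sketch of the first point, with made-up per-token probabilities: multiplying many small probabilities shrinks the result toward zero, while summing their logprobs keeps the math cheap and numerically stable.

import math

token_probs = [0.9, 0.05, 0.2, 0.6]  # hypothetical per-token probabilities

# Option 1: multiply the raw probabilities (many small factors shrink toward zero)
sequence_prob = 1.0
for p in token_probs:
    sequence_prob *= p

# Option 2: add the log probabilities, one cheap addition per token
sequence_logprob = sum(math.log(p) for p in token_probs)

print(sequence_prob)               # 0.0054
print(math.exp(sequence_logprob))  # same value, recovered from the logprob sum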

While logprobs have primarily been used by researchers to evaluate model performance, some providers like OpenAI now expose them in their APIs, allowing you to use this signal in your own LLM systems.

How Does OpenAI’s Logprobs Parameter Work?

OpenAI added the logprobs parameter to its Chat Completions API in 2023:

When logprobs is enabled, the API returns the log probabilities for each output token, along with a few of the most likely tokens and their log probabilities at each position. Here are the key parameters:

  • logprobs: If set to true, the API returns log probabilities for each token in the output. Note that this option isn’t available for the gpt-4-vision-preview model.
  • top_logprobs: A number between 0 and 5 that specifies how many of the most likely tokens to return at each position, along with their log probabilities. The logprobs parameter must be true to use this option.

The higher the log probability, the more likely the model considers that token to be the right choice in that context. This makes it easy to see how "confident" the model is in its output, and to inspect the other candidate responses it considered.
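Here’s a minimal sketch of a request with both parameters enabled, using OpenAI’s Python SDK; the model name and prompt are placeholders, not recommendations:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Complete the sentence: The best movie of all time is"}],
    max_tokens=5,
    logprobs=True,     # return a logprob for every output token
    top_logprobs=3,    # also return the 3 most likely alternatives at each position
)

# Each element describes one output token and the top candidates the model considered
for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
    for alt in token_info.top_logprobs:
        print("  candidate:", alt.token, alt.logprob)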

Learn more about it here.

But what can you do with it?

What Can You Do with OpenAI’s Logprob Parameter?

You can leverage OpenAI’s logprobs to improve your LLM features in several ways, especially for tasks like classification, autocomplete, retrieval evaluation, and minimizing hallucinations. You can also use it in production as a moderation tool.

Let’s look at some examples of how our customers use it when developing their AI features:

Evaluating Classification

Sometimes, we use LLMs to classify content. By default, the models pick the token with the highest probability. However, we can use the logprobs parameter to check if the model’s response meets a specific logprob threshold.

Let’s say that we want to classify a user query into three categories: “Product Info”, “Pricing”, and “Need to Talk with an Agent”.

Let’s say that this is our prompt:

CLASSIFICATION_PROMPT = """
You will be given a user query.
Classify the query into one of the following categories: Product Info, Pricing, or Need to Talk with an Agent.
Return only the name of the category, and nothing else.
Ensure your output is one of the three categories mentioned.
User query: {query}
"""

And let’s say that these are some of our user questions:

user_queries = [
    "Can you provide details on the new smartphone model’s photo-editing features?",
    "What is the cost of the new smartphone model with advanced photo-editing features?",
    "How can I track my delivery status in real-time?",
]

If we run these queries through our LLM, it’s clear that the first one falls under “Product Info”, the second under “Pricing”, and the third could potentially be labeled as “Product Info”, but with a lower probability.

It’s easy to spot these simple examples just by looking at them, but if we want to scale this approach, we can use the logprobs parameter in the API to check whether a given classification meets a specific confidence threshold.

Let’s see how GPT-4 Turbo will classify these queries:

For example, if the model classifies the query "How can I track my delivery status in real-time?" with less than 100% probability, we can automatically route it to our "Talk with an Agent" branch, or expand our categories to include options like "Delivery and Tracking" for more accurate classification.
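Here’s a minimal sketch of that threshold check, reusing the CLASSIFICATION_PROMPT and user_queries from above; the 0.95 threshold and the fallback route are assumptions you’d tune for your own system:

import math
from openai import OpenAI

client = OpenAI()
CONFIDENCE_THRESHOLD = 0.95  # assumed threshold, tune for your use case

for query in user_queries:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(query=query)}],
        logprobs=True,
        max_tokens=5,
    )
    choice = response.choices[0]
    category = choice.message.content.strip()

    # Confidence in the whole label: sum the token logprobs, then exponentiate
    label_logprob = sum(t.logprob for t in choice.logprobs.content)
    confidence = math.exp(label_logprob)

    # Low-confidence classifications get routed to a human
    if confidence < CONFIDENCE_THRESHOLD:
        category = "Need to Talk with an Agent"

    print(f"{query} -> {category} ({confidence:.2%})")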

Detecting RAG Hallucinations

In RAG-based systems, we usually pull context into our prompts dynamically to reduce hallucinations and ground the model in our knowledge base. But even with this context, the model can still hallucinate if the answer isn’t actually in the retrieved documents.

This is because these models are built to always give an answer, even when they don’t have the right information.

You can use logprobs as a filter to evaluate retrieval accuracy. By setting a threshold, you ensure that only responses whose probability is close to 100% (a logprob close to 0) are considered reliable. If the probability is lower, it’s a signal that the answer may not be present in the retrieved documents.
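One way to implement this check, sketched below with a made-up self-check prompt and threshold, is to ask the model a yes/no question about whether the retrieved documents actually contain the answer, then look at the probability of that single token:

import math
from openai import OpenAI

client = OpenAI()

# Hypothetical self-check prompt; `context` and `question` come from your RAG pipeline
SELF_CHECK_PROMPT = """Retrieved documents:
{context}

Question: {question}

Do the retrieved documents contain the information needed to answer the question?
Answer with a single word: yes or no."""

def answer_is_grounded(context: str, question: str, threshold: float = 0.98) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": SELF_CHECK_PROMPT.format(context=context, question=question)}],
        logprobs=True,
        max_tokens=1,
    )
    token_info = response.choices[0].logprobs.content[0]
    # Only trust the answer when the model says "yes" with near-100% probability
    return token_info.token.strip().lower() == "yes" and math.exp(token_info.logprob) >= threshold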

Building an Autocomplete Engine

You can use logprobs to improve autocomplete suggestions as a user is typing. By setting a high confidence threshold, you can ensure that only the most likely and accurate suggestions are shown, avoiding less certain or irrelevant options.

This makes the autocomplete experience more reliable and helpful.
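A minimal sketch of that confidence gate; the prompt wording, model, and 0.9 threshold are assumptions for illustration:

import math
from openai import OpenAI

client = OpenAI()

def suggest_completion(prefix: str, threshold: float = 0.9) -> str | None:
    """Return a short continuation only if the model is confident in it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": f"Continue this text naturally: {prefix}"}],
        logprobs=True,
        max_tokens=3,
    )
    choice = response.choices[0]
    # Joint probability of the suggested tokens
    confidence = math.exp(sum(t.logprob for t in choice.logprobs.content))
    return choice.message.content if confidence >= threshold else None

print(suggest_completion("My favorite programming language is"))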

Moderation Filters

Logprobs can also help us screen responses to avoid rude, offensive, or harmful content. By building an LLM evaluator, we can classify incoming queries and automatically block the ones it flags as harmful with near-100% confidence.
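This follows the same pattern as the classification threshold above; here’s a short sketch with a hypothetical moderation prompt and an assumed 0.99 threshold:

import math
from openai import OpenAI

client = OpenAI()

# Hypothetical moderation prompt
MODERATION_PROMPT = """Is the following user message rude, offensive, or harmful?
Answer with a single word: yes or no.
Message: {message}"""

def should_block(message: str, threshold: float = 0.99) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": MODERATION_PROMPT.format(message=message)}],
        logprobs=True,
        max_tokens=1,
    )
    token_info = response.choices[0].logprobs.content[0]
    # Block only when the evaluator answers "yes" with near-100% probability
    return token_info.token.strip().lower() == "yes" and math.exp(token_info.logprob) >= threshold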

Token Healing

LLMs use tokens to process and generate text, which can sometimes lead to issues with how prompts are handled.

For example, if the model is unsure how to finish a given URL in a prompt, logprobs reveal which tokens it thinks are likely, helping you tweak the prompt to get better results.

Here’s a simple example:

If your prompt is The link is <a href="http:, and the model struggles, logprobs can show which completions it’s considering. If the logprobs suggest the model isn’t sure about finishing the URL, you might adjust the prompt to The link is <a href="http, which could make it more likely to generate a complete URL correctly.

Why is this the case?

When you end a prompt with “http:”, the model might not complete it correctly. During training it usually saw “://” as a single token following “http”, so the trailing “:” creates a token split it rarely encountered, and it doesn’t reliably continue with “//”. If you end the prompt with just “http” instead, the model generates URLs as expected because it never hits that awkward split.
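You can see this for yourself by comparing the top candidate tokens for the two prompt endings. The sketch below uses the legacy Completions endpoint because it accepts a raw text prefix; the model choice is an assumption:

from openai import OpenAI

client = OpenAI()

for prompt in ['The link is <a href="http:', 'The link is <a href="http']:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # assumed model that still serves the Completions endpoint
        prompt=prompt,
        max_tokens=1,
        logprobs=5,  # return the 5 most likely next tokens and their logprobs
    )
    print(prompt)
    # top_logprobs[0] maps each candidate first token to its logprob
    print(response.choices[0].logprobs.top_logprobs[0])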

Using Logprobs to Improve LLM Features

Logprobs are handy during prototyping for spotting issues like hallucinations and capturing problematic token generation. They can also help build a solid classifier. In production, logprobs serve as a moderation tool, allowing you to easily isolate and address problematic prompts.

Using the logprob parameter can streamline your work and provide better structure than just tweaking prompts. Many of our customers are finding it valuable for building more reliable systems right from the start.

A good AI development platform can help with this experimentation.

With Vellum’s Prompt Playground, you can easily adjust your prompts and compare how different scenarios play out, whether you use the logprobs feature or not.

If you’re interested in comparing multiple prompts, with or without the logprob parameter — get in contact here.

Anita Kirkovska

Founding GenAI Growth at Vellum

Anita Kirkovska is currently leading Growth and Content Marketing at Vellum. She is a technical marketer with an engineering background and a sharp acumen for scaling startups. She has helped SaaS startups scale and had a successful exit from an ML company. Anita writes extensively about generative AI to educate business founders on best practices in the field.
