How to use Prompt Caching?

Supported by: Anthropic

What is Prompt Caching?

Prompt Caching lets you cache frequently used context between API calls, resulting in shorter response times and lower processing costs.

Anthropic says that you can reduce latency by >2x and costs up to 90%. With that in mind, this feature is particularly useful when dealing with frequently asked questions or when using the same or similar prompts multiple times across different sessions or users. It's currently available for the Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku models.

How does Prompt Caching work?

When an LLM processes a prompt, it generates internal representations called attention states, which help the model understand relationships between different parts of the input.

Traditionally, these attention states are recalculated every time the model processes a prompt, even if the input is similar or repeated, which is time-consuming and costly. Prompt caching addresses this by storing previously computed attention states so the model can reuse them for repeated prompts, speeding up responses and reducing costs.

Check the image below for a visual explanation:

Visual explaining how prompt caching works
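
If it helps to see the idea in code, here is a deliberately simplified sketch of the "store once, reuse on repeat" pattern. It is just a plain Python memo keyed by a hash of the prompt prefix, purely for illustration, and not how Anthropic actually implements caching:

import hashlib

_cache = {}  # maps a hash of the prompt prefix -> precomputed state

def expensive_prefill(prefix: str) -> str:
    # Stand-in for the costly attention-state computation.
    return f"state-for-{len(prefix)}-chars"

def get_prefix_state(prefix: str) -> str:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in _cache:                  # cache hit: reuse the stored state
        return _cache[key]
    state = expensive_prefill(prefix)  # cache miss: compute it once...
    _cache[key] = state                # ...and store it for next time
    return state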

Using Prompt Caching with Anthropic models

With Anthropic, every time you send an API request with prompt caching enabled, the system will:

1/ Check whether the prompt prefix has already been cached by a previous request

2/ If it has, use the cached version instead of reprocessing it

3/ If not, process the full prompt and cache the prefix for future use

💡 A few important things to keep in mind:

1/ The cache has a 5-minute lifetime, which is refreshed each time the cached content is used.

2/ You can define up to 4 cache breakpoints in your prompt.

3/ Prompt Caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.

4/ Currently, there’s no way to manually clear the cache. Cached prefixes automatically expire after 5 minutes of inactivity.

5/ Only one type of caching is available, "ephemeral".

6/ To monitor the effectiveness of caching, check the cache_creation_input_tokens and cache_read_input_tokens fields in the API response (see the sketch after this list).

7/ Your prompts need to meet a minimum token count to be cached: 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2048 tokens for Claude 3 Haiku.
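
Here is a minimal sketch of points 1-3 and 6 in practice, using the Python SDK's prompt-caching beta namespace (introduced in the next sections). long_document is a placeholder for your own content and must exceed the minimum token count:

import anthropic

client = anthropic.Anthropic()

long_document = "..."  # placeholder: your own long context, above the minimum token count

def ask(question):
    return client.beta.prompt_caching.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": long_document,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": question}]
    )

first = ask("Summarize the document.")  # expected: cache write
second = ask("List the key dates.")     # expected: cache read

print(first.usage.cache_creation_input_tokens, first.usage.cache_read_input_tokens)
print(second.usage.cache_creation_input_tokens, second.usage.cache_read_input_tokens)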

How do I enable Prompt Caching?

To enable this feature, you’ll need to include the anthropic-beta: prompt-caching-2024-07-31 header in your API requests.
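
For example, if you call the Messages API over raw HTTP, the header is passed like this (a sketch assuming the requests library and an ANTHROPIC_API_KEY environment variable); the Python SDK's prompt-caching beta namespace used in the examples below is expected to handle this header for you:

import os
import requests

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "prompt-caching-2024-07-31",  # enables the prompt caching beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 256,
        "system": [
            {
                "type": "text",
                "text": "...long system prompt",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        "messages": [{"role": "user", "content": "Hello"}]
    },
)
print(response.json())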

How can I use Prompt Caching?

Single-turn conversations

To make a cached API call in a single-turn conversation, all you need to do is add the "cache_control": {"type": "ephemeral"} attribute to the content block you want to cache, like so:

"content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"}
                },
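
Wrapped in a complete request, that fragment might look like the sketch below; book_content is a placeholder for your own document, and the second text block carries the actual question:

import anthropic

client = anthropic.Anthropic()

book_content = "..."  # placeholder: must exceed the model's minimum cacheable length

response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"}  # caches everything up to here
                },
                {
                    "type": "text",
                    "text": "Summarize this book in three sentences."
                }
            ]
        }
    ]
)
print(response.content[0].text)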

Multi-turn conversations

In multi-turn conversations, you can add cache breakpoints as the conversation develops.

In the example below, the cache_control parameter is placed on the system message to designate it as part of the static prefix. The final user turn is also marked with cache_control so that the conversation up to that point is cached for follow-up requests, and the second-to-last user message is marked as well so that this request can read from the cache written by the previous one.

import anthropic
client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "...long system prompt",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        # ...long conversation so far
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hello, can you tell me more about the solar system?",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {
            "role": "assistant",
            "content": "Certainly! The solar system is the collection of celestial bodies that orbit our Sun. It consists of eight planets, numerous moons, asteroids, comets, and other objects. The planets, in order from closest to farthest from the Sun, are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Each planet has its own unique characteristics and features. Is there a specific aspect of the solar system you'd like to know more about?"
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Tell me more about Mars.",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        }
    ]
)

How are Prompt Caching tokens priced?

Cached prompts are priced based on the number of input tokens you cache and how frequently you use that content. Writing to the cache costs 25% more than the model's base input token price, while reading cached content is significantly cheaper, costing only 10% of the base input token price. The full pricing for each model is below:

Pricing Table

Model               Base Input Tokens   Cache Writes    Cache Hits     Output Tokens
Claude 3.5 Sonnet   $3 / MTok           $3.75 / MTok    $0.30 / MTok   $15 / MTok
Claude 3 Haiku      $0.25 / MTok        $0.30 / MTok    $0.03 / MTok   $1.25 / MTok
Claude 3 Opus       $15 / MTok          $18.75 / MTok   $1.50 / MTok   $75 / MTok
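
To make the math concrete: caching a 2,000-token prefix on Claude 3.5 Sonnet costs about $0.0075 to write (2,000 × $3.75 per million tokens), while every subsequent cache hit on that prefix costs about $0.0006, compared to roughly $0.006 each time it would otherwise be reprocessed at the base input rate.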

What can you cache with Prompt Caching?

Every block of your API request can be cached:

  • Tools: Tool definitions in the tools array (see the sketch after this list)
  • System messages: Content blocks in the system array
  • Messages: Content blocks in the messages.content array, for both user and assistant turns
  • Images: Content blocks in the messages.content array, in user turns
  • Tool use and tool results: Content blocks in the messages.content array, in both user and assistant turns
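
As an example of the first item, here is a sketch of caching tool definitions: placing cache_control on the last tool caches everything up to and including it. The get_weather tool is purely illustrative:

import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    tools=[
        # ...many large tool definitions above...
        {
            "name": "get_weather",
            "description": "Get the current weather for a given location.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            },
            "cache_control": {"type": "ephemeral"}  # caches all tool definitions up to here
        }
    ],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)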

When to use Prompt Caching?

Prompt caching is very useful when you’re dealing with longer prompts. Think prompts with many examples, or cases where you’re retrieving big chunks of data from your vector database.

It’s also very useful in cases where you’re dealing with long multi-turn conversations, where your chatbot needs to remember previous instructions and tasks.

Here are some examples:

Conversational agents

Imagine a chatbot that needs to handle long conversations or answer questions about uploaded documents. Instead of reprocessing the whole document on every turn, caching it can drastically reduce costs and speed up responses, especially when you’re dealing with repetitive or extended queries.

Coding assistants

If you’re using a tool like an autocomplete or Q&A assistant for coding, caching can help by storing a summarized version of the codebase.

This way, when the assistant pulls up suggestions or answers questions about the code, it doesn’t have to reprocess everything, speeding up the entire experience.

Large document processing

Suppose you have a long legal contract or a research paper that also includes images. Normally, incorporating this kind of detailed content in an AI prompt would slow things down.

But with caching, you can store the material and keep the latency low while still providing a complete and detailed response.

Few-shot instructions

Developers often include just a few examples in their prompts when working with AI models. However, with prompt caching, you can easily include dozens of high-quality examples without increasing the time it takes for the AI to respond. This is great for scenarios where you need the model to give highly accurate responses to complex instructions, like in customer service or technical troubleshooting.

Agentic search and tool use

For tasks that involve multiple steps or tools (like using APIs in stages), caching each round can enhance performance. For example, if you’re building an agent that searches and makes iterative changes based on new information, caching helps by skipping redundant steps.

Q&A over books, papers, and podcasts

Instead of embedding the whole text into a vector database, you can include the full text in the prompt, cache it, and let users ask questions about it directly.