What is LLM Streaming and How to Use It?

Supported by: OpenAI, Anthropic, Google

What is LLM Streaming?

LLM Streaming is a technique for incrementally receiving data as it is generated by an LLM. This contrasts with the default request-response model, where the LLM finishes generating the entire response before sending it to the client.

Why do you need LLM Streaming?

LLM Streaming is a critical feature. LLMs can take time to generate responses; some complex replies can take over a minute. However, end users might lack the patience to wait, leading to churn.

Imagine you're building a content writer application that helps generate blog posts from voice memos. On average, creating a post takes about a minute, even when your setup uses the fastest available model. Now picture your users staring at a spinning loader for that entire minute. Even if the content is excellent, most users won't have the patience to wait that long; they'll expect quicker results.

LLM streaming is supported in ChatGPT

That’s where LLM Streaming comes in: instead of showing a long loader, you progressively display content as it is generated. ChatGPT does exactly this, revealing your answer word by word as it's produced.

However, LLM Streaming is not just a toggle you flip on. Streaming changes how the LLM communicates with your backend and, by extension, how your backend communicates with your frontend. Supporting it therefore requires some deliberate design decisions.

How does LLM Streaming work?

LLMs generate output progressively, emitting it as tokens (small units of text). LLM Streaming simply waits for new tokens to become available, batches them together, and dispatches each chunk to the client.
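Conceptually, the loop looks something like the sketch below: an async iterator yields tokens as the model produces them, and the server groups them into chunks before forwarding. The tokenStream and sendChunk names here are hypothetical stand-ins, not part of any provider SDK.

// Conceptual sketch only: `tokenStream` stands in for whatever async
// iterable your provider SDK exposes, and `sendChunk` for however you
// forward data to the client.
async function relayTokens(tokenStream, sendChunk, batchSize = 5) {
  let batch = [];
  for await (const token of tokenStream) {
    batch.push(token);
    if (batch.length >= batchSize) {
      sendChunk(batch.join('')); // forward a chunk as soon as it fills up
      batch = [];
    }
  }
  if (batch.length > 0) sendChunk(batch.join('')); // flush the remainder
}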

This isn’t a groundbreaking technique.

Streaming has been common in applications for some time, particularly those that serve large media like music (Spotify) and video (YouTube). But there is a distinction: in most applications, streaming exists to cope with large file sizes. For LLMs, streaming is needed because the data itself becomes available gradually, token by token, rather than all at once.

Technically, LLM streaming only describes streaming data from an LLM to your backend. In practice, there is a second half to the equation: re-streaming that data from the backend to the frontend. Because API keys must stay private on the server, frontends cannot call LLM APIs directly, so this double-streaming pattern is necessary.

How do you set up LLM streaming?

OpenAI

You can enable streaming on OpenAI’s Chat Completions API by setting the optional stream parameter to true when calling openai.chat.completions.create.
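A minimal sketch with OpenAI’s official Node SDK might look like this; the model name and prompt are just placeholders:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// With stream: true, the SDK returns an async iterable of chunks
const stream = await openai.chat.completions.create({
  model: 'gpt-4o-mini', // example model
  messages: [{ role: 'user', content: 'Explain streaming in one line' }],
  stream: true,
});

for await (const chunk of stream) {
  // Each chunk carries a small delta of the response text
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}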

Anthropic

You can enable streaming on Anthropic’s Messages API by setting the stream parameter to true. Alternatively, you can use the SDK’s built-in helper, client.messages.stream.
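A minimal sketch with Anthropic’s Node SDK, using the stream helper; the model name is a placeholder:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// The stream() helper emits text deltas as events
const stream = client.messages.stream({
  model: 'claude-3-5-sonnet-latest', // example model
  max_tokens: 256,
  messages: [{ role: 'user', content: 'Explain streaming in one line' }],
});

stream.on('text', (text) => process.stdout.write(text));
await stream.finalMessage(); // resolves once the full message has arrived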

Gemini

To stream from Gemini, you use the dedicated models.streamGenerateContent method rather than a boolean flag.
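A minimal sketch with the @google/generative-ai Node SDK, which wraps streamGenerateContent behind a generateContentStream call; the model name is a placeholder:

import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' }); // example model

// Calls models.streamGenerateContent under the hood
const result = await model.generateContentStream('Explain streaming in one line');

for await (const chunk of result.stream) {
  process.stdout.write(chunk.text()); // each chunk exposes its text delta
}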

How do you stream data to the client side?

There are three main approaches to re-streaming data to a client frontend: (a) polling, (b) server-sent events, and (c) websockets.

Polling

Polling is a simple technique that isn’t exactly streaming. Instead, the client keeps asking the backend whether new tokens are available. This can be done through (i) long polling, where the server holds each request open until new data arrives and the client immediately issues the next one, or (ii) short polling, where the client sends requests on a fixed cadence.

Polling is easy to implement but isn’t the most efficient strategy, because the frontend has to keep making fresh requests to the backend. It also means the frontend isn’t notified the moment new data is available; it only finds out on its next request.
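As a rough illustration of short polling, the browser sketch below repeatedly asks a hypothetical /tokens endpoint for whatever text has accumulated since the last request; the endpoint, its response shape, and the #output element are assumptions for illustration, not part of any provider API:

// Hypothetical short-polling loop: /tokens?since=<offset> is an assumed
// backend endpoint that returns { text, offset, done } with any new output.
let offset = 0;

const poll = setInterval(async () => {
  const res = await fetch(`/tokens?since=${offset}`);
  const { text, offset: nextOffset, done } = await res.json();

  if (text) {
    document.querySelector('#output').textContent += text; // append new tokens
    offset = nextOffset;
  }
  if (done) clearInterval(poll); // stop polling once generation finishes
}, 500); // ask every 500 ms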

Server-sent events

Server-sent events, or SSE, are a streaming strategy that pushes data to the client as soon as it is available. SSE is more efficient than polling because it eliminates unnecessary network traffic. It is also uni-directional: after the initial request, the server exclusively pushes data to the client, which suits text streaming well because the payloads are small. SSE is widely supported because it works over a standard HTTP connection.

Websockets

Websockets are similar to server-sent events but act as a bi-directional channel: data can be streamed in either direction. A websocket is a common choice for something communication-based, like a webchat or an audio call.

Websockets are generally overkill for streaming LLM output, but they may be the right choice if the application already needs to stream data in both directions.
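For completeness, here is a minimal sketch of the same relay idea over a websocket using the ws package; the message shape and the getTokenStream helper are assumptions for illustration:

import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket) => {
  // The client sends a prompt; we stream tokens back on the same socket
  socket.on('message', async (data) => {
    const { query } = JSON.parse(data.toString());

    // getTokenStream() is a hypothetical helper that wraps your LLM call
    for await (const token of getTokenStream(query)) {
      socket.send(JSON.stringify({ token }));
    }
    socket.send(JSON.stringify({ done: true })); // signal completion
  });
});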

How do you experiment with LLM streaming?

Usually, the decision to use LLM streaming is cut-and-dried. Because of the additional setup work, the default should be to receive data through traditional requests. But if generation delays are hurting the user experience, it’s worth adopting LLM streaming.

The main decision to experiment with is how to implement the backend-to-frontend connection. Most companies use server-sent events unless websockets are already in place. Polling is uncommon because of its inefficiency, and mostly makes sense when new data only needs to be fetched occasionally.

How do you use LLM streaming?

Let’s explore an LLM streaming example in practice. We’ll use a Node.js backend interfacing with OpenAI’s API, with the end client being a web browser, and server-sent events as our re-streaming strategy.

We’ll break this down into two core snippets: (i) streaming data from the LLM’s API and re-streaming data to the frontend, and (ii) the frontend receiving the data from the backend.

1. Streaming data from OpenAI’s API

To stream data from OpenAI’s API, initialize the client, set up an endpoint, and make a request with stream set to true.

Additionally, we’ll set a Content-Type header of text/event-stream and a Connection header of keep-alive so we can re-stream the tokens to the frontend via server-sent events.

// imports
import express from 'express';
import OpenAI from 'openai';

// configure the OpenAI client
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const app = express();

// streaming endpoint
app.get('/stream', async (req, res) => {
  const userQuery = req.query.query;

  // SSE headers: mark the response as an event stream and keep it open
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    // with stream: true, the SDK returns an async iterable of chunks
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: userQuery }],
      max_tokens: 100,
      stream: true, // Enable streaming
    });

    // re-stream each token to the frontend as it comes in
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content;
      if (token) {
        // JSON-encode the token so newlines can't break the SSE framing
        res.write(`data: ${JSON.stringify(token)}\n\n`);
      }
    }

    // signal to the frontend that the stream is finished
    res.write('data: [DONE]\n\n');
    res.end();

  } catch (err) {
    console.error(err);
    if (!res.headersSent) {
      res.status(500).json({ error: 'Failed to stream data from OpenAI' });
    } else {
      res.end(); // headers were already sent; just close the stream
    }
  }
});

app.listen(3000, () => {
  console.log('Server is running on port 3000');
});

2. Frontend: Receiving the Streamed Data

Next, we’ll use the browser’s EventSource API to receive the data from the backend and append each token to the page as it arrives.

// open the SSE connection (encode the query so spaces and symbols are safe)
const query = encodeURIComponent('Hello LLM');
const eventSource = new EventSource(`http://example.com/stream?query=${query}`);

// receive streamed data
eventSource.onmessage = function (event) {
  if (event.data === '[DONE]') {
    eventSource.close(); // the backend signalled that generation is complete
    return;
  }
  const token = JSON.parse(event.data); // tokens are JSON-encoded by the backend
  document.querySelector('#output').textContent += token; // append to the page (assumes an element with id="output")
};

eventSource.onerror = function (event) {
  console.error('Error occurred:', event);
  eventSource.close();
};

And that’s it! You’ll receive data as it is made available from the LLM, with the backend serving as the pipe.

Using a strategy like the above, or another strategy discussed in this article, you’ll be able to efficiently receive data from an LLM even when generation delays are non-trivial.