LLM Streaming is a technique to incrementally receive data as it is generated by an LLM. This contrasts with the default request-based model, where LLMs finish generating a response before dispatching it to the client.
LLM Streaming is a critical feature. LLMs can take time to generate responses; some complex replies can take over a minute. However, end users might lack the patience to wait, leading to churn.
Imagine you're building a content writer application that helps generate blog posts from voice memos. With your current setup, generating a post takes about a minute on average, even with the fastest model. Now picture your users staring at a spinning loading icon for that full minute. Even if the content is excellent, most users won't have the patience to wait that long; they'll expect quicker results.
That's where LLM streaming comes in: instead of a long loader, you progressively display content as it is generated. ChatGPT uses this technique too, showing your answer word by word as it's produced.
However, LLM streaming is not just a toggle you flip on. Streaming alters how the LLM communicates with the backend, and by extension, how the backend communicates with the frontend. This forces some design decisions on any application that wants to support it.
LLMs generate outputs progressively. These outputs are pumped out in tokens (small text units). LLM Streaming simply waits for new tokens to be available, batches them together, and dispatches the chunk to the client.
This isn’t a groundbreaking technique.
Streaming has been common in applications for some time, particularly those that serve large media formats like music (Spotify) and video (YouTube). But there is a distinction: in most applications, streaming exists to combat file size. For LLMs, streaming is necessary because the data only becomes available gradually, as the model generates it.
Technically, LLM streaming only describes streaming data from an LLM provider to a backend. Practically, there is a second half to the equation: re-streaming that data from the backend to the client frontend. Because API keys must stay private, frontends cannot call LLM providers directly; this is why the double-streaming pattern is necessary.
You can configure LLM streaming on OpenAI’s Chat Completions API by setting the optional stream parameter to true. This is done when initializing a conversation with openai.chat.completions.create.
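As a minimal sketch, assuming the current openai Node SDK (v4+) and a placeholder model name, enabling streaming looks roughly like this:
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// request a streamed chat completion and print each token as it arrives
const stream = await openai.chat.completions.create({
  model: 'gpt-4o-mini', // placeholder model name
  messages: [{ role: 'user', content: 'Write a haiku about streaming.' }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}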
You can stream from Anthropic’s Messages API by setting the stream parameter to true. You can also use the built-in SDK helper client.messages.stream.
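A minimal sketch using the Anthropic Node SDK's streaming helper might look like the following (the model name and prompt are placeholders):
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// stream a message and print text deltas as they arrive
const stream = anthropic.messages.stream({
  model: 'claude-3-5-sonnet-latest', // placeholder model name
  max_tokens: 256,
  messages: [{ role: 'user', content: 'Write a haiku about streaming.' }],
});

stream.on('text', (text) => process.stdout.write(text));
await stream.finalMessage(); // resolves once the full message has streamed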
To stream on Gemini, you must use the dedicated models.streamGenerateContent method.
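In the @google/generative-ai Node SDK, this method is exposed as generateContentStream; a minimal sketch (with a placeholder model name) might look like:
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' }); // placeholder model name

// stream a response and print each chunk's text as it arrives
const result = await model.generateContentStream('Write a haiku about streaming.');
for await (const chunk of result.stream) {
  process.stdout.write(chunk.text());
}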
There are three main approaches to re-streaming data to a client frontend: (a) polling, (b) server-sent events, and (c) websockets.
Polling is a simple technique that isn't exactly streaming. Instead, the client keeps asking the backend whether any new tokens are available. This can be done through (i) long polling, where the server holds each request open until new data arrives and the client immediately issues another request once a response comes back, or (ii) short polling, where the client makes a request on a timed cadence.
Polling is easy to implement but isn't the most efficient strategy, because the frontend has to keep making fresh requests to the backend. The frontend is never inherently in sync with newly available data, and many requests come back empty.
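For illustration, a short-polling client might look something like this (the /tokens endpoint and sessionId parameter are hypothetical, standing in for a backend that buffers tokens between requests):
// ask the backend for any tokens generated since the last call
async function pollTokens() {
  const res = await fetch('http://example.com/tokens?sessionId=abc123');
  const { tokens, done } = await res.json(); // assumed response shape
  // append tokens to the UI here
  if (!done) {
    setTimeout(pollTokens, 500); // short polling: ask again on a timed cadence
  }
}

pollTokens();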
Server-sent events (SSE) is a streaming strategy in which the server pushes data to the client as soon as it becomes available. SSE is more efficient than polling because it eliminates unnecessary network traffic. It is also unidirectional: after the initial request, only the server sends data, which suits LLM output well because each text payload is small. SSE is widely supported because it works over a standard HTTP connection.
Websockets are similar to server-sent events but act as a bi-directional channel: data can be streamed in either direction. A websocket is a common choice for something communication-based, like a webchat or an audio call.
Websockets are generally overkill for streaming LLM output, but they may be the right choice if the application already needs bi-directional streaming anyway.
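For comparison, a browser-side sketch of receiving streamed tokens over a websocket might look like this (the endpoint and message format are hypothetical):
// open a bi-directional channel to the backend
const socket = new WebSocket('ws://example.com/stream');

// send the query once the connection is open
socket.onopen = () => socket.send(JSON.stringify({ query: 'Hello LLM' }));

// each message carries a token (or a sentinel marking the end of the stream)
socket.onmessage = (event) => {
  if (event.data === '[DONE]') {
    socket.close();
    return;
  }
  // append event.data to the UI here
};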
Usually, the decision to use LLM streaming is cut and dried. By default, given the additional set-up work, data should be received through a traditional request. But if LLM generation delays are hurting the user experience, it's time to try streaming.
The main decision is how to implement the backend-to-frontend connection. Most companies will use server-sent events unless websockets are already in place. Polling isn't common because of its inefficiency, unless data only occasionally needs to be delivered across multiple requests.
Let’s explore an LLM streaming example in practice. We’ll use a NodeJS backend interfacing with OpenAI’s LLM, with the end client being a web browser. We’ll use server-sent events as our re-streaming strategy.
We’ll break this down into two core snippets: (i) streaming data from the LLM’s API and re-streaming data to the frontend, and (ii) the frontend receiving the data from the backend.
To stream data from OpenAI’s API, initialize the client, set up an endpoint, and make a request with stream set to true.
Additionally, we’ll use a Content-Type header of text/event-stream and a Connection header of keep-alive to re-stream events to the frontend via server-sent events.
//configure the OpenAI client and an Express app
import express from 'express';
import OpenAI from 'openai';

const app = express();
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// streaming endpoint
app.get('/stream', async (req, res) => {
  const userQuery = req.query.query;

  // headers for server-sent events
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: userQuery }],
      max_tokens: 100,
      stream: true, // Enable streaming
    });

    //re-stream each token to the client as soon as it arrives
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content;
      if (token) {
        res.write(`data: ${token}\n\n`);
      }
    }

    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err) {
    console.error(err);
    if (!res.headersSent) {
      res.status(500).json({ error: 'Failed to stream data from OpenAI' });
    } else {
      res.end(); // the stream had already started; just terminate it
    }
  }
});

app.listen(3000, () => {
  console.log('Server is running on port 3000');
});
Next, we’ll use the EventSource API on the frontend to receive the data from the backend.
const query = encodeURIComponent('Hello LLM');
const eventSource = new EventSource(`http://example.com/stream?query=${query}`);

//receive streamed data
eventSource.onmessage = function(event) {
  if (event.data === '[DONE]') {
    //the server is finished; close so the browser doesn't auto-reconnect
    eventSource.close();
    return;
  }
  //do something with event.data (a chunk of generated text)
};

eventSource.onerror = function(event) {
  console.error("Error occurred:", event);
  eventSource.close();
};
And that’s it! You’ll receive data as it is made available from the LLM, with the backend serving as the pipe.
Using a strategy like the one above, or another strategy discussed in this article, you'll be able to efficiently receive data from an LLM even when generation delays are non-trivial.