Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
Streaming LLM responses is one of those features that looks simple — tokens stream from the model to the client, user sees them as they arrive, done. The complexity hides in the failure modes: the client disconnects, the model stalls, the user cancels, the request times out. Each of those needs explicit handling or you leak money and frustrate users. This is what we've learned shipping streaming endpoints over the last year.
For chat-style features and long-form generation, streaming makes a huge UX difference. A 30-second response that streams from token 1 feels faster than a 5-second response delivered all at once — because the user sees activity immediately, gets a sense of the answer's shape, and can interrupt if it's going wrong.
For non-interactive use cases (background batch, structured extraction), streaming adds complexity without value. Use the non-streaming endpoint.
The patterns below assume streaming is the right call.
Two common options:
Server-Sent Events (SSE) — the boring, well-understood, web-native choice. Browser EventSource API works out of the box; servers can implement with stdlib in most languages.
WebSockets — more flexible (bidirectional, binary), more complex. Useful when you want client → server messages mid-stream. Overkill for most chat-streaming use cases.
We use SSE for almost all our streaming endpoints. The few exceptions are real-time voice features where bidirectional binary is required.
SSE looks like this on the wire:
HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"type":"token","content":"Hello"}

data: {"type":"token","content":" world"}

data: {"type":"done","total_tokens":42}
Each data: line is one event. Two newlines separate events. Easy to debug with curl.
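For example, assuming the handler shown below is running locally on port 3000 (the path and port are illustrative):

curl -N -X POST http://localhost:3000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

The -N flag disables curl's output buffering, so events print as they arrive instead of in one batch at the end.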
If the user closes the tab, navigates away, or hits cancel, the request to your server is over. But the LLM call is still running, generating tokens you're paying for. Without explicit cancellation, you keep generating until the model is done; the wasted tokens add up.
The pattern that works:
Wire the connection's close event to an AbortController (or the equivalent in your language). In Node.js:
app.get("/chat", async (req, res) => {
const controller = new AbortController();
req.on("close", () => controller.abort());
res.writeHead(200, {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
});
const stream = await openai.chat.completions.create({
model: "gpt-4o",
messages: req.body.messages,
stream: true,
}, { signal: controller.signal });
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
res.write(`data: ${JSON.stringify({ type: "token", content })}\n\n`);
}
}
res.end();
});
The key line is req.on("close", () => controller.abort()). Without it, killed clients waste tokens silently. With it, cancellation propagates to the LLM call within a few hundred ms.
We measured the savings once: ~12% of all chat requests get cancelled before completion. Without abort handling, we'd be paying for the full responses on all of those. Real money.
A single long timeout protects against a runaway request, but it misses a stalled stream. We set three timeouts:
A first-byte timeout: how long we wait for the first token to arrive.
An inter-token timeout: the maximum gap between consecutive tokens.
A total timeout: a hard cap on the whole response.
These run as races against the LLM stream. Whichever fires first wins; on timeout, we abort the underlying call and return a structured error to the client.
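A minimal sketch of how the races compose, assuming the controller and stream from the handler above (the timeout values are illustrative, not our production settings):

// Illustrative values, not our production settings.
const FIRST_TOKEN_MS = 10_000;
const INTER_TOKEN_MS = 15_000;
const TOTAL_MS = 120_000;

const totalTimer = setTimeout(() => controller.abort(), TOTAL_MS);
let gapTimer = setTimeout(() => controller.abort(), FIRST_TOKEN_MS);

try {
  for await (const chunk of stream) {
    // A chunk arrived in time: reset the inter-token deadline.
    clearTimeout(gapTimer);
    gapTimer = setTimeout(() => controller.abort(), INTER_TOKEN_MS);
    // ... write the chunk to the client ...
  }
} finally {
  // Whether the stream finished or a timer aborted it, clean up the timers.
  clearTimeout(gapTimer);
  clearTimeout(totalTimer);
}

When any timer fires, controller.abort() makes the for await throw, and that's where the structured error to the client gets emitted.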
If the client is slow to consume tokens (slow network, slow renderer), the server's output buffer fills. Without handling, the server keeps pulling tokens from the LLM faster than the client can receive them. Tokens queue in memory, server resources stay tied up, and the LLM keeps charging.
The fix in Node.js: respect the writable stream's write() return value. write() returns false when the buffer is full; wait for drain before writing more.
function writeEvent(res, event) {
  const ok = res.write(`data: ${JSON.stringify(event)}\n\n`);
  if (!ok) {
    return new Promise((resolve) => res.once("drain", resolve));
  }
}

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    await writeEvent(res, { type: "token", content });
  }
}
The await pauses the for-loop when the client is slow. The LLM stream pauses too: for await doesn't pull the next chunk until the loop body, which is awaiting drain, completes. The result: the server matches the client's pace, and no in-memory queue builds up.
This is the part that most streaming implementations skip. It works fine in low-volume testing; only at scale does the lack of backpressure become a problem.
The LLM has already streamed 200 tokens. Then the connection drops, or the model hits a content filter, or the API errors. What does the client see?
Options:
1. Discard the partial output and show a generic error.
2. Keep the partial output and mark it explicitly incomplete.
3. Retry transparently behind the scenes and pretend nothing happened.
We chose option 2. When an error occurs mid-stream, we emit a final event:
data: {"type":"error","error":"upstream timeout","partial":true}
The client knows the response is incomplete and renders accordingly. We don't try to retry transparently; users prefer "this got cut off, try again?" over silently losing context.
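Server-side, that's a catch around the token loop. A minimal sketch, assuming the handler above (the error string here is illustrative):

try {
  for await (const chunk of stream) {
    // ... write token events as above ...
  }
  res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
} catch (err) {
  // Tell the client explicitly that the response ended early.
  res.write(`data: ${JSON.stringify({ type: "error", error: "upstream timeout", partial: true })}\n\n`);
} finally {
  res.end();
}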
Streaming token-by-token is technically possible but visually janky — words appear letter by letter, sometimes mid-word breaks render oddly.
We buffer to word boundaries on the server before sending:
let buffer = "";
for await (const chunk of stream) {
buffer += chunk.choices[0]?.delta?.content ?? "";
while (buffer.includes(" ")) {
const splitIdx = buffer.lastIndexOf(" ");
const wordChunk = buffer.slice(0, splitIdx + 1);
buffer = buffer.slice(splitIdx + 1);
await writeEvent(res, { type: "token", content: wordChunk });
}
}
if (buffer) await writeEvent(res, { type: "token", content: buffer });
The user sees words appear, not letters. Slight latency cost (a few hundred ms at most); much better visual experience.
For markdown responses we go further — buffer to paragraph or list-item boundaries, since rendering mid-list-item HTML looks broken in the client.
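The paragraph-level variant is the same loop splitting on blank lines instead of spaces. A sketch (a fuller version would also treat the single newline that closes a list item as a boundary):

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.choices[0]?.delta?.content ?? "";
  let idx;
  // A blank line ("\n\n") closes a markdown paragraph.
  while ((idx = buffer.indexOf("\n\n")) !== -1) {
    await writeEvent(res, { type: "token", content: buffer.slice(0, idx + 2) });
    buffer = buffer.slice(idx + 2);
  }
}
if (buffer) await writeEvent(res, { type: "token", content: buffer });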
For streaming endpoints specifically, we track:
Time to first token (first-byte latency).
Inter-token gaps.
Cancellation rate.
Total stream duration.
The rate of streams that end with a mid-stream error (partial responses).
These go in Datadog alongside non-streaming endpoint metrics. The dashboard separates streaming and non-streaming because the distributions are wildly different.
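As an illustration of the first-byte metric, using a StatsD-style client (hot-shots and the metric names are assumptions, not our actual setup; any Datadog-compatible client has the same shape):

const StatsD = require("hot-shots");
const metrics = new StatsD();

const start = Date.now();
let sawFirstToken = false;
for await (const chunk of stream) {
  if (!sawFirstToken) {
    sawFirstToken = true;
    // First-token latency is a separate signal from total duration.
    metrics.timing("chat.stream.first_token_ms", Date.now() - start);
  }
  // ... write the chunk to the client ...
}
metrics.timing("chat.stream.total_ms", Date.now() - start);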
A few patterns we considered and skipped:
Resumable streams. Clients reconnecting after a network blip and resuming mid-response. Possible but rare in chat UX (users just retry); high implementation cost.
Compression on the stream. Gzip on SSE works but introduces complexity (compression buffering competing with our backpressure). We send uncompressed; trust HTTP/2 multiplexing to keep overhead low.
Token-by-token rendering on the client. The buffered word-level approach looks better; the marginal "I see characters appear" effect isn't worth implementation complexity.
The takeaways:
Cancellation handling on day one. Wasted tokens compound. The req.on("close") line is non-optional.
Three timeouts, not one. First-byte, inter-token, total. Each catches a different failure.
Respect backpressure. write() returns false sometimes. The await-drain pattern matters at scale.
Word-boundary buffering. Small implementation change; much better visual quality.
Monitor first-byte latency separately from total latency. They're different signals.
Partial results with explicit "incomplete" markers. Don't discard, don't pretend it's complete.
Streaming LLM endpoints look simple until the failure modes show up. The patterns above are what survived a year of running these in production. None are exotic; all are easy to skip until they bite. Adopting them up front saves the eventual scramble of "why is our LLM bill twice what it should be" investigations.