Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.

On this page

LLM Streaming UX: Backpressure, Cancellation, Partial Results

Streaming LLM responses is one of those features that looks simple — tokens stream from the model to the client, user sees them as they arrive, done. The complexity hides in the failure modes: the client disconnects, the model stalls, the user cancels, the request times out. Each of those needs explicit handling or you leak money and frustrate users. This is what we've learned shipping streaming endpoints over the last year.

Why stream at all #

For chat-style features and long-form generation, streaming makes a huge UX difference. A 30-second response that streams from token 1 feels faster than a 5-second response delivered all at once — because the user sees activity immediately, gets a sense of the answer's shape, and can interrupt if it's going wrong.

For non-interactive use cases (background batch, structured extraction), streaming adds complexity without value. Use the non-streaming endpoint.

The patterns below assume streaming is the right call.

The transport layer #

Two common options:

Server-Sent Events (SSE) — the boring, well-understood, web-native choice. Browser EventSource API works out of the box; servers can implement with stdlib in most languages.

WebSockets — more flexible (bidirectional, binary), more complex. Useful when you want client → server messages mid-stream. Overkill for most chat-streaming use cases.

We use SSE for almost all our streaming endpoints. The few exceptions are real-time voice features where bidirectional binary is required.

SSE looks like this on the wire:

code

HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"type":"token","content":"Hello"}

data: {"type":"token","content":" world"}

data: {"type":"done","total_tokens":42}

Each data: line is one event. Two newlines separate events. Easy to debug with curl.

Cancellation: the most important pattern #

If the user closes the tab, navigates away, or hits cancel, the request to your server is over. But the LLM call is still running, generating tokens you're paying for. Without explicit cancellation, you keep generating until the model is done; the wasted tokens add up.

The pattern that works:

Client opens SSE connection.
Server starts the LLM call with an AbortController (or equivalent in your language).
Server detects the client disconnect via the connection's close event.
On disconnect, server aborts the LLM call.
LLM stops generating; tokens stop counting.

In Node.js:

javascript.javascript

app.get("/chat", async (req, res) => {
  const controller = new AbortController();
  req.on("close", () => controller.abort());
  
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: req.body.messages,
    stream: true,
  }, { signal: controller.signal });
  
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ type: "token", content })}\n\n`);
    }
  }
  
  res.end();
});

The key line is req.on("close", () => controller.abort()). Without it, killed clients waste tokens silently. With it, cancellation propagates to the LLM call within a few hundred ms.

We measured the savings once: ~12% of all chat requests get cancelled before completion. Without abort handling, we'd be paying for the full responses on all of those. Real money.

Timeouts #

A long timeout protects against runaway requests. We set:

First-byte timeout: 30 seconds. If the LLM doesn't produce any token in 30 seconds, abort. Usually means the provider is having issues; better to fail fast.
Total timeout: 180 seconds (3 minutes). Upper bound on any response. Long enough for legitimate long generations; short enough that hung requests don't pile up.
Inter-token timeout: 15 seconds. If 15 seconds elapse between tokens, abort. Catches stalled streams that haven't formally errored.

These run as race conditions against the LLM stream. Whichever fires first wins; on timeout, we abort the underlying call and return a structured error to the client.

Backpressure #

If the client is slow to consume tokens (slow network, slow renderer), the server's output buffer fills. Without handling, the server keeps generating from the LLM faster than the client can receive. Tokens queue in memory, server resources are tied up, and the LLM keeps charging.

The fix in Node.js: respect the writable stream's write() return value. write() returns false when the buffer is full; wait for drain before writing more.

javascript.javascript

function writeEvent(res, event) {
  const ok = res.write(`data: ${JSON.stringify(event)}\n\n`);
  if (!ok) {
    return new Promise(resolve => res.once("drain", resolve));
  }
}

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    await writeEvent(res, { type: "token", content });
  }
}

The await pauses the for-loop when the client is slow. The LLM stream is paused too (the for-await blocks on yielding the next chunk to the loop body, which is awaiting drain). The result: server matches the client's pace; no in-memory queue buildup.

This is the part that most streaming implementations skip. It works fine in low-volume testing; only at scale does the lack of backpressure become a problem.

Partial results: what happens on error mid-stream #

The LLM has already streamed 200 tokens. Then the connection drops, or the model hits a content filter, or the API errors. What does the client see?

Options:

Discard the partial response. Cleanest from a "all-or-nothing" perspective; bad UX (user watched 200 tokens appear, then they vanish).
Keep the partial, mark it incomplete. What we do. Client UI shows the tokens that arrived, plus an "incomplete" badge or retry button. User can read what got through.
Auto-retry from where it left off. Tempting but hard; you don't know how to prompt the model to continue from token N.

We chose option 2. When an error occurs mid-stream, we emit a final event:

code

data: {"type":"error","error":"upstream timeout","partial":true}

The client knows the response is incomplete and renders accordingly. We don't try to retry transparently; users prefer "this got cut off, try again?" over silently losing context.

Buffering for downstream UX #

Streaming token-by-token is technically possible but visually janky — words appear letter by letter, sometimes mid-word breaks render oddly.

We buffer to word boundaries on the server before sending:

javascript.javascript

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.choices[0]?.delta?.content ?? "";
  while (buffer.includes(" ")) {
    const splitIdx = buffer.lastIndexOf(" ");
    const wordChunk = buffer.slice(0, splitIdx + 1);
    buffer = buffer.slice(splitIdx + 1);
    await writeEvent(res, { type: "token", content: wordChunk });
  }
}
if (buffer) await writeEvent(res, { type: "token", content: buffer });

The user sees words appear, not letters. Slight latency cost (a few hundred ms at most); much better visual experience.

For markdown responses we go further — buffer to paragraph or list-item boundaries, since rendering mid-list-item HTML looks broken in the client.

What we monitor #

For streaming endpoints specifically:

First-byte latency — time from request to first token. Should be < 2s typically.
Tokens per second during the stream. Catches stalled models.
% of requests cancelled before completion — informational; helps size cost projections.
% of requests that hit timeout — should be near zero; spikes mean upstream provider issues.
Bytes in vs bytes out — sanity check for buffering issues.

These go in Datadog alongside non-streaming endpoint metrics. The dashboard separates streaming and non-streaming because the distributions are wildly different.

What we don't bother with #

A few patterns we considered and skipped:

Resumable streams. Clients reconnecting after a network blip and resuming mid-response. Possible but rare in chat UX (users just retry); high implementation cost.

Compression on the stream. Gzip on SSE works but introduces complexity (compression buffering competing with our backpressure). We send uncompressed; trust HTTP/2 multiplexing to keep overhead low.

Token-by-token rendering on the client. The buffered word-level approach looks better; the marginal "I see characters appear" effect isn't worth implementation complexity.

What I'd tell a team starting #

Cancellation handling on day one. Wasted tokens compound. The req.on("close") line is non-optional.

Three timeouts, not one. First-byte, inter-token, total. Each catches a different failure.

Respect backpressure. write() returns false sometimes. The await-drain pattern matters at scale.

Word-boundary buffering. Small implementation change; much better visual quality.

Monitor first-byte latency separately from total latency. They're different signals.

Partial results with explicit "incomplete" markers. Don't discard, don't pretend it's complete.

Streaming LLM endpoints look simple until the failure modes show up. The patterns above are what survived a year of running these in production. None are exotic; all are easy to skip until they bite. Adopting them up front saves the eventual scramble of "why is our LLM bill twice what it should be" investigations.

LLM Streaming UX — Backpressure, Cancellation, Partial Results

LLM Streaming UX: Backpressure, Cancellation, Partial Results

Why stream at all #

The transport layer #

Cancellation: the most important pattern #

Timeouts #

Backpressure #

Partial results: what happens on error mid-stream #

Buffering for downstream UX #

What we monitor #

What we don't bother with #

What I'd tell a team starting #

Stay Updated

Internal Developer Platforms — Backstage in Practice

AWS Step Functions for Workflow Orchestration

More from AI

Embeddings Drift Detection — When "Similar Enough" Stops Being Similar

AI Agent Tool Design — Boundaries and Confirmations

What Are Embeddings? A Beginner's Guide with Code

Embeddings Drift Detection — When "Similar Enough" Stops Being Similar

AI Agent Tool Design — Boundaries and Confirmations

What Are Embeddings? A Beginner's Guide with Code

Prompt Engineering Basics — From "Help Me" to Working Prompts

Build Your First RAG App in 100 Lines of Python

Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers

About Admin

You might have missed

GitOps with Argo CD: Best Practices for 2025

Linux Performance Tuning for Containers and Kubernetes Nodes

AWS Lambda and Serverless Best Practices for Production