Streaming LLM Responses in Your App

The single biggest upgrade you can make to an AI feature's feel is streaming. A response that takes six seconds to complete but starts appearing after half a second feels fast; the same response delivered all at once feels broken. Users don't measure completion time — they measure time to first sign of life. Every chat product you've ever used streams for exactly this reason, and once you've seen your own feature with and without it, you won't ship the "without" version again.

This post walks through the whole pipe: how streaming works at the API level, how to relay it through your backend, how to consume it in a Flutter app, and the handful of production pitfalls that catch nearly everyone once.

How streaming actually works

LLM APIs generate text one token at a time — that's simply how the models work — and every major provider can send those tokens to you as they're produced, usually over server-sent events (SSE): a long-lived HTTP response where the server writes small data: chunks as they become available. Instead of waiting for the full completion, you receive a stream of deltas — a few characters each — and append them to the UI as they arrive. When generation ends, the stream closes with a final event carrying metadata like token counts and the stop reason.
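To make the wire format concrete, here is a minimal sketch of what an SSE body looks like and how it splits into events. The `{"delta": ...}` payload shape and the names `sample`, `parseSseData`, and `joinDeltas` are assumptions for illustration; each provider defines its own event schema.

```dart
import 'dart:convert';

// A hypothetical SSE body: events are separated by a blank line,
// and a line starting with ':' is a comment that clients ignore.
const sample = 'data: {"delta":"Hel"}\n\n'
    ': keep-alive comment\n\n'
    'data: {"delta":"lo"}\n\n';

/// Splits raw SSE text into the payloads of its `data:` fields.
List<String> parseSseData(String raw) {
  final events = <String>[];
  final dataLines = <String>[];
  for (final line in const LineSplitter().convert(raw)) {
    if (line.isEmpty) {
      // Blank line: dispatch the event collected so far, if any.
      if (dataLines.isNotEmpty) {
        events.add(dataLines.join('\n'));
        dataLines.clear();
      }
    } else if (line.startsWith('data:')) {
      // One optional space after the colon is not part of the value.
      var value = line.substring(5);
      if (value.startsWith(' ')) value = value.substring(1);
      dataLines.add(value);
    }
    // Comments and other fields (event:, id:, retry:) are skipped in this sketch.
  }
  return events;
}

/// Decodes each payload (assumed shape: {"delta": "..."}) and joins the text.
String joinDeltas(String raw) => parseSseData(raw)
    .map((data) => (jsonDecode(data) as Map<String, dynamic>)['delta'] as String)
    .join();
```

Calling `joinDeltas(sample)` returns `Hello`: two data events contribute text, and the comment event contributes nothing.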

It's worth being clear about what streaming is not: nothing about the model gets faster. Total generation time is essentially the same as in the blocking version. What changes is perceived latency, because the user reads the beginning of the answer while the rest is still being written, and perceived latency is the latency users actually experience. As a bonus, they can see early that the answer is going the wrong way and stop it, which saves your tokens too.

Route it through your backend

Your app should never talk to the LLM provider directly — that would mean shipping your API key in the binary, where anyone with a proxy can extract it. So the stream has to pass through your backend, which means your backend must stream too:

Provider --SSE--> Your backend --SSE--> Flutter app

The backend receives provider deltas and forwards them to the client as its own SSE stream (or over an existing WebSocket if you have one — either transport works; SSE is simpler if you don't already run sockets). This is also the right place for everything else the client can't be trusted with: authentication, rate limiting, logging token usage, and filtering or transforming the stream before it reaches the user.

The important implementation detail is to forward chunks immediately rather than accumulating them. It's an easy mistake: one buffered read loop, one framework helper that collects the body before returning it, and your "streaming" endpoint quietly becomes a batch endpoint that just holds the connection open longer. Watch the network tab the first time — you should see bytes trickling continuously, not one burst at the end.
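As a rough sketch of "forward immediately", here is a relay written against `dart:io`, assuming your backend happens to be in Dart; the same shape applies in any language. `providerDeltas` is a hypothetical stream of text deltas from your provider call, and `sseFrame`, `relay`, and the `{"delta": ...}` payload are illustrative names, not part of any library.

```dart
import 'dart:convert';
import 'dart:io';

/// Formats one delta as an SSE event. JSON-encoding keeps newlines inside
/// the delta from breaking SSE's line-based framing.
String sseFrame(String delta) => 'data: ${jsonEncode({'delta': delta})}\n\n';

/// Relays provider deltas to the client one event at a time.
Future<void> relay(Stream<String> providerDeltas, HttpResponse response) async {
  response.headers.contentType =
      ContentType('text', 'event-stream', charset: 'utf-8');
  response.headers.set(HttpHeaders.cacheControlHeader, 'no-cache');
  response.bufferOutput = false; // ask dart:io not to batch small writes
  try {
    await for (final delta in providerDeltas) {
      response.write(sseFrame(delta));
      await response.flush(); // send this event before reading the next one
    }
  } finally {
    await response.close();
  }
}
```

The part that matters is the write followed by a flush inside the loop. Anything that collects the deltas into a list or a string first, then writes, is the accidental batch endpoint described above.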

The Flutter side

On the client, read the response as a stream instead of a future. With the standard http package you can send a Request and listen to response.stream; packages like flutter_client_sse handle the SSE wire format (event boundaries, data: prefixes, reconnection) for you. Decode each event, append the delta to a buffer, and rebuild the text widget:

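final buffer = StringBuffer();
// sseStream yields one decoded event per delta from your backend.
await for (final chunk in sseStream) {
  buffer.write(chunk.delta);
  // The user may have left the screen while the stream was still open.
  if (!mounted) break;
  setState(() => text = buffer.toString());
}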
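If you go the plain `http` route rather than an SSE package, the decoding step might look like the sketch below. It assumes every event is a single `data:` line holding `{"delta": "..."}`, which is a hypothetical schema; match it to whatever your backend emits. `deltasFrom` is an illustrative name.

```dart
import 'dart:async';
import 'dart:convert';

/// Turns a raw SSE byte stream into a stream of text deltas.
Stream<String> deltasFrom(Stream<List<int>> bytes) async* {
  final lines = bytes
      .transform(utf8.decoder) // stateful: copes with characters split across chunks
      .transform(const LineSplitter()); // stateful: copes with lines split across chunks
  await for (final line in lines) {
    // Skip blank separators, comments, and any other SSE fields.
    if (!line.startsWith('data:')) continue;
    final payload = jsonDecode(line.substring(5)) as Map<String, dynamic>;
    final delta = payload['delta'];
    if (delta is String) yield delta;
  }
}
```

With the `http` package, `bytes` would be the `stream` of the `StreamedResponse` you get back from `client.send(request)`. The two transformers are the reason to avoid decoding each network chunk on its own: chunk boundaries do not respect character or line boundaries, so a per-chunk `utf8.decode` can fail in the middle of a multi-byte character.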

At normal token rates, rebuilding a Text widget per chunk is fine. If you're rendering something heavier, throttle UI updates to one every 50–100 ms rather than one per delta; the eye can't follow individual tokens anyway.
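One way to throttle is to coalesce deltas on a timer and hand the UI a full snapshot at most once per interval. This is a sketch, not the only approach; `throttledText` is a made-up helper, and it ignores pause/resume for brevity.

```dart
import 'dart:async';

/// Coalesces a fast stream of deltas into at most one snapshot per [interval].
/// Each emitted value is the full text so far, so a listener can assign it directly.
Stream<String> throttledText(Stream<String> deltas, Duration interval) {
  final buffer = StringBuffer();
  final controller = StreamController<String>();
  var dirty = false;
  Timer? timer;
  StreamSubscription<String>? sub;

  void flush() {
    if (!dirty) return;
    dirty = false;
    controller.add(buffer.toString());
  }

  controller.onListen = () {
    timer = Timer.periodic(interval, (_) => flush());
    sub = deltas.listen(
      (delta) {
        buffer.write(delta);
        dirty = true;
      },
      onError: controller.addError,
      onDone: () {
        timer?.cancel();
        flush(); // never drop the tail of the response
        controller.close();
      },
    );
  };
  controller.onCancel = () {
    timer?.cancel();
    return sub?.cancel();
  };
  return controller.stream;
}
```

In a widget you would listen to `throttledText(deltas, const Duration(milliseconds: 80))` and call `setState` once per snapshot instead of once per delta.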

A few UI details matter more than they look:

Show something before the first token. There's still a gap while the model processes your prompt. A subtle shimmer or "thinking" indicator covers it; a frozen screen doesn't.
Offer a stop button. Streams can be cancelled mid-generation. Cancelling saves the user's time and your tokens — wire it to actually abort the HTTP request all the way through your backend to the provider, not just hide the text.
Handle mid-stream errors visibly. A connection can drop after half an answer. Keep the partial text, mark it as incomplete, and offer a retry — silently vanishing text is the worst outcome.
Think about scroll. If the text grows past the viewport, decide whether you auto-follow the bottom (chat-style) or hold position, and don't fight the user if they scroll up while it's still writing.
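The stop button and the mid-stream error case both come down to keeping the partial text and tracking how the reply ended. Here is a small sketch of that state; `StreamedReply` and `ReplyStatus` are hypothetical names, and `onChange` stands in for wherever your widget calls `setState`.

```dart
import 'dart:async';

enum ReplyStatus { streaming, complete, stopped, failed }

/// Tracks one streamed reply. Partial text survives a stop or an error.
class StreamedReply {
  final _buffer = StringBuffer();
  StreamSubscription<String>? _sub;
  ReplyStatus status = ReplyStatus.streaming;

  String get text => _buffer.toString();

  void start(Stream<String> deltas, void Function() onChange) {
    _sub = deltas.listen(
      (delta) {
        _buffer.write(delta);
        onChange();
      },
      onError: (Object _) {
        status = ReplyStatus.failed; // keep the text, let the UI offer a retry
        onChange();
      },
      onDone: () {
        status = ReplyStatus.complete;
        onChange();
      },
      cancelOnError: true, // an error ends the reply; onDone will not fire after it
    );
  }

  /// Wire this to the stop button.
  Future<void> stop() async {
    await _sub?.cancel();
    if (status == ReplyStatus.streaming) status = ReplyStatus.stopped;
  }
}
```

Cancelling the subscription should normally tear down the underlying HTTP request as well, but that depends on your client, so confirm it on the wire. Your backend then still has to notice the disconnect and cancel its own call to the provider, or you keep paying for tokens nobody reads.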

Pitfalls that break streaming in production

Buffering middleboxes. Nginx, some CDNs, and some serverless platforms buffer responses by default, which turns your carefully-built stream back into one big chunk. For nginx, proxy_buffering off (or sending the X-Accel-Buffering: no response header) fixes it; compression middleware can have the same effect and may need the endpoint excluded. Always test streaming through your real infrastructure, not just localhost — this is the number-one "works in dev, broken in prod" cause for streaming features.
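If you can't change the proxy config, the header route can be set from application code. A minimal sketch of the response headers for a streaming endpoint follows; `sseHeaders` is an illustrative name, and whether other intermediaries honor these headers is something to verify on your own stack.

```dart
/// Response headers for an SSE endpoint that sits behind nginx.
Map<String, String> sseHeaders() => const {
      'Content-Type': 'text/event-stream; charset=utf-8',
      'Cache-Control': 'no-cache',
      'X-Accel-Buffering': 'no', // tells nginx not to buffer this response
    };
```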

Timeouts. Long generations can outlive default idle timeouts on proxies, load balancers, and mobile HTTP clients; many default to 30 or 60 seconds. Raise them for the streaming endpoint specifically, and send an SSE keep-alive comment (a line starting with a colon, which clients ignore) every few seconds whenever the stream goes quiet, so nothing on the path decides the connection is dead.
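A keep-alive can be layered on top of the outgoing stream without touching the relay logic. The sketch below wraps a stream of already-formatted SSE frames and injects a comment frame whenever nothing has been sent for a while; `withKeepAlive` is a hypothetical helper and the comment text is arbitrary.

```dart
import 'dart:async';

/// Passes [frames] through unchanged and inserts an SSE comment frame
/// whenever [idle] elapses without any output.
Stream<String> withKeepAlive(Stream<String> frames, Duration idle) {
  final controller = StreamController<String>();
  Timer? timer;
  StreamSubscription<String>? sub;

  void arm() {
    timer?.cancel();
    timer = Timer(idle, () {
      controller.add(': keep-alive\n\n'); // comment frame, ignored by SSE clients
      arm();
    });
  }

  controller.onListen = () {
    arm(); // also covers the wait before the first token
    sub = frames.listen(
      (frame) {
        controller.add(frame);
        arm(); // real traffic resets the idle clock
      },
      onError: controller.addError,
      onDone: () {
        timer?.cancel();
        controller.close();
      },
    );
  };
  controller.onCancel = () {
    timer?.cancel();
    return sub?.cancel();
  };
  return controller.stream;
}
```

Pick an `idle` value comfortably below the shortest timeout on the path, for example a few seconds against a 30-second proxy timeout.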

Markdown mid-render. If you render the streamed text as Markdown, partial input will briefly contain unclosed code fences and half-written links, which some renderers display as garbage. Either render plain text while streaming and re-render as Markdown at the end, or use a renderer that tolerates incomplete input gracefully.
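If you want Markdown rendering during the stream, one cheap mitigation is to patch the partial text before each render. The sketch below handles only the most visible case, an unclosed backtick code fence; `closeOpenFence` is a made-up helper, and it deliberately ignores tilde fences, indented fences, and half-written links.

```dart
// Three backticks, built at runtime to keep this example easy to embed.
final fence = '`' * 3;

/// Returns [partial] with a closing fence appended if a code block is still open.
String closeOpenFence(String partial) {
  // Count fence markers that start a line; an odd count means one is unclosed.
  final markers = RegExp('^$fence', multiLine: true).allMatches(partial).length;
  if (markers.isEven) return partial;
  return partial.endsWith('\n') ? '$partial$fence' : '$partial\n$fence';
}
```

Render `closeOpenFence(buffer.toString())` while streaming and the raw buffer once the stream completes, so the patch never leaks into the final text.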

Structured output doesn't stream well. If the model is returning JSON your code will parse, streaming the raw text at the user gains nothing — you can't act on half a JSON object. Stream conversational text; fetch structured results whole. (If you need both, a common pattern is streaming the human-readable part and delivering the structured part in the final event.)
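The "both" pattern can be sketched as two event types on the same stream: text deltas for the reader, then one final event with the structured result. The `type`, `text`, and `result` field names here are assumptions, as are `ReplyState` and `applyEvent`; use whatever schema your backend defines.

```dart
import 'dart:convert';

/// Accumulates the readable text and holds the structured result once it arrives.
class ReplyState {
  final text = StringBuffer();
  Map<String, dynamic>? result; // set only by the final event
}

/// Applies one decoded SSE data payload to [state].
void applyEvent(ReplyState state, String data) {
  final event = jsonDecode(data) as Map<String, dynamic>;
  switch (event['type']) {
    case 'delta':
      state.text.write(event['text'] as String);
      break;
    case 'done':
      state.result = event['result'] as Map<String, dynamic>?;
      break;
  }
}
```

Your code acts on `result` only after the `done` event, so it never has to parse half a JSON object.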

Mobile lifecycle. Phones background apps mid-stream. Decide what happens when the user switches away and back: resume rendering from the buffer, or mark the response incomplete. Test it — the default behaviour is usually "something confusing."

When not to stream

Skip streaming when the output is short (a classification, a title, a yes/no), when it's machine-consumed rather than read, or when the extra moving parts aren't worth it for a rarely used feature. Streaming is UX polish for reading experiences — chat, drafting, summarization, explanation — and that's where it pays for itself. A three-word answer doesn't need a typewriter effect.

The takeaway

Streaming is mostly plumbing: an SSE pass-through on your backend that forwards deltas immediately, a stream listener in Flutter, and attention to buffering middleboxes and timeouts in between. A few hundred lines of unglamorous code, and your AI feature goes from "is it stuck?" to "it's already answering." Of everything you can do to make an LLM feature feel fast, this is the highest return on effort — do it before you optimize anything else.