July 1, 2026·3 min read

Streaming LLM Responses in Your App

The single biggest upgrade you can make to an AI feature's feel is streaming. A response that takes six seconds to complete but starts appearing after half a second feels fast; the same response delivered all at once feels broken. Users don't measure completion time — they measure time to first sign of life.

How streaming actually works

LLM APIs generate text one token at a time, and every major provider can send those tokens to you as they're produced, usually over server-sent events (SSE): a long-lived HTTP response where the server writes small chunks as they become available. Instead of waiting for the full completion, you receive a stream of deltas — a few characters each — and append them to the UI as they arrive.
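To make the wire format concrete, here is a minimal sketch of an SSE reader in plain Dart. It assumes each event carries its text in one or more data: lines and ends with a blank line, which is the basic shape of the format; the function name sseData is made up for this example, and real provider payloads are usually JSON that you would decode one step further.

```dart
import 'dart:convert';

/// Turns raw SSE bytes into a stream of `data:` payloads, one per event.
/// `sseData` is a hypothetical helper name, not part of any package.
Stream<String> sseData(Stream<List<int>> bytes) async* {
  final dataLines = <String>[];
  // bind() decodes UTF-8 incrementally; LineSplitter reassembles lines
  // that were split across network chunks.
  final lines = utf8.decoder.bind(bytes).transform(const LineSplitter());
  await for (final line in lines) {
    if (line.isEmpty) {
      // A blank line terminates the current event.
      if (dataLines.isNotEmpty) {
        yield dataLines.join('\n');
        dataLines.clear();
      }
    } else if (line.startsWith('data:')) {
      final value = line.substring(5);
      // The format allows one optional space after the colon.
      dataLines.add(value.startsWith(' ') ? value.substring(1) : value);
    }
    // Comment lines (leading ':') and other fields are ignored in this sketch.
  }
}

Future<void> main() async {
  final wire = utf8.encode('data: Hel\n\n: keep-alive\n\ndata: lo\n\n');
  // Split mid-line to imitate arbitrary network chunk boundaries.
  final chunks = [wire.sublist(0, 7), wire.sublist(7)];
  final events = await sseData(Stream.fromIterable(chunks)).toList();
  print(events); // prints [Hel, lo]
}
```

Note that chunk boundaries and event boundaries are unrelated: a single network read can contain half an event or several, which is why the reader works line by line rather than chunk by chunk.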

Nothing about the model gets faster. Total generation time is identical. What changes is perceived latency, and perceived latency is the one users actually experience.

Route it through your backend

Your app should never talk to the LLM provider directly — that would mean shipping your API key in the binary. So the stream has to pass through your backend, which means your backend must stream too:

Provider --SSE--> Your backend --SSE--> Flutter app

The backend receives provider deltas and forwards them to the client as its own SSE stream (or over an existing WebSocket if you have one). The important part is to forward chunks immediately rather than accumulating them. It's an easy mistake: one buffered read loop, and your "streaming" endpoint quietly becomes a batch endpoint.
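As a rough illustration of forwarding without accumulating, the sketch below assumes a backend written in Dart on dart:io, and assumes the provider's output has already been reduced to a Stream<String> of deltas. The names sseEvent and forward are hypothetical; the same pattern applies in any server framework that lets you flush per write.

```dart
import 'dart:io';

/// Formats one delta as a single SSE event. A delta containing newlines
/// must be split across several `data:` lines, or the client will see
/// the event end early.
String sseEvent(String delta) =>
    delta.split('\n').map((line) => 'data: $line\n').join() + '\n';

/// Relays provider deltas to the client one event at a time.
/// `providerDeltas` stands in for whatever your provider SDK exposes.
Future<void> forward(Stream<String> providerDeltas, HttpResponse res) async {
  res.headers.contentType =
      ContentType('text', 'event-stream', charset: 'utf-8');
  res.headers.set('Cache-Control', 'no-cache');
  // Ask dart:io not to hold writes in its own output buffer.
  res.bufferOutput = false;
  await for (final delta in providerDeltas) {
    res.write(sseEvent(delta));
    // Push this event out before waiting for the next delta.
    await res.flush();
  }
  await res.close();
}
```

The accidental batch version looks almost identical: collect the deltas into a StringBuffer inside the loop and write once after it. The only structural difference is whether the write and flush sit inside the loop or after it.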

The Flutter side

On the client, read the response as a stream instead of a future. With the standard http package, build a Request, pass it to Client.send, and listen to the stream on the StreamedResponse it returns (the convenience methods like get and post wait for the whole body); packages like flutter_client_sse handle SSE parsing for you. Decode each event, append the delta to a buffer, and rebuild the text widget:

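// Assumes sseStream yields parsed events that expose a `delta` string,
// and that this runs inside a State object.
final buffer = StringBuffer();
await for (final chunk in sseStream) {
  if (!mounted) break; // widget was disposed mid-stream; stop updating
  buffer.write(chunk.delta);
  setState(() => text = buffer.toString());
}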

Two UI details matter more than they look:

  • Show something before the first token. There's still a gap while the model starts up. A subtle shimmer or "thinking" indicator covers it; a frozen screen doesn't.
  • Offer a stop button. Streams can be cancelled mid-generation. Cancelling saves the user's time and your tokens — wire it to abort the request, not just hide the text.
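For the stop button, one way to wire it is to hold on to the stream subscription and cancel it, rather than only hiding the text. The sketch below is plain Dart with a hypothetical ReplySession class; it assumes the deltas come from a stream whose cancellation propagates to the underlying HTTP request, which you should verify for the client you use.

```dart
import 'dart:async';

/// Owns one streamed reply and lets the UI stop it.
/// `ReplySession` is a hypothetical name for this example.
class ReplySession {
  ReplySession(
    Stream<String> deltas, {
    required void Function(String text) onText,
  }) {
    _sub = deltas.listen((delta) {
      _buffer.write(delta);
      onText(_buffer.toString()); // e.g. call setState with the new text
    });
  }

  final _buffer = StringBuffer();
  late final StreamSubscription<String> _sub;

  /// Text received so far; it stays available after stop().
  String get text => _buffer.toString();

  /// Cancels the subscription, which signals upstream to stop producing.
  Future<void> stop() => _sub.cancel();
}
```

The stop button's onPressed would call session.stop(). For the tokens to actually be saved, the backend also has to notice the client going away and cancel its own request to the provider; otherwise generation continues with nobody listening.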

Pitfalls that break streaming in production

Buffering middleboxes. Nginx, some CDNs, and some serverless platforms buffer responses by default, which turns your stream back into one big chunk. For nginx, the proxy_buffering off directive (or an X-Accel-Buffering: no response header set by your backend) fixes it. Always test streaming through your real infrastructure, not just localhost.

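Timeouts. Long generations can exceed default request timeouts on proxies and mobile HTTP clients, and a pause between tokens can trip idle timeouts even on a short response. Raise both for the streaming endpoint specifically, and send a keep-alive comment (an SSE line that starts with a colon, which clients ignore) every few seconds if the model pauses.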
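A keep-alive can be added as a small wrapper around the outgoing event stream. The following is a sketch in plain Dart, assuming the backend already has its SSE events as a Stream<String>; withKeepAlive and the idle parameter are names invented for this example, and pause/resume handling is left out for brevity.

```dart
import 'dart:async';

/// Passes events through unchanged and inserts an SSE comment whenever
/// the source has been silent for `idle`.
Stream<String> withKeepAlive(Stream<String> events, {required Duration idle}) {
  late StreamController<String> out;
  StreamSubscription<String>? sub;
  Timer? timer;

  // (Re)starts the silence timer; it re-arms itself after each comment.
  void arm() {
    timer?.cancel();
    timer = Timer(idle, () {
      out.add(': keep-alive\n\n'); // comment line, ignored by SSE clients
      arm();
    });
  }

  out = StreamController<String>(
    onListen: () {
      arm();
      sub = events.listen(
        (event) {
          out.add(event);
          arm(); // real traffic resets the timer
        },
        onError: out.addError,
        onDone: () {
          timer?.cancel();
          out.close();
        },
      );
    },
    onCancel: () async {
      timer?.cancel();
      await sub?.cancel();
    },
  );
  return out.stream;
}
```

Pick an idle value comfortably below the shortest idle timeout anywhere between the backend and the phone, since the comment only helps if it arrives before that timeout fires.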

Markdown mid-render. If you render the streamed text as Markdown, partial input will briefly contain unclosed code fences and half-written links. Either render plain text while streaming and re-render as Markdown at the end, or use a renderer that tolerates incomplete input.
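If your renderer is not tolerant, a cheap approximation is to patch the partial text before each render. The sketch below handles only the most visible case, an unclosed backtick code fence; closeOpenFence is a hypothetical helper, and tilde fences, half-written links, and inline code are deliberately not covered.

```dart
/// Appends a closing fence when the partial text has an odd number of
/// backtick fence lines, so a strict Markdown renderer sees balanced input.
String closeOpenFence(String partial) {
  final fence = '`' * 3; // three backticks, built here to keep the sketch readable
  // Count lines that start (after up to three spaces) with a fence.
  final fenceLines =
      RegExp('^ {0,3}$fence', multiLine: true).allMatches(partial).length;
  if (fenceLines.isEven) return partial;
  return partial.endsWith('\n') ? '$partial$fence' : '$partial\n$fence';
}

void main() {
  final fence = '`' * 3;
  final streamedSoFar = 'Here is the code:\n${fence}dart\nprint(1);';
  // The patched copy is only for rendering; keep the raw buffer untouched
  // so the next delta is appended to the real text.
  print(closeOpenFence(streamedSoFar));
}
```

Because the patch is applied to a copy on every rebuild, it disappears on its own once the model emits the real closing fence.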

Structured output doesn't stream well. If the model is returning JSON that your code will parse, streaming the raw text to the user gains nothing, because you can't act on half a JSON object. Stream conversational text; fetch structured results whole.

When not to stream

Skip streaming when the output is short (a classification, a title, a yes/no), when it's machine-consumed rather than read, or when the extra moving parts aren't worth it for a rarely used feature. Streaming is UX polish for reading experiences — chat, drafting, summarization — and that's where it pays for itself.

The takeaway

Streaming is mostly plumbing: an SSE pass-through on your backend, a stream listener in Flutter, and attention to buffering and timeouts in between. A few hundred lines of unglamorous code, and your AI feature goes from "is it stuck?" to "it's already answering." Of everything you can do to make an LLM feature feel fast, this is the highest return on effort.