How to Reduce LLM Latency in Your App

Language models are useful but slow, and in an app, slow feels broken. Users forgive a lot, but they don't forgive staring at a spinner: a feature that takes eight seconds to respond will be abandoned no matter how good the answer is, and the user won't file a bug report — they'll just stop using it. The good news: most LLM latency is addressable with ordinary engineering, not magic. This post covers the techniques that make AI features feel fast, roughly in order of impact.

Understand where the time goes

An LLM request spends time in a few distinct places:

Network round trips to the provider — including TLS handshakes and, on mobile, whatever the user's cell connection feels like today.
Time to first token (TTFT) — how long the model takes to process your input and start producing output. This scales with input length, which is one more reason lean prompts matter.
Generation time — proportional to output length, because tokens are generated one at a time.

Different fixes target different parts, so it helps to know which one is hurting you before you start. And keep one distinction in mind throughout: actual latency (the stopwatch) and perceived latency (what the user experiences) are different problems, and the second one is often cheaper to fix.
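To make the breakdown concrete, here is a back-of-the-envelope model of one request. It is only a sketch: the class name and every number below are made up for illustration, and real generation speed is not perfectly constant.

```python
from dataclasses import dataclass


@dataclass
class LatencyEstimate:
    network_s: float          # round trips to the provider
    ttft_s: float             # time to first token
    output_tokens: int        # expected length of the response
    tokens_per_second: float  # generation speed (illustrative, not measured)

    @property
    def generation_s(self) -> float:
        # Tokens come out one at a time, so time grows with output length.
        return self.output_tokens / self.tokens_per_second

    @property
    def first_token_s(self) -> float:
        # Roughly what a user waits before seeing anything when streaming.
        return self.network_s + self.ttft_s

    @property
    def total_s(self) -> float:
        # What a user waits when the app shows the answer only once complete.
        return self.network_s + self.ttft_s + self.generation_s


est = LatencyEstimate(network_s=0.2, ttft_s=0.6, output_tokens=360, tokens_per_second=50)
print(f"first token after {est.first_token_s:.1f}s, full answer after {est.total_s:.1f}s")
# prints: first token after 0.8s, full answer after 8.0s
```

With these invented numbers, generation accounts for 7.2 of the 8 seconds, so output length and perceived latency are where the leverage is. Your own split may look very different, which is the point of checking.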

Stream the response

This is the single highest-impact change for user-facing features. Instead of waiting for the entire answer, stream tokens as they're generated and render them live. The total time may be identical, but the user sees words appearing within a few hundred milliseconds instead of staring at a spinner for eight seconds. Perceived speed is what they judge, and streaming transforms it — it's the difference between "this is thinking with me" and "this is stuck."

Every major provider supports streaming, every chat product you've ever used does it, and the client-side work (typically consuming server-sent events) is well-trodden. If your feature shows generated text to a user, stream it. The main exception is structured output that your code has to parse whole before it can act on it, and that output should usually be short anyway.
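The shape of the client-side loop looks roughly like this. It is a sketch, not any provider's API: `fake_token_stream` is a hypothetical stand-in for a streaming response, and `render_stream` is an illustrative helper that writes each chunk the moment it arrives while recording time to first chunk.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(text: str, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for word in text.split(" "):
        time.sleep(delay_s)  # simulate per-token generation time
        yield word + " "


def render_stream(chunks: Iterable[str], write: Callable[[str], None]) -> dict:
    """Write each chunk as it arrives; report time to first chunk and total time."""
    start = time.perf_counter()
    first_chunk_s = None
    parts = []
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        write(chunk)  # update the UI immediately instead of buffering
        parts.append(chunk)
    total_s = time.perf_counter() - start
    return {"text": "".join(parts), "first_chunk_s": first_chunk_s, "total_s": total_s}


stats = render_stream(
    fake_token_stream("streaming makes the wait feel shorter"),
    write=lambda s: print(s, end="", flush=True),
)
print()
```

In a real app, `write` would append to the visible message, and the two timings are worth logging per request.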

Pick a smaller, faster model

The reflex to use the most capable model is expensive in latency as well as cost. Smaller models respond faster — often dramatically so, both in time-to-first-token and tokens per second. Since most app tasks are simple (classification, extraction, short rewrites), a small model frequently delivers the same result in a fraction of the time. Test your task on a smaller model against a real evaluation set; you may be paying an eight-second tax for capability you never needed.

The tiered version: route simple requests to the fast model and reserve the big one for requests that genuinely need it. Users get instant responses for the common case, and the slow path becomes the exception rather than the rule.
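A router for the tiered setup can start out very simple. The sketch below assumes you can label requests by task type. The model names, the task labels, and the 4,000-character threshold are all placeholders to tune against your own evals.

```python
FAST_MODEL = "small-fast-model"        # hypothetical model name
STRONG_MODEL = "large-capable-model"   # hypothetical model name

# Task types that passed evals on the small model (illustrative labels).
SIMPLE_TASKS = {"classify", "extract", "short_rewrite"}


def choose_model(task: str, input_chars: int, max_simple_chars: int = 4000) -> str:
    """Send known-simple, reasonably short requests to the fast model."""
    if task in SIMPLE_TASKS and input_chars <= max_simple_chars:
        return FAST_MODEL
    # Anything unrecognized or unusually long takes the slow, capable path.
    return STRONG_MODEL


print(choose_model("classify", input_chars=300))       # small-fast-model
print(choose_model("multi_step_plan", input_chars=300))  # large-capable-model
```

Defaulting unknown tasks to the stronger model keeps the failure mode "slower than necessary" rather than "wrong."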

Cache repeated work

Many requests repeat, exactly or nearly. Caching turns a slow model call into an instant lookup — it's the only technique on this list that can take latency to effectively zero.

Exact caching. Hash the normalized request; if you've answered it before, return the stored answer immediately. It is simple to build and safe as long as the cache key includes everything that changes the answer (the model, the prompt version, any per-user context). In features where users hit the same content (explaining the same error, summarizing the same page) the hit rate is often substantial.
Semantic caching. For similar-but-not-identical requests, match on meaning (via embeddings) and reuse a prior answer when close enough. More savings, more risk of serving a subtly wrong answer — reserve it for cases where approximately-right is fine.
Provider-side prompt caching. Distinct from both: providers can cache the static prefix of your prompt (system instructions, examples), which cuts time-to-first-token as well as cost on every call. If your prompt has a large fixed prefix, enable it.
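Exact caching, the first of the three, fits in a few lines. The sketch below uses an in-memory dict where production code would more likely use a shared store. The class and parameter names are illustrative, and the normalization (lowercasing, collapsing whitespace) assumes case and spacing do not change the right answer for your feature.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt_version: str, user_text: str) -> str:
    """Hash everything that affects the answer, after normalizing the text."""
    normalized = " ".join(user_text.lower().split())
    payload = json.dumps([model, prompt_version, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactCache:
    def __init__(self, call_model: Callable[[str], str], model: str, prompt_version: str):
        self._call_model = call_model  # the slow path: a real model call
        self._model = model
        self._prompt_version = prompt_version
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def answer(self, user_text: str) -> str:
        key = cache_key(self._model, self._prompt_version, user_text)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # instant lookup, no model call
        self.misses += 1
        result = self._call_model(user_text)
        self._store[key] = result
        return result
```

Putting the model and prompt version in the key means a prompt change invalidates old answers automatically instead of serving stale ones.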

Shorten the prompt and the output

Every token costs time. A bloated prompt takes longer to process before the first token appears; a rambling response takes longer to generate, token by token.

Trim redundant instructions and boilerplate from your prompt.
In RAG, retrieve only the most relevant context, not entire documents.
Cap output length with max_tokens, and ask for brevity in the prompt. The cap is a hard cutoff that truncates rather than condenses, so the instruction does the real work: "answer in one sentence" is a latency optimization wearing a style hat. If the output feeds code rather than a human, request the tersest format that parses.

Doing less work is the most reliable way to do it faster, and unlike most performance work it makes the bill smaller too.
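For the RAG case, one way to keep context lean is a fixed token budget filled with the best-scoring chunks. This is a sketch under assumptions: the relevance scores come from whatever retriever you use, and the four-characters-per-token estimate is a rough rule of thumb for English, not a real tokenizer.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token for English text.
    return max(1, len(text) // 4)


def select_context(scored_chunks: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Keep the highest-scoring chunks that fit within the token budget."""
    selected = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


chunks = [(0.9, "a" * 400), (0.5, "b" * 400), (0.8, "c" * 40)]
picked = select_context(chunks, token_budget=120)
print([len(c) for c in picked])  # [400, 40]
```

Swap in your provider's tokenizer for `estimate_tokens` if you need the budget to be exact.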

Do work in parallel

If a task needs several independent model calls (say, generating a title, tags, and a summary for the same document), run them concurrently rather than one after another. Total latency becomes roughly that of the slowest single call instead of the sum of all of them, provided you stay within your rate limits. Most LLM SDKs work with your language's ordinary async tools; the sequential version is usually just an accident of how the code grew.
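In Python the concurrent version is a single `asyncio.gather`. The sketch below uses `fake_model_call`, a hypothetical stand-in that just sleeps, so the timing is visible without a real provider.

```python
import asyncio
import time


async def fake_model_call(name: str, seconds: float) -> str:
    """Hypothetical stand-in for an async SDK call."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def enrich_document() -> list:
    # The three calls are independent, so start them together.
    return await asyncio.gather(
        fake_model_call("title", 0.05),
        fake_model_call("tags", 0.10),
        fake_model_call("summary", 0.15),
    )


start = time.perf_counter()
results = asyncio.run(enrich_document())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, these simulated calls would take about 0.30 seconds; gathered, the wait is close to the slowest one at 0.15.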

Similarly, overlap setup with waiting: fetch your RAG context while the user is still typing, warm the connection early, prepare the prompt while other I/O happens. Anything that isn't on the critical path shouldn't run on it.

Hide latency in the UX

Sometimes the fix is design, not speed:

Optimistic UI. Show something useful immediately — the skeleton of the result, the user's input echoed into place — while the real answer loads.
Prefetch. If you can predict the likely next request (the next item in a list, the obvious follow-up), start it before the user asks.
Do it in the background. Not every AI result needs to block the user. Auto-tagging, enrichment, and summaries-for-later can happen out of sight, where latency stops mattering entirely.
Good loading states. A staged progress indicator ("Searching your documents… Writing answer…") makes waiting feel shorter and tells the user the system is alive. A frozen spinner communicates "possibly crashed."

One caution from experience: don't fake precision in progress bars. Users forgive "working on it" and resent a bar that hits 90% and sits there.
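Prefetching is the item on that list that maps most directly to code. A minimal sketch, assuming an async app: `Prefetcher` is an illustrative helper that starts a request early and hands back the same in-flight task when the user actually asks. It never evicts entries and keeps failed tasks, both of which a real version would need to handle.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict


class Prefetcher:
    """Start likely-next requests early and reuse the in-flight task later."""

    def __init__(self, fetch: Callable[[str], Coroutine[Any, Any, str]]):
        self._fetch = fetch
        self._tasks: Dict[str, asyncio.Task] = {}

    def prefetch(self, item_id: str) -> None:
        # Must be called from inside a running event loop.
        if item_id not in self._tasks:
            self._tasks[item_id] = asyncio.create_task(self._fetch(item_id))

    async def get(self, item_id: str) -> str:
        self.prefetch(item_id)  # no-op if the request was already started
        return await self._tasks[item_id]


async def demo() -> tuple:
    calls = []

    async def fake_summary(item_id: str) -> str:  # hypothetical slow model call
        calls.append(item_id)
        await asyncio.sleep(0.05)
        return f"summary of {item_id}"

    prefetcher = Prefetcher(fake_summary)
    prefetcher.prefetch("item-2")   # user is still reading item-1
    await asyncio.sleep(0.05)       # time the user spends on the current item
    result = await prefetcher.get("item-2")  # most of the wait already happened
    return result, calls


result, calls = asyncio.run(demo())
print(result, calls)
```

The trade-off is cost: every wrong prediction is a model call nobody used, so prefetch only where the next request is genuinely likely.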

Measure before optimizing

Don't guess which part is slow — instrument it. Log time-to-first-token and total time per request type in production, and look at percentiles rather than averages: an average of two seconds can hide a p95 of nine, and the p95 users are the ones who churn. You'll usually find a few request types dominate the pain. Optimize those, then measure again to confirm the change helped. Chasing latency you didn't measure wastes effort on things that were never the problem.
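To see how an average hides the tail, here is a small sketch using nearest-rank percentiles. The latency samples are invented for illustration; in practice they would come from your request logs.

```python
import math
from statistics import mean
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in seconds: most requests are fast, a few are very slow.
latencies = [1.0] * 17 + [2.0] + [9.0] * 2

print(f"mean={mean(latencies):.2f}s")
print(f"p50={percentile(latencies, 50)}s p95={percentile(latencies, 95)}s")
```

Here the mean comes out under two seconds while the p95 is nine, which is the gap a dashboard of averages would never show you.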

Summary

Fast AI features come from a stack of techniques, not one trick: stream responses so they feel instant, use the smallest model that passes your evals, cache repeated work at every layer, keep prompts and outputs lean, parallelize independent calls, and hide the unavoidable waits behind honest UX. Measure where the time actually goes before you optimize, and measure again after. Latency is rarely a dead end — it's an engineering problem, and these are the levers that solve it.