Keeping LLM Costs Under Control in Production
An AI feature is one of the few parts of your app with a per-use marginal cost. Every request burns tokens, tokens cost money, and — unlike server costs — the bill scales directly with how much users love the feature. Teams that ignore this discover it at invoice time. The playbook below is how you stay ahead of it.
Measure before you optimize
You can't control what you can't see. Log, for every LLM call: the feature that triggered it, input tokens, output tokens, and the model used. Providers return token counts in every response — store them. Within a day you'll know which feature dominates spend, and it's often not the one you'd guess. A chat screen with long histories can quietly cost more per user than your headline AI feature.
Set up a spending alert with your provider from day one. Not to be dramatic — a bug that retries in a loop can do real damage before a human notices.
Use the cheapest model that passes your evals
The most expensive habit in AI development is defaulting every call to the flagship model. Most production tasks — classification, extraction, reformatting, short summaries — work fine on small, fast models that cost a fraction as much. Reserve the big model for the tasks that measurably need it.
The disciplined version of this is routing: try the cheap model first, detect low-confidence or failed outputs, and escalate only those to the stronger model. Even routing by task type alone (cheap model for autocomplete, strong model for long-form generation) captures most of the savings with none of the complexity.
This only works if you have evals — a test set that tells you whether the cheap model is actually good enough. Without evals, model downgrades are guesses.
Attack the input tokens
In most apps, input tokens dwarf output tokens: you send the system prompt, instructions, examples, history, and retrieved context on every call.
- Trim the system prompt. Prompts accrete instructions over time like config files. Audit it; delete what no longer changes behavior.
- Prune chat history. Don't send the entire conversation forever. Keep the last few turns plus a running summary of older ones.
- Retrieve less. If you're doing RAG, sending ten chunks "to be safe" costs real money on every request. Measure whether five do the job.
- Use prompt caching. Providers offer caching for the repeated prefix of your prompt (system prompt, examples, static context), charging a much lower rate for cached tokens. If your prompt has a large static prefix, this is nearly free savings.
Also cap max_tokens on every call. It's your hard ceiling against a model rambling at your expense.
Cache whole responses where repetition exists
If many users trigger the same request — summarizing the same article, explaining the same error — cache the response keyed on the normalized input and skip the model entirely on repeats. Exact-match caching is trivial and safe. Semantic caching (serving a cached answer for a similar question) saves more but risks serving subtly wrong answers; use it only where approximate answers are acceptable.
Design the product so waste never ships
The biggest savings are product decisions, not engineering ones:
- Debounce. Never call the model on every keystroke; wait for the user to pause or submit.
- Make expensive actions explicit. A "Summarize" button costs when pressed. Auto-summarizing everything costs always.
- Set per-user limits. A free tier with unmetered LLM access is an open tab. Rate-limit generously enough that honest users never notice.
- Don't regenerate silently. Retries and "improve this" loops multiply cost; make them visible, deliberate actions.
The takeaway
LLM cost control is four habits: measure per feature, route to the cheapest model your evals allow, shrink and cache the tokens you send, and shape the product so the expensive path is the deliberate one. Do these early — retrofitting cost discipline after a viral month is a much worse week.