Keeping LLM Costs Under Control in Production

An AI feature is one of the few parts of your app with a per-use marginal cost. Every request burns tokens, tokens cost money, and — unlike server costs, which grow in comfortable steps — the bill scales directly and immediately with how much users love the feature. Success and spend are the same curve. Teams that ignore this discover it at invoice time, usually in the month after something went well.

The good news is that LLM spend responds to engineering discipline better than almost any other cloud cost, because the levers are all in your hands: which model you call, what you send it, how often, and whether you needed to call it at all. The playbook below is how you stay ahead of it.

Measure before you optimize

You can't control what you can't see. Log, for every LLM call: the feature that triggered it, input tokens, output tokens, and the model used. Providers return token counts in every response — store them; this is an afternoon of work. Within a day of real traffic you'll know which feature dominates spend, and it's often not the one you'd guess. A chat screen that resends a growing history every turn can quietly cost more per user than your headline AI feature, because history growth is invisible in testing and relentless in production.

Two more habits worth adopting on day one:

Set a spending alert with your provider. Not to be dramatic — a bug that retries in a loop, or a scraper hammering an unauthenticated endpoint, can do real damage before a human notices. An alert turns "surprise invoice" into "surprise Slack message," which is a much better class of surprise.
Track cost per feature per day, not just the total. Totals tell you that spend went up; per-feature numbers tell you why, and whether it was growth (fine) or regression (fix it).

Use the cheapest model that passes your evals

The most expensive habit in AI development is defaulting every call to the flagship model. Most production tasks — classification, extraction, reformatting, short summaries — work fine on small, fast models that cost a fraction as much. The price gap between a provider's smallest and largest models is typically an order of magnitude or more, so this single decision dwarfs every other optimization in this post. Reserve the big model for the tasks that measurably need it.

The disciplined version of this is routing: try the cheap model first, detect low-confidence or failed outputs, and escalate only those to the stronger model. But don't over-engineer it — even routing by task type alone (cheap model for autocomplete, strong model for long-form generation) captures most of the savings with none of the complexity.

The load-bearing phrase here is passes your evals. This only works if you have an evaluation set — real inputs with known-good outputs — that tells you whether the cheap model is actually good enough. Without evals, every model downgrade is a guess, and the first bad guess teaches teams to fearfully overpay forever. With evals, downgrading is a routine, reversible experiment. It's also how you capture savings over time: models get cheaper and better every year, and teams that re-run their evals against new releases keep finding they can drop a tier at the same quality.

Attack the input tokens

In most apps, input tokens dwarf output tokens: you send the system prompt, instructions, examples, history, and retrieved context on every call, and the model sends back a paragraph.

Trim the system prompt. Prompts accrete instructions over time like config files — every incident adds a rule, nobody removes one. Audit it quarterly; delete what no longer changes behavior (your evals will tell you).
Prune chat history. Don't send the entire conversation forever. Keep the last few turns verbatim plus a running summary of older ones. Users almost never notice; the difference on turn forty is enormous.
Retrieve less. If you're doing RAG, sending ten chunks "to be safe" costs real money on every request. Measure whether five do the job — usually they do.
Use prompt caching. Providers offer caching for the repeated prefix of your prompt (system prompt, examples, static context), charging a much lower rate for cached tokens. It usually requires structuring your prompt so the static parts come first and the variable parts last — a small refactor for a large recurring discount. If your prompt has a big static prefix, this is the closest thing to free money in this list.

Also cap max_tokens on every call. It's your hard ceiling against a model rambling at your expense, and against a prompt bug turning into an essay generator.

Cache whole responses where repetition exists

If many users trigger the same request — summarizing the same article, explaining the same error message, describing the same product — cache the response keyed on the normalized input and skip the model entirely on repeats. Exact-match caching is trivial to build, completely safe, and in content-heavy apps the hit rate is often shockingly high, because user behaviour follows the same power law everything else does.

Semantic caching — serving a cached answer when a new question is merely similar — saves more but introduces a real risk of serving subtly wrong answers to questions that looked alike and weren't. Use it only where approximate answers are acceptable, and log cache hits so you can audit what was served.

Design the product so waste never ships

The biggest savings are product decisions, not engineering ones, and they're nearly impossible to retrofit once users have expectations:

Debounce. Never call the model on every keystroke; wait for the user to pause or submit. This one line of code has saved more LLM budget than any clever routing system.
Make expensive actions explicit. A "Summarize" button costs money when pressed. Auto-summarizing everything on screen costs money always. Defaults are where budgets go to die.
Set per-user limits. A free tier with unmetered LLM access is an open tab, and the internet will find it. Rate-limit generously enough that honest users never hit the ceiling — the limit exists for the abusers and the bugs.
Don't regenerate silently. Retries and "improve this" loops multiply cost per user action; make them visible, deliberate choices, and cap automatic retries at one or two.

It's worth doing the unit economics explicitly: cost per active user per month at current usage, versus what that user pays you. If the feature is part of a free tier, that number is a marketing expense — fine, as long as someone decided it on purpose.

The takeaway

LLM cost control is four habits: measure per feature so you know where the money goes, route to the cheapest model your evals allow, shrink and cache the tokens you send, and shape the product so the expensive path is always the deliberate one. None of these is hard individually; the compounding effect of all four is routinely a large multiple off the naive bill. Do them early — retrofitting cost discipline after a viral month is a much worse week.