How to Choose an LLM for Your App

There are more capable language models available today than ever, and picking one can feel paralysing: multiple frontier providers, a dozen sizes each, open-weight models you can host yourself, and a leaderboard that reshuffles every few months. The good news: you don't need the "best" model. You need the right model for a specific job, and the right model is usually not the biggest or most expensive one. Here's the framework I actually use.

Start from the task, not the model

Before comparing models, describe the job precisely:

Is it simple (classification, extraction, short rewrites) or hard (multi-step reasoning, coding, nuanced writing)?
How long is the input? A quick reply, or a whole document?
Does it run in the background, or is a user waiting for it in real time?
What does a failure cost? A mislabelled support ticket is annoying; a wrong answer shown confidently to a paying customer is a different class of problem.

Most app features are simple tasks that small, fast, cheap models handle perfectly. Summarising a review, extracting fields from an email, tagging content, rewriting a sentence in a friendlier tone — none of these needs a frontier model. People reach for the flagship out of habit and overpay for capability they never use, then wonder why the AI feature is the most expensive line in the budget.

Writing the task down also exposes a surprisingly common discovery: part of the job doesn't need an LLM at all. Validation, formatting, and lookups are still better done in ordinary code.

The five dimensions that matter

1. Quality. Can the model actually do the task reliably? Test it on your real inputs, not a generic benchmark. Public benchmarks measure performance on benchmark tasks — which correlates loosely, at best, with performance on your weird, domain-specific inputs. A model that tops a reasoning leaderboard can still be mediocre at extracting line items from your invoice format.

2. Latency. How fast does it respond, and how fast does it start responding? For anything a user waits on, a smaller model that answers in one second often beats a smarter one that takes eight — and if you're streaming the response, time-to-first-token matters more than total time. For background jobs, latency barely matters and you can trade it away for quality or cost.

3. Cost. Priced per token, and it adds up fast at scale. Estimate honestly: average input + output tokens per request × requests per user per day × your user count. Run that arithmetic before you fall in love with a model. A model that's "only" a few times more expensive per token can blow your budget once you have real traffic — and cost per token also differs between input and output, which matters a lot if your prompts are long but answers short (or vice versa).

4. Context length. How much text can the model consider at once? If you're feeding it long documents or lots of retrieved context, you need a window big enough for the job — but huge contexts cost real money on every call, and models don't attend to a 200-page dump as reliably as a focused excerpt. Don't pay for headroom you won't use; don't use headroom as a substitute for retrieval.

5. Privacy and deployment. Can you send this data to a third-party API at all? For sensitive data you may need a provider with strong data-handling guarantees (most major providers now offer no-training-on-API-data terms — read them, don't assume), a region-pinned deployment, or an open-weight model you host yourself. This constraint is binary and can override everything else, so check it first, not last.

Use a tiered strategy

The most cost-effective production systems rarely use one model for everything. They route:

A small, fast, cheap model handles the bulk of simple requests.
A larger model is called only for the requests that genuinely need more capability.
The cheapest possible model (or plain code) handles trivial cases — you don't need an LLM to check if a string is empty.

The routing rule doesn't need to be clever to be effective. It can be as dumb as "inputs over N words go to the big model" or "if the small model reports low confidence, escalate." This right-sizing often cuts costs dramatically with no visible drop in quality, because most requests were never hard to begin with. Every provider's lineup is structured around exactly this pattern — a flagship, a mid-size workhorse, and a small fast model — which tells you how the people who run these models at the largest scale expect them to be used.

Don't marry a single provider

Model quality, pricing, and availability change constantly — the best choice this quarter may be second-best by next quarter. Build a thin abstraction so swapping the underlying model is a config change, not a rewrite:

your code  →  llm(prompt, model="small")  →  provider adapter

Keep provider-specific details behind that one function: authentication, retry logic, response parsing, streaming. When a better or cheaper model appears — and it will — you switch in minutes instead of days. This also gives you a natural place for fallback logic when a provider has an outage, which every provider eventually does.

One caution: keep the abstraction thin. Heavyweight framework layers that promise provider independence often cost more in debugging opacity than they save in switching effort. A hundred lines of your own adapter code is usually the sweet spot.

Test before you commit

Assemble 20–50 real examples from your actual use case — real user inputs if you have them, realistic drafts if you don't — with the output you'd consider correct for each. Run your candidate models against them and compare quality, speed, and cost side by side in a spreadsheet. This half-day of work will teach you more than any leaderboard, and it becomes the seed of a proper evaluation set that pays dividends for the life of the feature: every future model release gets judged against the same bar, in an afternoon, with numbers instead of vibes.

While you're at it, test the failure modes, not just the happy path. Feed each candidate an empty input, an off-topic one, and something adversarial. Models differ noticeably in how gracefully they fail, and graceful failure is worth a lot in production.

Revisit the decision on a schedule

Whatever you choose will be the wrong choice eventually — prices drop, new models ship, your feature's requirements drift. Because you kept the model behind an abstraction and kept your evaluation set current, re-running the bake-off twice a year is cheap. Teams that do this routinely find they can move a tier down (or hold quality and cut cost) every year or so, just because the whole field keeps improving underneath them.

Summary

Choosing an LLM isn't about finding the smartest model — it's about matching capability to the task across quality, latency, cost, context, and privacy. Check the hard constraints (privacy, deployment) first, default to the smallest model that passes your real-world tests, escalate only for requests that need it, and keep the model behind a thin abstraction so you can always swap in something better. Then re-test on a schedule, because in this field the ground moves.