How to Add AI to a Flutter App with LLMs

Adding AI to a mobile app used to mean training a model, wrestling with conversion tools, and shipping weights inside the binary. Today, most "AI features" are a thin, well-designed layer between your app and a large language model (LLM). The hard part is no longer the model — it's the engineering around it: where the keys live, how the response reaches the screen, what happens when it fails, and who pays when a feature gets popular. This guide walks through how to add an LLM-powered feature to a Flutter app in a way that is fast, cheap, and safe to ship — and it applies almost unchanged to native iOS/Android or React Native, because the architecture is the point.

Never call the model directly from the app

The single most important rule, and the one that shapes everything else: your app should never talk to the LLM provider directly. If you embed an API key in a Flutter build, it will be extracted — anyone can unzip an APK or proxy the app's traffic, pull the key, and run up your bill until the provider suspends you. This isn't a theoretical risk; scraping published apps for embedded keys is a hobby industry. (I've written more in Securing API Keys in Mobile Apps.)

Instead, put a small backend between the app and the model:

Flutter app  →  your backend (holds the key)  →  LLM provider

Your backend can be genuinely tiny — a few dozen lines of Python (FastAPI is a natural fit) or Go. It does three jobs: hold the secret key, enforce per-user rate limits, and shape the request/response. It's also where everything you'll want later gets bolted on without touching the app: caching, logging, prompt changes, model swaps, A/B tests. The day you want to switch providers or tighten a prompt, you deploy a server change instead of shepherding an app release through store review — that alone pays for the backend many times over.

The request/response shape

Keep the contract between app and backend boring and explicit. The app sends the user's input plus any context; the backend returns structured data with named fields, not raw model text your UI has to guess at.

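// Requires: import 'dart:convert'; import 'package:http/http.dart' as http;
final res = await http.post(
  Uri.parse('https://api.yourbackend.com/assist'),
  headers: {
    'Authorization': 'Bearer $sessionToken',
    'Content-Type': 'application/json', // a String body defaults to text/plain otherwise
  },
  body: jsonEncode({'prompt': userText, 'context': recentItems}),
);
if (res.statusCode != 200) throw Exception('assist failed: ${res.statusCode}');
final data = jsonDecode(res.body) as Map<String, dynamic>; // { "summary": "...", "actions": [...] }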

Note what the app is not sending: the system prompt, the model name, the temperature. All of that belongs server-side, where you can iterate freely. The app's vocabulary is domain terms — "summarize this," "suggest a category" — and the backend translates those into whatever the model needs today. On the server, use the provider's structured output features and validate the model's response against a schema before returning it, so the app can trust every field it receives. A mobile app is the worst possible place to handle a malformed model response; make sure it never sees one.
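As a rough illustration of the "validate against a schema" step, here is a minimal sketch in plain Dart, kept in the same language as the app snippets. On a Python or Go backend you would more likely reach for a schema library. The names AssistResult and validateModelOutput are hypothetical, and the two fields simply mirror the example payload above.

```dart
import 'dart:convert';

/// Typed form of the example `{ "summary": ..., "actions": [...] }` contract.
class AssistResult {
  const AssistResult(this.summary, this.actions);
  final String summary;
  final List<String> actions;

  Map<String, dynamic> toJson() => {'summary': summary, 'actions': actions};
}

/// Parses raw model output and rejects anything that does not match the
/// contract. Throws [FormatException] on invalid JSON or a shape mismatch.
AssistResult validateModelOutput(String raw) {
  final Object? decoded = jsonDecode(raw); // throws FormatException on bad JSON
  if (decoded is! Map<String, dynamic>) {
    throw const FormatException('expected a JSON object');
  }
  final summary = decoded['summary'];
  final actions = decoded['actions'];
  if (summary is! String || summary.trim().isEmpty) {
    throw const FormatException('"summary" must be a non-empty string');
  }
  if (actions is! List || actions.any((a) => a is! String)) {
    throw const FormatException('"actions" must be a list of strings');
  }
  return AssistResult(summary, actions.cast<String>());
}
```

If validation throws, the backend can retry the model call once or return a clean error status; either way the app only ever receives a payload that passed the check.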

Stream the response for perceived speed

LLMs are slow to produce a full answer but fast to produce the first token. A response that streams word-by-word feels dramatically faster than a spinner that sits for six seconds — total time identical, experience transformed. Expose a streaming endpoint (Server-Sent Events works well and is simpler than WebSockets) and render tokens as they arrive:

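final request = http.Request('POST', uri)..body = payload;
final response = await request.send();
// Assumes SSE framing: one `data: <text>` line per token. Keep the
// subscription (a new field here) so a stop button can cancel it.
_subscription = response.stream
    .transform(utf8.decoder)
    .transform(const LineSplitter())
    .where((line) => line.startsWith('data: '))
    .listen((line) => setState(() => _answer += line.substring(6)));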

Two things to get right beyond the happy path: show a subtle "thinking" indicator before the first token lands (there's still a gap while the model reads the prompt), and wire a stop button that actually cancels the request end-to-end — it saves the user's time and your tokens. Streaming has its own production pitfalls (buffering proxies are the classic one); the streaming post covers the full checklist. For short structured outputs — a category, a title — skip streaming entirely; it only earns its keep on text a human reads as it appears.
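The "thinking" state and the stop button can be sketched as a small UI-agnostic class. This is only one possible shape, assuming the tokens already arrive as a Stream<String>. AnswerSession and AnswerPhase are hypothetical names; in a widget, onChange would typically just call setState.

```dart
import 'dart:async';

enum AnswerPhase { idle, thinking, streaming, done, stopped, failed }

/// Tracks one streamed answer and exposes enough state to drive the UI.
class AnswerSession {
  AnswerSession({required this.onChange});
  final void Function() onChange;

  AnswerPhase phase = AnswerPhase.idle;
  String text = '';
  StreamSubscription<String>? _sub;

  void start(Stream<String> tokens) {
    phase = AnswerPhase.thinking; // shown until the first token lands
    text = '';
    onChange();
    _sub = tokens.listen(
      (token) {
        phase = AnswerPhase.streaming;
        text += token;
        onChange();
      },
      onError: (Object _) => _finish(AnswerPhase.failed),
      onDone: () => _finish(AnswerPhase.done),
      cancelOnError: true,
    );
  }

  /// Stop button: cancelling the subscription propagates to the source stream.
  /// Text that already arrived is kept.
  Future<void> stop() async {
    final sub = _sub;
    if (sub == null) return; // nothing in flight
    _sub = null;
    await sub.cancel();
    _finish(AnswerPhase.stopped);
  }

  void _finish(AnswerPhase end) {
    _sub = null;
    phase = end;
    onChange();
  }
}
```

Cancelling on the client is only half of "end-to-end": the backend also has to notice the dropped connection and abort its upstream model call, otherwise the tokens are still generated and billed.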

Control cost before it controls you

LLM calls cost money per token, mobile users tap buttons a lot, and, unlike most server costs, the bill scales one-to-one with engagement. Three cheap safeguards, all server-side:

Rate-limit per user (e.g. N requests per minute, M per day). This stops abuse, runaway retry loops, and the one enthusiastic user who would otherwise be 40% of your bill.
Cache identical requests. Many prompts repeat across users; a hash-keyed cache turns repeats into free, instant responses.
Pick the smallest model that works. Default to a fast, cheap model and escalate to a larger one only for requests that genuinely need it — most mobile AI tasks (categorize, summarize, rewrite) don't. A small evaluation set tells you what "works" means with numbers instead of vibes.
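The first two safeguards fit in a few dozen lines. The sketch below is written in Dart only to stay in one language with the app snippets; the logic ports directly to a Python or Go backend. RateLimiter and ResponseCache are hypothetical names, the limiter uses a simple fixed window, and the cache is an in-memory map with no TTL or size bound, which a real deployment would need.

```dart
/// Fixed-window limiter: at most [maxRequests] per [window] for each user.
class RateLimiter {
  RateLimiter({
    required this.maxRequests,
    required this.window,
    DateTime Function()? now, // injectable clock, mainly for tests
  }) : _now = now ?? (() => DateTime.now());

  final int maxRequests;
  final Duration window;
  final DateTime Function() _now;
  final Map<String, _Window> _windows = {};

  /// Returns true and counts the request if the user is under the limit.
  bool allow(String userId) {
    final now = _now();
    var w = _windows[userId];
    if (w == null || now.difference(w.start) >= window) {
      w = _windows[userId] = _Window(now); // start a fresh window
    }
    if (w.count >= maxRequests) return false;
    w.count++;
    return true;
  }
}

class _Window {
  _Window(this.start);
  final DateTime start;
  int count = 0;
}

/// Cache keyed on the task plus whitespace-normalized input.
class ResponseCache {
  final Map<String, String> _store = {};
  int hits = 0;

  static String keyFor(String task, String input) =>
      '$task::${input.trim().replaceAll(RegExp(r'\s+'), ' ')}';

  /// Returns the cached response, or calls the model once and stores the result.
  Future<String> getOrCompute(
    String task,
    String input,
    Future<String> Function() callModel,
  ) async {
    final key = keyFor(task, input);
    final cached = _store[key];
    if (cached != null) {
      hits++;
      return cached;
    }
    return _store[key] = await callModel();
  }
}
```

Dart's Map hashes the string key internally. With a shared cache such as Redis, you would typically store a digest of the same normalized string instead. Only cache requests whose answer does not depend on per-user context.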

Also debounce anything wired to typing, cap max_tokens on every call, and set a billing alert with your provider before launch, not after the first surprising invoice. The cost-control post goes deeper on all of this.
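Debouncing is the one safeguard here that lives in the app rather than on the server. A minimal sketch using dart:async's Timer follows; the Debouncer class name and the delay value are illustrative choices, not something the post prescribes.

```dart
import 'dart:async';

/// Runs an action only after calls have stopped for [delay].
class Debouncer {
  Debouncer(this.delay);
  final Duration delay;
  Timer? _timer;

  /// Schedules [action], replacing any action that has not fired yet.
  void run(void Function() action) {
    _timer?.cancel();
    _timer = Timer(delay, action);
  }

  /// Call from the widget's dispose() so nothing fires after teardown.
  void dispose() => _timer?.cancel();
}
```

Wired to a text field's onChanged callback with a delay of a few hundred milliseconds, this sends one request after the user pauses instead of one per keystroke.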

Design for failure

The network will drop mid-response, the provider will occasionally return errors or have an outage, and the model will sometimes produce nonsense with total confidence. Treat AI features as fallible by default:

Always show a graceful "couldn't generate that — try again" state, and keep whatever partial output already arrived rather than blanking the screen.
Never let an AI response silently overwrite user data. Suggestions get applied when the user accepts them; destructive actions get confirmed.
Time out deliberately (LLM calls can hang), and make retries explicit user actions rather than silent loops that triple your costs.
Log failures on the backend so you know your real-world failure rate. "It works when I try it" and "it works for users on hotel Wi-Fi" are different claims.
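On the app side, the first and third points can be sketched together: collect tokens with a deliberate idle timeout, and on any failure return whatever text already arrived instead of throwing it away. The names GenerationResult and collect, and the 20-second default, are assumptions for illustration.

```dart
import 'dart:async';

/// Outcome of one generation attempt. [text] holds whatever arrived,
/// even when the attempt failed.
class GenerationResult {
  const GenerationResult(this.text, {this.error});
  final String text;
  final Object? error;
  bool get ok => error == null;
}

/// Collects tokens until the stream ends, errors, or goes quiet for
/// [idleTimeout]. It never retries on its own; the caller decides whether
/// to offer a "try again" button.
Future<GenerationResult> collect(
  Stream<String> tokens, {
  Duration idleTimeout = const Duration(seconds: 20),
  void Function(String partial)? onUpdate,
}) async {
  final buffer = StringBuffer();
  try {
    // Stream.timeout emits a TimeoutException if no event arrives in time.
    await for (final token in tokens.timeout(idleTimeout)) {
      buffer.write(token);
      onUpdate?.call(buffer.toString());
    }
    return GenerationResult(buffer.toString());
  } catch (e) {
    return GenerationResult(buffer.toString(), error: e);
  }
}
```

When ok is false, the UI can keep showing result.text, add the "couldn't generate that" message beneath it, and leave the retry to an explicit tap.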

Mobile adds one more: users background the app mid-request. Decide what they see when they return — a resumed stream, a completed result, or an honest "that didn't finish."

Keep the UX honest

Label AI output as AI-generated, give users an easy way to edit or dismiss it, and never present a model guess as a certain fact — especially for anything involving numbers, dates, or money, where a confident wrong answer erodes trust in the entire app, not just the feature. The pattern that consistently works: AI drafts, user confirms. An editable suggestion the user approves in one tap feels magical; an auto-committed guess the user discovers was wrong three weeks later feels like a betrayal.
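One way to make "AI drafts, user confirms" hard to violate by accident is to keep the model's proposal in its own object, so nothing reaches user data until an explicit accept. This is a sketch under that assumption; Suggestion is a hypothetical class, not part of Flutter.

```dart
/// A model suggestion that stays separate from user data until accepted.
class Suggestion<T> {
  Suggestion(this.proposed);

  /// The draft value. The user may edit it before accepting.
  T proposed;
  bool _resolved = false;

  bool get isPending => !_resolved;

  /// Writes the (possibly edited) value through [commit]. Call this only
  /// from an explicit user action such as an "Apply" tap.
  void accept(void Function(T value) commit) {
    if (_resolved) return; // ignore double taps
    _resolved = true;
    commit(proposed);
  }

  /// Discards the suggestion without touching user data.
  void dismiss() => _resolved = true;
}
```

Because the only path to user data goes through accept, a background refresh or a late-arriving response cannot overwrite anything the user did not approve.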

Both app stores also have review expectations around AI-generated content and data disclosure, so say plainly in your privacy policy what leaves the device and where it goes. Honest UX and smooth store review turn out to be the same work.

Summary

A good AI feature in Flutter is 10% model and 90% plumbing: a backend that holds your keys and enforces limits, a boring JSON contract with validation on the server side, streaming for perceived speed, hard cost ceilings, and a UI that assumes the model can be wrong and lets the user stay in charge. Get that scaffolding right and you can swap models, tune prompts, and grow the feature for years without ever touching the app binary again, which, in a world where every release waits on store review, is exactly where you want to be.