On-Device vs Cloud AI: Where Should Inference Run?

When you add AI to an app, one decision shapes everything else: does the model run on the user's device or in the cloud? It affects your privacy story, your costs, your offline behaviour, and how capable the feature can be. It also affects things people forget to budget for: app size, battery complaints, and how quickly you can fix bad model behaviour once it's in the wild. Neither option is universally right, and the trade-offs are worth understanding properly before you commit, because switching later is expensive.

What "on-device" really means

On-device (or "edge") AI runs the model directly on the phone using its CPU, GPU, or dedicated neural hardware — via frameworks like Apple's Core ML or Google's ML Kit and LiteRT (the successor to TensorFlow Lite). The data never leaves the device. This is how features like on-device dictation, photo classification, live translation, and small local language models work. Both platforms have leaned hard into this in recent years: modern iPhones and flagship Androids ship neural processing units specifically so this class of feature is practical.

Strengths:

Privacy. Data stays on the device — a genuinely strong selling point, and sometimes a legal or policy requirement. "Your photos never leave your phone" is a sentence users understand instantly.
Offline. Works with no connection — on a plane, in a basement, in markets where data is expensive.
No per-request cost. Once shipped, running it is free to you. Ten users or ten million, your inference bill is zero.
Low latency. Small models respond with no network round trip, which matters enormously for anything interactive, like live camera features.

Limits:

Capability ceiling. Phones can't run the largest models. On-device models are smaller, and for open-ended language tasks the quality gap against a frontier cloud model is still very real.
App size and battery. Bundled models bloat the download, and a model that pins the GPU will drain the battery and heat the device, which users notice and review-bomb you for.
Fragmentation. Performance varies wildly across the device fleet your users actually own. The feature that flies on this year's flagship may be unusable on a four-year-old budget phone, and you have to decide what happens on those devices.
Slow iteration. Fixing model behaviour means shipping an app update and waiting for users to install it. There's no server-side hotfix.
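
One way to handle the fragmentation problem above is to decide up front what each class of device gets. The sketch below is only an illustration: the DeviceProfile fields, the three tiers, and every threshold are hypothetical placeholders, and a real app would read the equivalent values from platform APIs and tune the cut-offs against its own device data.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL_MODEL = "full on-device model"
    SMALL_MODEL = "smaller, lower-quality model"
    DISABLED = "feature hidden or routed elsewhere"


@dataclass(frozen=True)
class DeviceProfile:
    # Hypothetical fields for illustration only.
    ram_gb: float
    has_npu: bool
    warmup_ms: float  # time taken by one warm-up inference at first launch


def choose_mode(device: DeviceProfile) -> Mode:
    """Pick a feature tier for a device. All thresholds are placeholders."""
    if device.has_npu and device.ram_gb >= 6 and device.warmup_ms <= 50:
        return Mode.FULL_MODEL
    if device.ram_gb >= 3 and device.warmup_ms <= 200:
        return Mode.SMALL_MODEL
    # Too slow or too little memory: better to say so than to ship a bad experience.
    return Mode.DISABLED


flagship = DeviceProfile(ram_gb=8, has_npu=True, warmup_ms=30)
old_budget_phone = DeviceProfile(ram_gb=2, has_npu=False, warmup_ms=900)
print(choose_mode(flagship).value)
print(choose_mode(old_budget_phone).value)
```

Measuring a warm-up inference on the device itself, rather than guessing from the model name, is a design choice here: it accounts for thermal state and memory pressure that a static device list would miss.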

What cloud AI gives you

Cloud AI sends the request to a server (yours or a provider's) where a large model runs, then returns the result.

Strengths:

Full capability. Access to the largest, most capable models — for complex reasoning, coding, or nuanced writing there's currently no on-device substitute.
Consistency. Every user gets the same performance regardless of their phone.
Instant updates. Improve the model or prompt server-side with no app release. When something goes wrong, you can fix it for everyone in minutes.
Small app. No heavy model in the binary, no download-size hit.

Limits:

Requires connectivity. No network, no feature — and flaky networks give you slow, half-working behaviour that's arguably worse than a clean failure.
Per-request cost. You pay for every call, forever. This is manageable at launch and can become your biggest line item at scale — worth modelling honestly before you commit (I've written more about keeping LLM costs under control).
Privacy considerations. Data leaves the device, which you must disclose in your privacy policy and app-store data declarations, and handle responsibly. For some categories of data (health, messages, anything regulated) this is where the conversation ends.
Network latency. Every request pays a round-trip tax, and tail latency on mobile networks is brutal: your p95 user experience is set by your users' slowest cell connections, not your data centre.
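
To make the tail-latency point concrete, here is a small sketch that compares the median with the 95th percentile. The latency samples are made up for illustration, not measurements; the helper name percentile is my own, built on Python's standard statistics.quantiles.

```python
import statistics

# Made-up round-trip times in milliseconds for 20 requests. Most come from
# good connections; a few come from congested or weak cellular links.
latencies_ms = [180, 190, 200, 200, 210, 210, 220, 220, 230, 240,
                250, 260, 270, 290, 320, 400, 650, 1200, 2500, 4000]


def percentile(samples, pct):
    """Return the pct-th percentile (1 to 99) of samples."""
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    return cut_points[pct - 1]


p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50 = {p50:.0f} ms, p95 = {p95:.0f} ms")
```

With samples shaped like these, the median looks healthy while the p95 is roughly ten times worse, which is why averages from a data-centre dashboard say little about what users on poor connections actually see.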

A framework for deciding

Ask these questions in order — the earlier ones can settle the decision on their own:

Is the data sensitive? If it's highly personal and users expect it to stay private, that pushes hard toward on-device. Regulation or platform policy can make this a requirement, not a preference.
Does it need to work offline? If yes, at least the core of the feature has to run on-device, full stop.
How capable must the model be? Classification, extraction, OCR, and simple transformations fit comfortably on-device; open-ended generation and multi-step reasoning usually need the cloud.
What's the cost at scale? A free on-device model looks very different from a per-call cloud bill once you multiply by real usage. Do the arithmetic with actual numbers: requests per user per day × users × cost per request.
How often will the model change? Frequent iteration favours the cloud's instant updates. A stable, well-understood task (say, blurring faces in photos) can happily live on-device for years.
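
The cost arithmetic from the fourth question is short enough to write down. This is a rough sketch: the usage figure and the price per request are placeholders I picked for illustration, not real provider pricing, and the 30-day month is an assumption to turn the daily figure into a monthly one.

```python
def daily_cloud_cost(requests_per_user_per_day: float,
                     users: int,
                     cost_per_request: float) -> float:
    """requests per user per day x users x cost per request."""
    return requests_per_user_per_day * users * cost_per_request


def monthly_cloud_cost(requests_per_user_per_day: float,
                       users: int,
                       cost_per_request: float,
                       days_per_month: int = 30) -> float:
    """Daily cost scaled to a month (30 days assumed by default)."""
    daily = daily_cloud_cost(requests_per_user_per_day, users, cost_per_request)
    return daily * days_per_month


# Placeholder inputs: 5 requests per user per day at $0.002 per request.
for users in (1_000, 100_000, 1_000_000):
    cost = monthly_cloud_cost(5, users, 0.002)
    print(f"{users:>9,} users: about ${cost:,.0f} per month")
```

The same per-request price that is a rounding error at a thousand users becomes a six-figure monthly bill at a million, while the on-device equivalent stays at zero. Swap in your own measured usage and your provider's actual pricing before drawing conclusions.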

In my experience the first two questions decide it more often than people expect. Teams spend weeks comparing model quality when the real constraint was always "this data can't leave the phone" or "field workers use this with no signal."
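
The "in order, and the early questions can settle it" idea can be sketched as a function that returns as soon as one question decides. Everything here is illustrative: the FeatureProfile fields are hypothetical names of mine, the budget comparison is a crude stand-in for real cost modelling, and the final default is an assumption rather than a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureProfile:
    # Hypothetical yes/no answers to the five questions.
    data_must_stay_on_device: bool
    must_work_offline: bool
    needs_open_ended_generation: bool
    projected_monthly_cloud_cost: float
    monthly_budget: float
    changes_frequently: bool


def recommend(feature: FeatureProfile) -> tuple[str, str]:
    """Return (placement, reason), stopping at the first question that decides."""
    # Questions 1 and 2 are hard constraints, so they are checked first.
    if feature.data_must_stay_on_device:
        return "on-device", "the data cannot leave the device"
    if feature.must_work_offline:
        return "on-device", "the feature must work with no connection"
    # Questions 3 to 5 are trade-offs rather than constraints.
    if feature.needs_open_ended_generation:
        return "cloud", "the task needs a larger model"
    if feature.projected_monthly_cloud_cost > feature.monthly_budget:
        return "on-device", "the cloud bill exceeds the budget at scale"
    if feature.changes_frequently:
        return "cloud", "frequent iteration favours server-side updates"
    # Assumed default for a stable, simple task with no pressure either way.
    return "on-device", "a stable, simple task fits on-device"


field_app = FeatureProfile(
    data_must_stay_on_device=False,
    must_work_offline=True,
    needs_open_ended_generation=True,
    projected_monthly_cloud_cost=500.0,
    monthly_budget=2_000.0,
    changes_frequently=True,
)
print(recommend(field_app))
```

In the example, the offline requirement settles the answer before model capability is even considered. A profile where an early constraint conflicts with a later need (private data plus open-ended generation, say) is a sign to look at a hybrid design or to rethink the feature, which a simple function like this does not capture.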

The hybrid pattern

Increasingly, the best answer is both. A common architecture:

On-device handles the common, privacy-sensitive, latency-critical cases — and works offline.
The cloud handles the rare, heavy requests that need a bigger model.

The app tries the local model first and only escalates to the cloud when the task exceeds what it can do locally — with the user's knowledge, if data sensitivity demands it. Users get privacy and speed for everyday use, and power when they need it; you keep cloud costs down because only the hard requests ever leave the device. This is essentially how the platform vendors themselves architect their assistant features, with private cloud tiers backing on-device models, and it's a pattern that ages well: as on-device models improve, more traffic naturally stays local and your cloud bill shrinks without a rewrite.
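
A minimal sketch of that local-first routing follows. The routing rule is an assumption: here the local model is imagined to report a confidence score and the app escalates below a threshold, but a real app might route on task type, input length, or an explicit user action instead. The names route, LocalResult, Answer, and the demo stand-ins are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LocalResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical local model


@dataclass(frozen=True)
class Answer:
    text: str
    source: str  # "local" or "cloud"


def route(prompt: str,
          run_local: Callable[[str], LocalResult],
          run_cloud: Callable[[str], str],
          is_online: bool,
          user_allows_cloud: bool,
          min_confidence: float = 0.8) -> Answer:
    """Try the local model first; escalate only when needed and permitted."""
    local = run_local(prompt)
    if local.confidence >= min_confidence:
        return Answer(local.text, "local")
    # Escalate only with a connection and the user's consent.
    if is_online and user_allows_cloud:
        try:
            return Answer(run_cloud(prompt), "cloud")
        except (ConnectionError, TimeoutError):
            pass  # Network failed mid-request: fall through to the local answer.
    return Answer(local.text, "local")


def demo_local(prompt: str) -> LocalResult:
    # Stand-in: pretend the local model is only confident on short prompts.
    confidence = 0.9 if len(prompt) < 40 else 0.3
    return LocalResult(text=f"[local] {prompt}", confidence=confidence)


def demo_cloud(prompt: str) -> str:
    # Stand-in for a network call to a larger model.
    return f"[cloud] {prompt}"


print(route("Translate 'hello'", demo_local, demo_cloud, True, True).source)
```

Falling back to the weaker local answer when the cloud call fails keeps the feature working offline, at the price of a lower-quality result; whether to show that result or an honest "try again later" depends on the feature.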

The cost of hybrid is complexity. You maintain two inference paths, two sets of failure modes, and a routing rule that needs tuning. For a first release, it's often smarter to ship one path — whichever the framework above points to — and add the second once you have real usage data telling you where it hurts.

Common mistakes worth avoiding

Choosing cloud by default because it's easier to prototype. It is — but by the time cost or privacy forces a migration, the feature's behaviour is defined by a big model that no on-device model can match, and downgrading quality is painful.
Choosing on-device for a task that's still evolving. If you're iterating weekly on what the feature even does, shipping model updates through app releases will slow you to a crawl.
Ignoring the worst devices. Test on the oldest, cheapest hardware you claim to support, not just your own phone. Thermal throttling and memory pressure on low-end devices break demos that looked perfect in development.
Forgetting the disclosure work. Cloud inference means privacy-policy updates, App Store and Play data-safety declarations, and honest UI copy. It's not just an engineering decision.

Summary

There's no universal winner. On-device AI wins on privacy, offline support, and zero marginal cost but is limited in capability and slow to update; cloud AI wins on power, consistency, and iteration speed but costs money on every request and needs a network. Decide by working through data sensitivity, offline needs, required capability, cost at scale, and iteration speed — in that order — and don't overlook the hybrid approach, which often gives you the best of both once the feature has settled down.