Fine-Tuning vs RAG vs Prompting: How to Add Knowledge to an LLM

When a language model doesn't behave the way your app needs, there are three levers you can pull: prompting, retrieval (RAG), and fine-tuning. They're often discussed as competitors — usually in threads insisting one of them is dead — but they solve different problems, and the right answer is frequently a combination. The expensive mistakes happen when teams grab the heavyweight lever first: I've seen "we need to fine-tune" turn into weeks of dataset wrangling for a problem a better system prompt fixed in an afternoon. This post lays out what each technique actually does so you can choose deliberately.

First, diagnose the problem

Before picking a technique, figure out what kind of gap you're facing:

Knowledge gap — the model doesn't know something (your docs, your data, events after its training cutoff).
Behaviour gap — the model knows enough but doesn't act the way you want (wrong tone, format, or style).
Skill gap — the model can't reliably perform a specialized task no matter how you ask.

The technique you choose should match the gap. Using fine-tuning to fix a knowledge gap, or RAG to fix a tone problem, is a common and expensive category error — like buying a bigger engine because your tires are flat.

A quick self-test that clarifies most cases: paste the missing information directly into a prompt and try the task. If the model does it well with the information in front of it, you have a knowledge gap (prompting or RAG will fix it). If it still does it badly, you have a behaviour or skill gap (better prompting first, fine-tuning if that fails).

Prompting: cheapest, start here

Prompting means shaping behaviour purely through instructions and examples in the prompt itself. No infrastructure, no training — just words, versioned in your repo like any other production prompt.

Fixes: behaviour gaps (tone, format, structure) and small knowledge gaps you can simply paste in.
Strengths: instant, free to change, no data pipeline. You can iterate in minutes, and roll back by reverting a file.
Limits: everything must fit in the context window, and you pay for those tokens on every call. It can't teach genuinely new skills — a model that can't do the task with a perfect prompt won't learn it from a longer one.

Worth noting: modern context windows are large enough that "small knowledge gap" covers more than people think. A product FAQ, a style guide, a schema — these can often just live in the prompt permanently, especially with prompt caching making the repeated tokens cheap. That's not an inelegant hack; it's the simplest architecture that works.

Rule of thumb: always try prompting first, seriously — with a well-structured prompt and a couple of good examples, not a one-liner. A surprising share of "we need to fine-tune" situations dissolve at this step.

RAG: for knowledge that changes

Retrieval-Augmented Generation fetches relevant information at question time and puts it in the prompt, so the model answers grounded in your data. (I've written a full explainer on the pipeline.)

Fixes: knowledge gaps — especially large or frequently-changing knowledge that can't ride along in every prompt.
Strengths: always up to date (change the data, not the model), can cite sources so users can verify answers, and keeps facts out of the model's weights where they'd go stale. Updating knowledge is a database write, not a training run.
Limits: adds a retrieval pipeline (chunking, embeddings, a vector store) whose quality now bounds your feature's quality — most "bad answer" bugs in RAG systems are retrieval bugs. And it doesn't change the model's behaviour or skills at all; it just changes what the model has read before answering.

Rule of thumb: if the problem is "the model doesn't know about our stuff," RAG is almost always the answer — not fine-tuning. This is the single most common confusion in the field, so it bears repeating plainly: fine-tuning is not a good way to teach a model facts.

Fine-tuning: for behaviour and skill at scale

Fine-tuning continues training the model on your examples, adjusting its weights so the new behaviour becomes built in rather than instructed.

Fixes: behaviour and skill gaps — a consistent voice, a specialized output format, or a narrow task the base model handles poorly (think: classifying domain-specific documents, writing in your app's exact microcopy style, handling a niche language pair).
Strengths: bakes the behaviour in, so you don't spend prompt tokens re-explaining it on every call; can make a small cheap model perform like a bigger one on that narrow task, which at high volume is a real cost win; and can achieve consistency that prompting alone struggles to match.
Limits: needs a quality dataset (typically hundreds of good examples — and "good" is the hard part), costs time and money, must be redone when you want changes or when you move to a newer base model, and — crucially — it is poor at adding knowledge. Facts trained in are hard to update, blend unpredictably with what the model already believed, and can still be recalled wrongly. You also take on ML-ops overhead: dataset versioning, training runs, and evaluation before every swap.

Rule of thumb: reach for fine-tuning only after prompting and RAG fall short, and only for stable behaviours or skills — never as a way to store changing facts.

They combine

These aren't mutually exclusive; strong systems layer them:

RAG + prompting: retrieve the facts, and use a well-crafted prompt to control how they're presented. This combination covers the vast majority of "chat with our data" products, and for most teams it's where the story ends — happily.
Fine-tuning + RAG: fine-tune the model for your task's style and format, then use RAG to feed it current facts. Behaviour from the weights, knowledge from retrieval — each mechanism doing the one thing it's actually good at.
Fine-tuning + prompting: even a fine-tuned model still gets a system prompt; tuning just means the prompt can be much shorter.

A quick decision guide

Need a different tone or output format? → Prompting (then fine-tuning if it must be perfect, constant, and high-volume).
Need the model to know your documents/data? → RAG.
Need a consistent specialized skill the base model can't do, and you have data to teach it? → Fine-tuning.
Need to cut costs on a narrow high-volume task? → Fine-tune a small model, judged against your evals.
Not sure? → Start with prompting, add RAG for knowledge, and only fine-tune if a measured gap remains.

One discipline makes the whole decision honest: an evaluation set. Without one, you can't actually tell whether fine-tuning beat the better prompt, or whether RAG closed the gap. With one, the comparison is an afternoon of running numbers instead of a month of opinions.

Summary

Prompting, RAG, and fine-tuning aren't ranked from worst to best — they address different gaps. Prompting shapes behaviour instantly and cheaply, and it's where every project should start; RAG supplies changing knowledge with grounding and citations; fine-tuning bakes in stable behaviours and skills but is the wrong tool for facts. Diagnose the gap first, start with the cheapest lever, measure with real evals, and combine techniques so each one does the job it's actually good at.