Tokens and Context Windows Explained

If you build with language models, two concepts govern your costs, your latency, and what your app can even do: tokens and the context window. They sound like trivia — the sort of thing you skim past on the way to the interesting parts of the docs — but misunderstanding them leads directly to surprise bills, mysteriously truncated answers, and chatbots that get slower and pricier the longer someone talks to them. Here's the practical explanation I wish every LLM tutorial started with.

What is a token?

A token is a chunk of text — roughly a word or a piece of one. Models don't read characters or whole words; they read tokens, produced by a tokenizer that splits text into pieces from a fixed vocabulary. As a rough guide for English, one token is about four characters, and 100 tokens is about 75 words. Common words are often a single token; longer or rarer words split into several. You can paste text into a provider's tokenizer tool and watch the split happen — it's worth doing once, because a few things will surprise you:

Punctuation and whitespace count. So does your JSON formatting — all those quotes, braces, and repeated field names are tokens you pay for.
Code tokenizes differently from prose, and often less efficiently.
Non-English text frequently uses more tokens per word — sometimes several tokens per character in some scripts — which means the same feature costs more and fits less for users writing in some languages. If you have an international audience, this is worth checking rather than assuming.

Everything is counted in tokens: the user's input, your system instructions, retrieved context, and the model's output. This matters because:

You're billed per token — usually at different rates for input and output, with output typically costing several times more.
Models have token limits, not word limits.

So "make the prompt shorter" isn't a style preference — it's a direct lever on cost and speed.

What is the context window?

The context window is the maximum number of tokens the model can consider at once — input and output combined. Think of it as the model's short-term memory for a single request. If the window is 200,000 tokens, then your instructions + conversation history + retrieved documents + the answer must all fit within that budget. Exceed it and, depending on the API, you get a hard error or silent truncation of the oldest content — both of which manifest as bugs that only appear on long inputs, which makes them wonderfully annoying to reproduce.

Crucially, the model has no memory between requests. It doesn't "remember" your last message unless you send that message again as part of the new request. Every apparent "conversation" is really the whole history being re-sent each turn. Once this clicks, several confusing behaviours make sense at once: why long chats get more expensive over time (you're paying to resend a growing transcript), why the bot "forgets" things from early in a very long conversation (they were trimmed to fit the window), and why "continue" after a truncated answer works (the client resends everything plus a nudge).

There's usually a separate, smaller limit on the output — a max_tokens parameter — which is why a model can stop mid-sentence even when the window has plenty of room. If you see answers that end abruptly, check that limit before blaming the model.

Why this shapes your app

Cost scales with tokens. A chatbot that resends a long history every turn, or a RAG system that stuffs huge documents into every prompt, can cost far more than expected. The estimate is simple arithmetic: (input tokens + output tokens) × price per token × request volume. Run it with real numbers early — I've written more about keeping LLM costs under control, and every technique in that post is ultimately token management.

Latency scales with tokens too. More input takes longer to process; more output takes longer to generate — output is generated one token at a time, so a response twice as long takes roughly twice as long to finish. A leaner prompt is a faster prompt.

Bigger context is not always better. Modern models advertise windows in the hundreds of thousands of tokens, and it's tempting to treat that as "just paste everything in." But filling the window has three costs: money (you pay for every token, every call), speed (long inputs process slower), and attention — research has repeatedly shown models can pay less attention to information buried in the middle of a very long context. Relevant-and-short usually beats comprehensive-and-huge. Large windows are best treated as headroom for the cases that genuinely need them, not as the default operating mode.

Practical ways to manage tokens

Trim the prompt. Remove redundant instructions and boilerplate. Every token you cut saves money on every single call — a 500-token trim on a prompt that runs 100,000 times a month is real money.
Summarize long histories. Instead of resending an entire chat, periodically replace old turns with a short summary and keep only the recent turns verbatim. Users almost never notice; your bill does.
Retrieve, don't dump. In RAG, fetch only the most relevant chunks rather than whole documents. Measure whether five chunks do the job before sending ten "to be safe."
Cap output length. Set a sensible max_tokens on every call so the model can't ramble expensively — and so a misbehaving prompt has a hard ceiling on the damage.
Use prompt caching where offered. Providers can cache the repeated prefix of your prompt (system instructions, examples) and charge much less for it on subsequent calls. If your prompt has a big static prefix, this is nearly free savings.
Measure real usage. Log tokens per request per feature in production — providers return the counts in every response. You'll often find a few request types dominate your bill, and they're rarely the ones you'd have guessed.

A mental model

Picture each request as a fixed-size desk (the context window). Everything the model needs — your rules, the conversation so far, the reference material, and room to write the answer — has to fit on that desk at the same time, and the desk is swept completely clean between requests. Your job is to keep the desk tidy: put only what's relevant on it, and always leave enough space for the reply. Most token problems, viewed this way, are just clutter.

Summary

Tokens are how models measure text, and the context window is how much text they can hold at once — input and output together, with no memory carried between calls. Because you pay per token and latency scales with them too, managing tokens is one of the highest-leverage skills in LLM development. Keep prompts lean, summarize long histories, retrieve only what's relevant, cap outputs, and measure real usage. Master this and your AI features get cheaper and faster at the same time — it's one of the few optimizations with no trade-off attached.