Tokens and Context Windows Explained
If you build with language models, two concepts govern your costs, your latency, and what your app can even do: tokens and the context window. They sound like trivia, but misunderstanding them leads directly to surprise bills and mysterious failures. Here's a practical explanation.
What is a token?
A token is a chunk of text — roughly a word or a piece of one. Models don't read characters or whole words; they read tokens. As a rough guide for English, one token is about four characters, and 100 tokens is about 75 words. Common words are often a single token; longer or rarer words split into several.
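To make the four-characters-per-token rule of thumb concrete, here is a minimal Python sketch. The function name `estimate_tokens` and its `chars_per_token` parameter are hypothetical, chosen for illustration. Real tokenizers split text differently from model to model, so treat the result as a ballpark figure rather than an exact count.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text.

    Assumes the ~4 characters per token heuristic; an actual
    tokenizer will give different numbers.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

# Example: a 400-character prompt comes out at roughly 100 tokens.
prompt = "x" * 400
print(estimate_tokens(prompt))
```

For billing or hard limits, use the token counts your provider reports instead of an estimate like this.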
Everything is counted in tokens: the user's input, your system instructions, retrieved context, and the model's output. This matters because:
- You're billed per token — usually at different rates for input and output.
- Models have token limits, not word limits.
So "make the prompt shorter" isn't a style preference — it's a direct lever on cost and speed.
What is the context window?
The context window is the maximum number of tokens the model can consider at once, input and output combined. Think of it as the model's short-term memory for a single request. If the window is 128,000 tokens, then your instructions + conversation history + retrieved documents + the answer must all fit within that budget. Many providers also cap output length separately, so the answer can hit a smaller limit before the window itself is full.
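The budget arithmetic can be sketched in a few lines of Python. The function name and the token counts below are made up for illustration; only the 128,000-token window comes from the example above.

```python
def remaining_output_budget(context_window: int, *input_token_counts: int) -> int:
    """Tokens left for the answer after every input part is counted.

    A negative result means the inputs alone already overflow the window.
    """
    return context_window - sum(input_token_counts)

# Hypothetical request: instructions, conversation history, retrieved documents.
window = 128_000
left = remaining_output_budget(window, 1_500, 40_000, 60_000)
print(left)  # 26500 tokens remain for the answer
```

Running a check like this before sending a request lets you trim the inputs rather than discover the overflow as a failed call.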
Crucially, the model has no memory between requests. It doesn't "remember" your last message unless you send that message again as part of the new request. Every apparent "conversation" is really the whole history being re-sent each turn. That's why long chats get more expensive over time — you're paying to resend a growing transcript.
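A small Python sketch shows how re-sending the history adds up. It assumes a simplified model in which each turn adds a fixed number of tokens to the transcript (the user message plus the previous reply); the function name and the numbers are illustrative only.

```python
def input_tokens_per_turn(turn_sizes: list[int]) -> list[int]:
    """Input tokens sent on each turn when the whole history is re-sent.

    turn_sizes[i] is the number of new tokens added to the transcript on turn i.
    """
    sent = []
    history = 0
    for size in turn_sizes:
        history += size       # the transcript grows by this turn's new tokens
        sent.append(history)  # and the full transcript is sent again
    return sent

# Four turns of 100 new tokens each.
per_turn = input_tokens_per_turn([100, 100, 100, 100])
print(per_turn)       # [100, 200, 300, 400]
print(sum(per_turn))  # 1000 input tokens billed for 400 tokens of new text
```

Per-turn input grows linearly with the length of the chat, so the total billed across the conversation grows roughly with the square of the number of turns.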
Why this shapes your app
Cost scales with tokens. A chatbot that resends a long history every turn, or a RAG system that stuffs huge documents into every prompt, can cost far more than expected. Because input and output are usually priced at different rates, estimate the per-request cost as (input tokens × input price) + (output tokens × output price), then multiply by your request volume.
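Here is a minimal Python sketch of that cost estimate, using separate input and output rates since the two are usually billed differently. The function name, the per-million-token pricing convention, and the prices in the example are all assumptions for illustration, not any provider's actual rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    requests: int = 1,
) -> float:
    """Estimated spend for `requests` calls of the same shape.

    Prices are expressed per one million tokens (an assumed convention).
    """
    per_request_micro = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return per_request_micro * requests / 1_000_000

# Hypothetical workload: 2,000 input and 500 output tokens per request,
# at made-up prices of 3.00 and 15.00 per million tokens, 10,000 requests.
total = estimate_cost(2_000, 500, 3.00, 15.00, requests=10_000)
print(f"{total:.2f}")  # 135.00
```

In this made-up example the 500 output tokens cost more than the 2,000 input tokens, which is why capping output length can matter as much as trimming the prompt.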
Latency scales with tokens too. More input takes longer to process; more output takes longer to generate. A leaner prompt is a faster prompt.
Bigger context is not always better. Even when a model can accept 100k tokens, filling it has downsides: it costs more, it's slower, and models can pay less attention to information buried in the middle of a very long context. Relevant-and-short usually beats comprehensive-and-huge.
Practical ways to manage tokens
- Trim the prompt. Remove redundant instructions and boilerplate. Every token you cut saves money on every call.
- Summarize long histories. Instead of resending an entire chat, periodically replace old turns with a short summary.
- Retrieve, don't dump. In RAG, fetch only the most relevant chunks rather than whole documents.
- Cap output length. Set a sensible maximum on the response so the model can't ramble expensively.
- Measure real usage. Log tokens per request in production. You'll often find a few request types dominate your bill.
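As one way to act on the last item, the sketch below totals logged tokens by request type so the heaviest consumers stand out. The record layout (request type, input tokens, output tokens) and the function name are hypothetical; adapt them to whatever your logging already captures.

```python
from collections import Counter

def tokens_by_request_type(records):
    """Total tokens per request type, largest first.

    Each record is a (request_type, input_tokens, output_tokens) tuple,
    an assumed log format for this example.
    """
    totals = Counter()
    for request_type, input_tokens, output_tokens in records:
        totals[request_type] += input_tokens + output_tokens
    return totals.most_common()

# Made-up log entries.
log = [
    ("chat", 1_000, 200),
    ("search", 300, 50),
    ("chat", 2_000, 400),
]
print(tokens_by_request_type(log))  # [('chat', 3600), ('search', 350)]
```

If input and output are priced differently, keeping the two counts separate in the totals gives a closer picture of where the bill comes from.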
A mental model
Picture each request as a fixed-size desk (the context window). Everything the model needs — your rules, the history, the reference material, and room to write the answer — has to fit on that desk at the same time. Your job is to keep the desk tidy: put only what's relevant on it, and leave enough space for the reply.
Summary
Tokens are how models measure text, and the context window is how much text they can hold at once — input and output together. Because you pay per token and models have no memory between calls, managing tokens is one of the highest-leverage skills in LLM development. Keep prompts lean, summarize long histories, retrieve only what's relevant, and measure real usage. Master this and you'll ship AI features that are both cheaper and faster.