Context Windows: Memory, Cost, and Limits

PM: Read in full (20 min)

Every conversation with a language model starts from scratch. The model has no memory of what you discussed last Tuesday, or even five minutes ago in a different browser tab. All it knows is what's in the current context window. Understanding what lives in that window, and what it costs, is one of the most practically useful things a product person can learn about LLMs.

What Is a Context Window?

The context window is the model's working memory at inference time. It holds everything the model can "see" during a single call:

  • System prompt: instructions, persona, and constraints you've set
  • Conversation history: prior turns in the current session
  • Retrieved documents: anything injected via retrieval-augmented generation (RAG)
  • The current user message: what was just sent

All of these get concatenated into a single sequence of tokens and handed to the model at once. There's no background long-term storage being quietly consulted. The model sees exactly one flat list of tokens, and it generates the next one.
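The concatenation described above can be sketched in a few lines. This is a minimal illustration, not any provider's actual pipeline: `toy_tokenize`, `ContextParts`, and `build_context` are hypothetical names, and splitting on whitespace stands in for a real subword tokenizer.

```python
from dataclasses import dataclass, field


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): splits on whitespace. Real tokenizers use subword units."""
    return text.split()


@dataclass
class ContextParts:
    """Hypothetical container for the four components listed above."""
    system_prompt: str
    history: list[str] = field(default_factory=list)         # prior turns, oldest first
    retrieved_docs: list[str] = field(default_factory=list)  # RAG results
    user_message: str = ""


def build_context(parts: ContextParts) -> list[str]:
    """Concatenate every component into one flat token sequence."""
    segments = [parts.system_prompt, *parts.history, *parts.retrieved_docs, parts.user_message]
    tokens: list[str] = []
    for segment in segments:
        tokens.extend(toy_tokenize(segment))
    return tokens


parts = ContextParts(
    system_prompt="You are a helpful assistant.",
    history=["User: Hi", "Assistant: Hello there"],
    retrieved_docs=["Refund policy: 30 days."],
    user_message="User: What is the refund policy?",
)
context = build_context(parts)
print(len(context))  # 20 toy tokens in one flat list
```

Nothing in `context` marks where the system prompt ends and the history begins; from the model's side it is a single sequence.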

This has a direct implication: if your conversation history exceeds the context window, something gets dropped. Most implementations drop the oldest turns. The model then has no memory of that part of the conversation, not because it "forgot," but because those tokens were never in the input.
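A drop-the-oldest policy might look like the sketch below. It is one possible strategy under simple assumptions: `count_tokens` and `fit_history` are hypothetical helpers, and the token count is a whitespace word count rather than a real tokenizer's output.

```python
def count_tokens(text: str) -> int:
    """Toy token count (assumption): whitespace-separated words."""
    return len(text.split())


def fit_history(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined token count fits within budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so the oldest turns are the ones dropped.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = [
    "User: My order number is 1234",     # 6 toy tokens
    "Assistant: Thanks, looking it up",  # 5 toy tokens
    "User: Any update on my order?",     # 6 toy tokens
]
trimmed = fit_history(turns, budget=12)
print(trimmed)  # the first turn, with the order number, no longer fits
```

With a budget of 12, the turn containing the order number is dropped, so a model receiving `trimmed` has no way to know it.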

The Cost of Being Long

Attention, the mechanism that gives transformers their power, requires every token to attend to every other token in the window. Double the context length, and the computation doesn't double; it roughly quadruples. This is the "quadratic scaling" problem that comes up whenever someone asks why longer context costs more.

In practice, model vendors have developed approximations that make this scaling softer than strict O(n²). But the fundamental dynamic holds: long contexts cost meaningfully more than short ones. Filling a 128k-token window does not cost just four times the attention compute of filling a 32k window; under strict quadratic scaling it would be closer to sixteen times.
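The arithmetic behind strict quadratic scaling is easy to check. The sketch below counts only token-to-token attention comparisons; it ignores the other parts of the model and any vendor optimizations, so treat the ratios as an upper-bound intuition rather than a pricing formula. The function names are illustrative.

```python
def attention_pairs(n_tokens: int) -> int:
    """Comparisons in full self-attention: every token attends to every token."""
    return n_tokens * n_tokens


def relative_attention_cost(short: int, long: int) -> float:
    """How many times more attention work the longer context needs."""
    return attention_pairs(long) / attention_pairs(short)


ratio_double = relative_attention_cost(1_000, 2_000)   # 2x the length
ratio_128k = relative_attention_cost(32_000, 128_000)  # 4x the length
print(ratio_double, ratio_128k)  # 4.0 16.0
```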

The KV Cache: Why Multi-Turn Isn't As Slow As You'd Think

Each token in the context requires computing two internal representations, a key and a value, used in the attention calculation. For a long conversation, recomputing these from scratch on every turn would be painful.

Instead, inference servers maintain a KV cache: they store the keys and values for tokens they've already processed and reuse them, both while generating a response and, where the serving stack supports it, on subsequent turns. When the user sends a new message, the model only needs to compute keys and values for the new tokens, then attend across the cached set plus the new ones.
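The caching idea can be illustrated with a toy class. This is a conceptual sketch only: `ToyKVCache` is a hypothetical name, and strings stand in for the key and value tensors a real model would compute per layer.

```python
class ToyKVCache:
    """Caches a (key, value) pair per token so earlier tokens are not reprocessed."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []
        self.computed = 0  # total key/value computations performed so far

    def _compute_kv(self, token: str) -> tuple[str, str]:
        # Placeholder for the real projection math; strings keep the sketch readable.
        self.computed += 1
        return (f"k({token})", f"v({token})")

    def extend(self, new_tokens: list[str]) -> int:
        """Compute keys/values for new tokens only; return how many were computed."""
        before = self.computed
        self.entries.extend(self._compute_kv(t) for t in new_tokens)
        return self.computed - before


cache = ToyKVCache()
first_turn = cache.extend("What is a context window ?".split())
second_turn = cache.extend("And what does it cost ?".split())
print(first_turn, second_turn, len(cache.entries))  # 6 6 12
```

On the second turn only the six new tokens are computed, yet all twelve cached entries are available to attend over. Without the cache, the second turn would recompute all twelve.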

This dramatically reduces latency for multi-turn conversations. The trade-off is GPU memory: every cached token takes up space. Providers with many simultaneous long-context conversations are managing a significant memory overhead just in KV cache storage.
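To get a feel for the memory involved, here is a back-of-the-envelope estimate. The model dimensions below (32 layers, 8 key/value heads, head size 128, 16-bit values) are hypothetical and chosen only for illustration; actual figures vary by model and serving setup.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 16-bit floats
) -> int:
    """Approximate KV cache size for one conversation."""
    # The leading 2 accounts for storing one key and one value per token, per layer, per head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model dimensions (assumptions, not any specific model).
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
full_context = kv_cache_bytes(100_000, n_layers=32, n_kv_heads=8, head_dim=128)
gib = full_context / 2**30

print(per_token)      # 131072 bytes, i.e. 128 KiB per token
print(round(gib, 1))  # 12.2 GiB for a single 100k-token conversation
```

Multiply that by the number of concurrent long conversations and the memory pressure on the provider becomes clear.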

Time to First Token (TTFT)

Before you can show a user any output, the model has to process the entire input context. This phase, called the prefill, produces no visible output. Only after it's complete does the model start generating tokens.

The time between sending a request and seeing the first output token is called TTFT (time to first token). Longer context means longer prefill means higher TTFT. For a product where users expect responsive streaming (a chat UI, a code autocomplete, a voice interface), this is a critical metric, not just a backend concern.

A 100k-token context can add multiple seconds to TTFT even on well-optimized infrastructure. If you're building an interactive product, this is worth knowing before you design a feature that stuffs large documents into every request.
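If you want to track TTFT yourself, the measurement is simple: start a timer when the request goes out and stop it when the first streamed token arrives. The sketch below assumes the response is exposed as a Python iterator of text chunks; `fake_stream` is a hypothetical stand-in for a real streaming client, with a sleep simulating prefill.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, str]:
    """Return (seconds until the first token arrives, that first token)."""
    start = clock()
    first_token = next(iter(stream))  # blocks until the first chunk is available
    return clock() - start, first_token


def fake_stream(prefill_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: waits for 'prefill', then yields tokens."""
    time.sleep(prefill_seconds)  # no output is produced during prefill
    yield "Hello"
    yield " world"


ttft, token = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft:.3f}s, first token: {token!r}")
```

With a real client, the timer should start at the moment the request is sent, so that network and queueing time are included in what the user actually experiences.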

The "Lost in the Middle" Problem

There's a practical quality issue that goes beyond cost: models often struggle to use information at the center of a very long context. Research has found that retrieval accuracy for facts buried in the middle of a 100k-token window drops noticeably compared to the same facts placed near the beginning or end.

This isn't a bug in the usual sense; it's a characteristic of how attention works in practice. The model isn't equally attentive to all 100,000 tokens. If your architecture dumps a 50-page document into context and asks a specific question about page 22, you may get worse results than a RAG system that retrieves only the relevant two paragraphs.
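One way to check how position affects your own setup is to plant a known fact at different depths in filler text and see whether the model can retrieve it. The sketch below only builds the test inputs; `place_needle` is a hypothetical helper, and sending each prompt to a model and scoring the answers is left out.

```python
def place_needle(filler: list[str], needle: str, depth: float) -> list[str]:
    """Insert needle at a relative depth: 0.0 = start, 0.5 = middle, 1.0 = end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(len(filler) * depth)
    return filler[:index] + [needle] + filler[index:]


filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The access code is 7421."

# Build one document per depth; each would be followed by the same question.
prompts = {d: place_needle(filler, needle, d) for d in (0.0, 0.5, 1.0)}
positions = {d: docs.index(needle) for d, docs in prompts.items()}
print(positions)  # {0.0: 0, 0.5: 5, 1.0: 10}
```

Running the same question against each variant, at the context lengths your product actually uses, shows whether the middle placement performs worse for your model.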

Implications for Product Design

Three decisions matter most:

  1. Don't stuff context; retrieve it. For large knowledge bases, RAG typically outperforms context stuffing on both cost and quality. Send only the relevant chunks, not the entire corpus.

  2. Model context window ≠ effective context. A 128k-token window doesn't mean a 128k-token context reliably produces good results. Quality degrades at extreme lengths. Test at your actual usage lengths, not at the ceiling.

  3. Longer context windows cost more. Billing by token is standard across providers. A system that sends 50k tokens per request will have dramatically different economics than one that sends 2k. Budget for this early, and treat context length as an architecture decision.
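The economics in the third point can be made concrete with a quick estimate. The price and request volume below are hypothetical placeholders, not any provider's actual rates; substitute your own numbers.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_month: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token spend per month under simple per-token billing."""
    return tokens_per_request * requests_per_month * usd_per_million_tokens / 1_000_000


PRICE = 3.00          # hypothetical USD per million input tokens
REQUESTS = 1_000_000  # hypothetical monthly request volume

small = monthly_input_cost(2_000, REQUESTS, PRICE)
large = monthly_input_cost(50_000, REQUESTS, PRICE)
print(small, large, large / small)  # 6000.0 150000.0 25.0
```

Under linear per-token billing, the 50k-token design costs 25 times as much as the 2k-token design at the same request volume, before counting output tokens.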

PM Roadmap Tip

"Unlimited context" models still involve trade-offs: higher cost and higher TTFT. Design your product to use context efficiently: don't pass everything; pass what's relevant. RAG is often the answer for large knowledge bases, and the quality argument for it is as strong as the cost argument.

Further Reading