Tokens and Tokenization

PM: Read in full — 15 min

Your Bill Is in Tokens, Not Words

When you get an invoice from OpenAI or Anthropic, the line item isn't "words processed." It's tokens. Pricing pages quote rates like "$3.00 per million input tokens." Context window limits — the hard ceiling on how much text a model can process at once — are also measured in tokens. So is the rate limit you hit when you push too much text through in a short window.

Tokens are not words. They're not characters. They're something in between, and the gap between tokens and words produces some genuinely surprising model behavior. Understanding what a token is will save you money, prevent context overflows, and explain a few model failures that otherwise look inexplicable.

What a Token Actually Is

A token is the atomic unit of text that a language model processes. Before a model sees your prompt, a tokenizer converts your raw text into a sequence of token IDs — integers from a fixed vocabulary. The model operates on those integers. It never sees the characters directly.

Most modern LLMs use Byte-Pair Encoding (BPE) or a close variant to build their vocabulary. The algorithm starts with individual characters (or raw bytes), then repeatedly finds the most common adjacent pair and merges it into a single new token. Do this enough times and you end up with a vocabulary of roughly 50,000–250,000 entries that contains single characters, common syllables, full words, and occasionally multi-word sequences. Common English words like "the" and "cat" typically become single tokens. Rare words get split: "tokenization" might tokenize as token + ization, or even token + iz + ation, depending on the vocabulary.
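To make the merge loop concrete, here is a minimal sketch of BPE training on a toy corpus. It is an illustration under simplifying assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, treats each word independently, and the function names (train_bpe, merge_pair, encode) are invented for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to the whole corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "low", "low", "lower", "lowest"], num_merges=2)
print(encode("low", merges))     # ['low']
print(encode("lowest", merges))  # ['low', 'e', 's', 't']
```

After only two merges, the frequent word "low" has collapsed into one token while the rarer "lowest" still splits into several pieces, which is the same pattern the paragraph above describes at full vocabulary scale.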

In practice, English prose averages roughly one token per 0.75 words — equivalently, about 4 characters per token. That ratio is not universal, and where it breaks down is where product teams get caught off guard.

Why Tokenization Produces Surprising Failures

Here's a well-known example. Ask an LLM to count the letter "r" in "strawberry." Many models will say two, when the correct answer is three. This isn't a reasoning failure — it's a tokenization artifact. The model likely sees "strawberry" as a single token or a small number of tokens, not as individual characters. It has no direct access to the letters inside a token; it just knows the token as a unit.

The same dynamic explains why simple multi-digit arithmetic is harder than it looks. Numbers tokenize unpredictably. The number 127 might be one token. 128 might be two tokens: 12 + 8. Or three: 1 + 2 + 8. Arithmetic that seems elementary requires the model to work against the grain of how it sees numbers: not as place-value digits, but as patterns of token IDs it happened to encounter during training.
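A toy example can show how two neighboring numbers end up with different token shapes. The sketch below uses a made-up vocabulary and a greedy longest-match rule; both are assumptions chosen for illustration. Real BPE tokenizers replay learned merges rather than matching longest-first, and their actual number splits differ by model.

```python
# Hypothetical vocabulary: "127" happened to be frequent enough to earn
# its own token, "128" did not. Single digits are always available.
TOY_VOCAB = {"127", "12"} | set("0123456789")


def greedy_tokenize(text, vocab):
    """Split `text` by always taking the longest vocabulary entry that fits."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


print(greedy_tokenize("127", TOY_VOCAB))  # ['127']
print(greedy_tokenize("128", TOY_VOCAB))  # ['12', '8']
```

The two numbers differ by one, yet the model would receive one ID for the first and two unrelated IDs for the second.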

Neither of these is a bug to be fixed in the next model version. They're structural consequences of how tokenization works. When a task requires character-level or digit-level precision, build a validation step outside the model rather than trusting the model to catch it.
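A validation step of this kind can be very small. The following is a minimal sketch, assuming the model's answer has already been parsed into an integer; the helper names are hypothetical and the parsing of real model output is left out.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def check_letter_count(word: str, letter: str, model_answer: int):
    """Return (is_correct, true_count) for a model's claimed letter count."""
    true_count = count_letter(word, letter)
    return model_answer == true_count, true_count


def check_sum(a: int, b: int, model_answer: int):
    """Return (is_correct, true_sum) for a model's claimed addition result."""
    true_sum = a + b
    return model_answer == true_sum, true_sum


print(check_letter_count("strawberry", "r", 2))  # (False, 3)
print(check_sum(127, 128, 255))                  # (True, 255)
```

Ordinary code sees characters and digits directly, so it can cheaply confirm or correct what the model produced before the result reaches a user.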

Multilingual and Code Tokenization

BPE vocabularies are built from training data, and training data is dominated by English. That has a direct cost consequence: non-English languages typically consume far more tokens per unit of meaning.

Japanese, Korean, and Chinese are written in scripts with thousands of distinct characters or syllable blocks. A vocabulary built primarily from English text won't have efficient multi-character merges for these scripts. A Japanese sentence that translates to 20 English words might consume 60–80 tokens, versus the roughly 27 you'd expect from the English version. You're paying two to three times as much to process the same information.

Code has a different profile. Syntax characters — parentheses, brackets, indentation, operators — often become individual tokens. Variable names with underscores split at underscores. The result is that code is denser in tokens than prose, but more predictably so. A Python file with 1000 characters will typically use more tokens than a 1000-character English paragraph, but the ratio is stable enough to plan around.

Content type	Approx. tokens per 1,000 characters
English prose	~250
Python code	~250–300
Japanese/Korean	~400–500
Repeated tokens (e.g. "aaaa...")	Highly variable
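For rough planning, per-content-type rates like these can be turned into a small estimator. The sketch below is illustrative only: the English rate comes from the 4-characters-per-token rule of thumb, the other rates are midpoints of the approximate ranges in this section, and the dictionary keys and function name are made up for the example. For real budgeting, count tokens with your provider's own tokenizer.

```python
import math

# Assumed planning rates (tokens per 1,000 characters), not measurements.
TOKENS_PER_1000_CHARS = {
    "english_prose": 250,    # about 4 characters per token
    "python_code": 275,      # midpoint of ~250-300
    "japanese_korean": 450,  # midpoint of ~400-500
}


def estimate_tokens(text: str, content_type: str) -> int:
    """Estimate the token count of `text` from its length and content type."""
    rate = TOKENS_PER_1000_CHARS[content_type]
    # Round up so the estimate errs on the side of a larger budget.
    return math.ceil(len(text) * rate / 1000)


sample = "x" * 1000
print(estimate_tokens(sample, "english_prose"))    # 250
print(estimate_tokens(sample, "japanese_korean"))  # 450
```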

Product Relevance

Every context window limit is a token budget. Some Gemini models support context windows of 1 million tokens or more, but that does not mean 1 million words. For English prose, a 1 million token window is roughly 750,000 words: still enormous, but not the same. For a Japanese-language application that spends two to three times as many tokens on the same content, effective capacity might be closer to the equivalent of 250,000–375,000 English words.
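The conversion from a token window to word capacity is simple enough to script. This is a minimal sketch that assumes the 0.75 words-per-token rule of thumb for English; the token_multiplier parameter is a hypothetical knob for content that needs more tokens than English to say the same thing.

```python
WORDS_PER_TOKEN = 0.75  # English rule of thumb from earlier in this section


def word_capacity(window_tokens: int, token_multiplier: float = 1.0) -> int:
    """Approximate English-equivalent words that fit in a context window.

    token_multiplier: how many times more tokens the content needs than
    English for the same meaning (1.0 for English itself).
    """
    return round(window_tokens * WORDS_PER_TOKEN / token_multiplier)


print(word_capacity(1_000_000))       # 750000
print(word_capacity(1_000_000, 3.0))  # 250000
```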

Cost estimation follows the same logic. A 1000-word English document is roughly 1,333 tokens. Run that through a model at $3.00 per million input tokens and the cost is about $0.004 per document — negligible at small scale, real money at volume. If your application processes 10 million such documents a month, input costs alone are $40,000. That's before output tokens.
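The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with your own volumes and rates. This is a sketch under the same assumptions as the paragraph (0.75 words per token, $3.00 per million input tokens); the function names are illustrative and output tokens are not included.

```python
WORDS_PER_TOKEN = 0.75  # English rule of thumb


def tokens_for_words(word_count: float) -> float:
    """Convert an English word count to an approximate token count."""
    return word_count / WORDS_PER_TOKEN


def input_cost_usd(tokens: float, usd_per_million_tokens: float) -> float:
    """Input cost in dollars for a given number of tokens."""
    return tokens * usd_per_million_tokens / 1_000_000


tokens_per_doc = tokens_for_words(1000)
cost_per_doc = input_cost_usd(tokens_per_doc, 3.00)
monthly_cost = cost_per_doc * 10_000_000  # 10 million documents a month

print(f"{tokens_per_doc:.0f} tokens per document")
print(f"${cost_per_doc:.4f} per document")
print(f"${monthly_cost:,.0f} per month")
```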

The practical implication for system design: don't think in words or characters when sizing prompts or estimating throughput. Think in tokens, apply the right conversion factor for your content type, and add a buffer. Context windows that seem large have a way of filling up faster than anticipated when you add a system prompt, a few examples, retrieved documents, and conversation history.
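One way to apply this is a pre-flight budget check that adds up every component of the prompt and holds back a buffer. The sketch below is an assumption-laden illustration: the component names, the 10% default buffer, and the function name are all hypothetical, and the token counts are expected to come from a real tokenizer or a conservative estimate.

```python
def check_context_budget(window_tokens: int, components: dict,
                         buffer_percent: int = 10):
    """Return (fits, headroom_tokens) after reserving a safety buffer.

    components maps a label (system prompt, examples, ...) to its token count.
    """
    # Integer math keeps the usable budget exact.
    usable = window_tokens * (100 - buffer_percent) // 100
    used = sum(components.values())
    return used <= usable, usable - used


prompt_parts = {
    "system_prompt": 500,
    "examples": 1500,
    "retrieved_documents": 4000,
    "conversation_history": 1000,
}
print(check_context_budget(8000, prompt_parts))  # (True, 200)
```

In this example the four components leave only 200 tokens of headroom in an 8,000-token window, so a slightly longer conversation history would already push the request over budget.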

PM Takeaway

When estimating context usage or costs, budget roughly 1.3 tokens per English word. For documents in other languages or heavy code, multiply by 1.5–3×, with languages such as Japanese and Korean at the high end. Your context limit is a token budget, not a word budget.
