What Is a Large Language Model?

PM: Read in full — 15 min

The Wrong Mental Models

When most people encounter an LLM for the first time, they reach for the nearest familiar analogy. Maybe it's a search engine — you ask a question, it finds an answer. Maybe it's a database — a giant store of facts that you query with natural language. Both analogies are wrong, and building products on either assumption leads to frustrating surprises.

A search engine retrieves documents. A database returns records. An LLM does neither. It generates text, one piece at a time, by predicting what should come next given everything that came before. That's a fundamentally different operation, and understanding it changes how you design, evaluate, and trust the output.

What an LLM Actually Does

At its core, a large language model is a next-token predictor. When you send a prompt, the model doesn't look anything up. It reads the full text of your prompt and then starts generating a response, one token at a time.

A token is the basic unit of text the model works with: roughly a word fragment, a word, or a short sequence of characters. The model maintains a vocabulary, a fixed list of all tokens it knows, typically tens of thousands to a few hundred thousand entries depending on the model. For each position in the output, the model assigns a score to every token in that vocabulary, and those scores are normalized into a probability distribution that answers "how likely is each token to come next?"
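A minimal sketch of that scoring step, assuming a made-up five-token vocabulary and hand-picked scores (in a real model the scores come from the network and there is one per vocabulary entry). Softmax is the standard function for turning raw scores into probabilities; the names `vocab` and `scores` are illustrative only.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for the prompt "The cat sat on the"
vocab = ["mat", "sofa", "roof", "moon", "banana"]
scores = [4.0, 3.2, 2.5, 0.5, -1.0]

probs = softmax(scores)

# Rank tokens from most to least likely
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token:>7}: {p:.3f}")
```

Every token gets a non-zero probability, which is why an unlikely continuation can still occasionally be sampled.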

One token is then sampled from that distribution, appended to the sequence, and the process repeats. This is called autoregressive generation: each generated token feeds back into the input context for the next prediction. The model never generates the entire response at once. It's always predicting one step ahead.
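The loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table keyed on the last token only, whereas a real LLM conditions on the entire context. The point is the shape of the loop: predict, sample, append, repeat.

```python
import random

# Hypothetical stand-in for a model: maps the last token to next-token
# probabilities. A real LLM conditions on the whole context.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def next_token_distribution(context):
    """One 'forward pass': return a distribution over the next token."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(context)
        tokens = list(dist.keys())
        weights = list(dist.values())
        token = rng.choices(tokens, weights=weights, k=1)[0]  # sample one token
        if token == "<end>":
            break
        context.append(token)  # the new token becomes part of the input
    return context

print(generate(["<start>"]))
```

Changing `seed` may change which branch is taken, which is the same effect that makes real model output vary between runs.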

Temperature controls how sharp or flat the probability distribution is before sampling. At low temperature (close to 0), the model almost always picks the highest-scoring token, so output becomes nearly deterministic and conservative. At high temperature (1.5 or above), probability mass spreads across many candidates, and the model takes more surprising leaps. Most production deployments sit somewhere in between.
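One common way to implement temperature is to divide the scores by it before normalizing. The sketch below assumes that formulation and uses made-up scores for four candidate tokens; a temperature of exactly 0 is usually handled as a special case (always pick the top token), so this function rejects it.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Divide scores by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [4.0, 3.2, 2.5, 0.5]  # made-up scores for four candidate tokens

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability lands on the first token; at 1.5 the other candidates take a noticeably larger share.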

The Mechanism: The Transformer

The computation that converts your prompt into that probability distribution happens inside a Transformer — a neural network architecture introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.). Every major LLM today — GPT, Claude, Gemini, Llama, DeepSeek — is built on this architecture or a close variant.

A useful analogy: think of the Transformer as a codec. A video codec encodes a stream of pixels into a compact internal representation and decodes it back to output. The Transformer works similarly: it encodes your entire prompt into a set of rich vector representations — one per token — where each representation captures not just the token's standalone meaning but its meaning in context, shaped by every other token in the prompt. The output head then decodes those representations back into a probability distribution over the vocabulary.

The mechanism that makes context-sensitivity possible is attention: before predicting anything, the model computes relationships between tokens, letting each one "look at" what's relevant in the surrounding context. "Bank" in "river bank" attends strongly to "river" and gets a different internal representation than "bank" in "bank loan." The same word, different context, different representation, and therefore a different set of likely next tokens.
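A heavily simplified sketch of the idea, using scaled dot-product attention on made-up two-dimensional embeddings. Assumptions to note: the vectors are invented for illustration, the same vector serves as query, key, and value (real models use learned projections for each), and no causal masking is applied.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Made-up embeddings: dimension 0 ~ "nature", dimension 1 ~ "finance"
embeddings = {
    "river": [2.0, 0.0],
    "loan":  [0.0, 2.0],
    "bank":  [1.0, 1.0],
}

def contextual_bank(sentence):
    """Return the context-aware vector for 'bank' within `sentence`."""
    vectors = [embeddings[t] for t in sentence]
    _, output = attend(embeddings["bank"], vectors, vectors)
    return output

print(contextual_bank(["river", "bank"]))
print(contextual_bank(["bank", "loan"]))
```

The standalone vector for "bank" is the same in both calls, but the output leans toward the "nature" dimension next to "river" and toward the "finance" dimension next to "loan".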

The full mechanics of how a transformer block is structured are in Transformer Architecture. The essential point for building products: the model doesn't look anything up — it encodes your input into a contextual representation, then decodes from that to the next token, one step at a time.

The Autocomplete Analogy

Your smartphone keyboard predicts the next word as you type. It's been doing this for years. What it learned came from your personal typing history — a few megabytes of data at most.

LLMs are the same idea, applied to an incomprehensibly larger dataset: a large share of publicly available text (books, articles, code repositories, forums, websites), processed over weeks or months of training compute. Your phone's autocomplete learned that you often text "on my way." An LLM learned how academic papers argue, how code handles edge cases, how a doctor might phrase a diagnosis, and how a poet might end a stanza.

The difference is scale, not kind. That scale is what produces behavior that feels qualitatively different. It's also what makes failures look different — instead of a mildly awkward word suggestion, you get a confidently stated wrong fact.

Autoregressive Generation in Practice

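Here's what actually happens when you submit a prompt: the model runs a forward pass over the prompt, samples one token, appends it to the context, and repeats until it emits a stop token or reaches a length limit.

Every one of those steps is a full pass through the model. Generating a 200-word response means running the model's forward pass roughly 250 times, once per output token, because a word averages a little more than one token. That's why inference costs money and why latency scales with output length.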
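The arithmetic can be sketched as a back-of-the-envelope estimator. All default numbers are assumptions for illustration: the 1.25 tokens-per-word ratio is the one implied by "200 words, roughly 250 passes", and the per-token latency and price are hypothetical placeholders, not figures for any real model or provider. The sketch also ignores the time spent processing the prompt itself.

```python
def estimate_generation(words, tokens_per_word=1.25,
                        seconds_per_token=0.02,
                        price_per_1k_output_tokens=0.01):
    """Rough estimate of passes, latency, and cost for a response of `words` words."""
    tokens = round(words * tokens_per_word)  # one forward pass per output token
    return {
        "forward_passes": tokens,
        "latency_seconds": tokens * seconds_per_token,
        "cost": tokens / 1000 * price_per_1k_output_tokens,
    }

short = estimate_generation(200)
long = estimate_generation(400)
print(short)
print(long)
```

Doubling the response length doubles the number of passes, and with it the estimated latency and cost.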

Common Misconceptions

LLMs do NOT search the web in real time unless the application has explicitly added a web-retrieval tool and that tool is actually invoked. The base model only has access to what was in its training data and what you put in the current prompt.
LLMs do NOT have memory between conversations by default — each API call starts fresh. Any appearance of memory is because the application is injecting prior conversation history into the prompt.
LLMs do NOT "know" facts — they predict likely text based on patterns in training data. If the training data contained a wrong fact stated confidently, the model learned to state that wrong fact confidently.
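The memory point can be made concrete with a small sketch of how an application might fake continuity by resending history. `fake_model` is a hypothetical stand-in for an API call, and the "User:" / "Assistant:" prompt layout is an assumption; real chat APIs have their own message formats.

```python
def fake_model(prompt):
    """Hypothetical stand-in for an API call; it sees only `prompt`."""
    return f"(reply based on {len(prompt.splitlines())} lines of context)"

class ChatSession:
    def __init__(self, model):
        self.model = model
        self.history = []  # kept by the application, not by the model

    def send(self, user_message):
        self.history.append(f"User: {user_message}")
        # Every call resends the whole conversation so far
        prompt = "\n".join(self.history) + "\nAssistant:"
        reply = self.model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply

session = ChatSession(fake_model)
print(session.send("My name is Ada."))
print(session.send("What is my name?"))
```

The second call can only "remember" the name because the first exchange is included in the prompt again. Drop `self.history` and each call starts from nothing.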

Why This Matters When Building Products

The autoregressive, probabilistic nature of LLMs explains several behaviors that confuse product teams.

Hallucination is not a bug. It's the model doing exactly what it was trained to do: predict the most plausible next token. If nothing in the context signals that a claim is uncertain, the model has no mechanism to hedge. It just picks the most likely continuation — and sometimes the most likely-sounding text happens to be wrong. Systems that rely on LLM output for factual accuracy need external validation: retrieval, citation, or a human review step.

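Non-determinism is inherent. Running the same prompt twice at any temperature above zero will sometimes produce different results. This isn't a reliability issue; it's a consequence of sampling. Automated testing of LLM outputs needs to account for this: asserting exact string equality will fail intermittently, so tests should check properties of the output (format, required fields, allowed values) rather than exact wording.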
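One way to test under these conditions is to assert properties rather than exact strings. In the sketch below, `fake_llm_extract` is a hypothetical stand-in that varies its wording the way a sampled model might, and the JSON shape (`label`, `amount`, `currency`) is invented for the example.

```python
import json
import random

def fake_llm_extract(text, rng):
    """Hypothetical stand-in for a sampled LLM call returning JSON text."""
    phrasing = rng.choice(["Invoice total", "Total due", "Amount payable"])
    return json.dumps({"label": phrasing, "amount": 42.50, "currency": "USD"})

def check_output(raw):
    """Assert properties that must hold regardless of wording."""
    data = json.loads(raw)  # must be valid JSON
    assert set(data) == {"label", "amount", "currency"}
    assert isinstance(data["amount"], (int, float)) and data["amount"] > 0
    assert data["currency"] in {"USD", "EUR", "GBP"}
    return data

rng = random.Random()
for _ in range(20):
    # Wording differs between runs, but every run must pass the same checks
    check_output(fake_llm_extract("Invoice total: $42.50", rng))
```

Running the check many times, not once, is part of the design: a property that holds on one sample may still fail on the next.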

Context dependency is total. The model doesn't maintain a separate "state" — the entire context window is the state. A prompt that buries the key instruction at the bottom will perform differently from one that puts it at the top. Prompt structure is a design decision with measurable impact on output quality.

PM Takeaway

Hallucinations are not bugs — they're the model doing exactly what it was trained to do: predict the most likely next token. Understanding this helps you design systems with appropriate validation layers rather than treating the model as an oracle.
