What Is a Large Language Model?
The Wrong Mental Models
When most people encounter an LLM for the first time, they reach for the nearest familiar analogy. Maybe it's a search engine: you ask a question, it finds an answer. Maybe it's a database: a giant store of facts that you query with natural language. Both analogies are wrong, and building products on either assumption leads to frustrating surprises.
A search engine retrieves documents. A database returns records. An LLM does neither. It generates text, one piece at a time, by predicting what should come next given everything that came before. That's a fundamentally different operation, and understanding it changes how you design, evaluate, and trust the output.
What an LLM Actually Does
At its core, a large language model is a next-token predictor. When you send a prompt, the model doesn't look anything up. It reads the full text of your prompt and then starts generating a response, one token at a time.
A token is the basic unit of text the model works with: roughly a word fragment, a word, or a short sequence of characters. The model maintains a vocabulary, a fixed list of all tokens it knows, typically 50,000 to 100,000 entries, with some models using larger vocabularies. For each position in the output, the model assigns a score to every token in that vocabulary, producing a probability distribution: in effect, a ranked list of "how likely is each token to come next?"
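To make the step from scores to probabilities concrete, here is a minimal sketch using a softmax, the usual way raw scores are normalized. The four-token vocabulary and the score values are invented for illustration; a real model scores its entire vocabulary.

```python
import math

def softmax(scores):
    """Convert raw per-token scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The cat sat on the"
scores = {" mat": 4.0, " sofa": 3.0, " roof": 2.0, " moon": 0.0}

probs = softmax(scores)
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for tok, p in ranked:
    print(f"{tok!r}: {p:.3f}")
```

The ordering of the scores is preserved, so the highest-scoring token is also the most probable one. What softmax adds is a set of weights that can be sampled from.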
One token is then sampled from that distribution, appended to the sequence, and the process repeats. This is called autoregressive generation: each generated token feeds back into the input context for the next prediction. The model never generates the entire response at once. It's always predicting one step ahead.
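The loop described above can be sketched in a few lines. `TOY_MODEL` is a hypothetical stand-in that looks only at the last token, whereas a real LLM conditions on the whole sequence. The shape of the loop is the point here: predict, sample, append, repeat.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over next tokens.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def next_token_distribution(sequence):
    """Return next-token probabilities given the sequence so far."""
    return TOY_MODEL[sequence[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)   # one "forward pass"
        tokens, weights = zip(*dist.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]  # sample one token
        if token == "<end>":
            break
        sequence.append(token)                     # feed it back into the context
    return sequence

print(generate(["<start>"]))
```

Changing `seed` can change which branch is taken, which is the same sampling variation that shows up when a real model is asked the same question twice.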
Temperature controls how sharp or flat the probability distribution is before sampling. At low temperature (close to 0), the model almost always picks the highest-scoring token, so output becomes nearly deterministic and conservative. At high temperature (1.5 or above), probability mass spreads across many candidates, and the model takes more surprising leaps. Most production deployments sit somewhere in between.
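A common way to implement temperature is to divide the scores by it before normalizing. The sketch below assumes that formulation and uses made-up scores. A temperature of exactly 0 cannot be divided by, so it is rejected here; implementations often treat it as "always pick the top token" instead.

```python
import math

def apply_temperature(scores, temperature):
    """Return sampling probabilities after scaling scores by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                      # for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [4.0, 3.0, 2.0, 0.0]               # hypothetical scores for four tokens
for t in (0.2, 1.0, 1.5):
    probs = apply_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability lands on the first token. At 1.5 the same scores leave the top token with only a little over half of it, so the other candidates get sampled far more often.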
The Autocomplete Analogy
Your smartphone keyboard predicts the next word as you type. It's been doing this for years. What it learned came from your personal typing history, a few megabytes of data at most.
LLMs are the same idea, applied to a vastly larger dataset: a large share of publicly available text (books, articles, code repositories, forums, websites), processed over months of compute. Your phone's autocomplete learned that you often text "on my way." An LLM learned how academic papers argue, how code handles edge cases, how a doctor might phrase a diagnosis, and how a poet might end a stanza.
The difference is scale, not kind. That scale is what produces behavior that feels qualitatively different. It's also what makes failures look different: instead of a mildly awkward word suggestion, you get a confidently stated wrong fact.
Autoregressive Generation in Practice
Here's what actually happens when you submit a prompt:
Every arrow in that diagram represents a full pass through the model. Generating a 200-word response means running the model's forward pass roughly 250 times, because English text averages somewhat more than one token per word. That's why inference costs money and why latency scales with output length.
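The arithmetic behind that estimate is simple enough to sketch. The ratio of 1.25 tokens per word and the per-token latency below are illustrative assumptions, not measurements. Both vary with the tokenizer, the text, the model, and the hardware.

```python
import math

def estimate_forward_passes(word_count, tokens_per_word=1.25):
    """Rough number of decoding steps needed to emit `word_count` words.

    tokens_per_word is an assumed average, not a property of any specific model.
    """
    return math.ceil(word_count * tokens_per_word)

def estimate_latency_seconds(word_count, seconds_per_token, tokens_per_word=1.25):
    """Generation time if every token costs the same (a simplification)."""
    return estimate_forward_passes(word_count, tokens_per_word) * seconds_per_token

passes = estimate_forward_passes(200)
print(passes)  # 250 under the assumed ratio
print(estimate_latency_seconds(200, seconds_per_token=0.02))
```

Doubling the requested length doubles both numbers, which is why capping output length is one of the most direct levers on cost and latency.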
- LLMs do NOT search the web in real time, unless the application has explicitly added a web retrieval tool and that tool is actually invoked. The base model only has access to what was in its training data and what you put in the current prompt.
- LLMs do NOT have memory between conversations by default. Each API call starts fresh. Any appearance of memory is because the application is injecting prior conversation history into the prompt.
- LLMs do NOT "know" facts. They predict likely text based on patterns in training data. If the training data contained a wrong fact stated confidently, the model learned to state that wrong fact confidently.
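The memory point is the easiest of these to demonstrate. In the sketch below, `fake_model` is a hypothetical stand-in for a stateless model call, and `build_prompt` is one assumed way an application might flatten prior turns into the prompt. Neither is a real API.

```python
def build_prompt(history, user_message):
    """Flatten prior turns plus the new message into one prompt string."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {user_message}")
    lines.append("assistant:")
    return "\n".join(lines)

def fake_model(prompt):
    """Stand-in for a stateless model call: it sees only `prompt`."""
    if "my name is Ada" in prompt:
        return "Your name is Ada."
    return "I don't know your name."

# The application, not the model, stores the conversation.
history = [
    ("user", "Hi, my name is Ada."),
    ("assistant", "Nice to meet you."),
]

question = "What is my name?"
without_history = fake_model(build_prompt([], question))
with_history = fake_model(build_prompt(history, question))

print(without_history)
print(with_history)
```

The two calls differ only in what the application chose to put in the prompt. Dropping or truncating `history` is all it takes for the "memory" to disappear.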
Why This Matters When Building Products
The autoregressive, probabilistic nature of LLMs explains several behaviors that confuse product teams.
Hallucination is not a bug. It's the model doing exactly what it was trained to do: predict the most plausible next token. If nothing in the context signals that a claim is uncertain, the model has no mechanism to hedge. It just picks the most likely continuation, and sometimes the most likely-sounding text happens to be wrong. Systems that rely on LLM output for factual accuracy need external validation: retrieval, citation, or a human review step.
Non-determinism is inherent. Running the same prompt twice at any temperature above zero can produce different results. This isn't a reliability issue; it's a consequence of sampling. Automated testing of LLM outputs needs to account for this, because asserting exact string equality will fail intermittently.
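One way to test under these conditions is to assert properties of the output rather than its exact text. The sketch below assumes the application asks for JSON with two fields. `fake_llm` is a hypothetical stand-in whose wording varies from call to call, and the field names and length limit are invented for illustration.

```python
import json
import random

def fake_llm(prompt, rng):
    """Hypothetical stand-in for a sampled model call; wording varies per call."""
    summary = rng.choice([
        "Order shipped on time.",
        "The order was shipped on schedule.",
        "Shipment left the warehouse as planned.",
    ])
    return json.dumps({"sentiment": "positive", "summary": summary})

def check_response(raw):
    """Assert structural properties instead of comparing against one exact string."""
    data = json.loads(raw)                                  # must parse as JSON
    assert set(data) == {"sentiment", "summary"}            # expected fields only
    assert data["sentiment"] in {"positive", "neutral", "negative"}
    assert 0 < len(data["summary"]) <= 200                  # non-empty and bounded
    return data

rng = random.Random()
for _ in range(5):
    check_response(fake_llm("Summarize the shipping update.", rng))
print("all property checks passed")
```

Checks like these pass for any of the acceptable phrasings, so they fail only when the output actually violates the contract.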
Context dependency is total. The model doesn't maintain a separate "state": the entire context window is the state. A prompt that buries the key instruction at the bottom will perform differently from one that puts it at the top. Prompt structure is a design decision with measurable impact on output quality.
Hallucinations are not bugs: they're the model doing exactly what it was trained to do, which is to predict the most likely next token. Understanding this helps you design systems with appropriate validation layers rather than treating the model as an oracle.
Further Reading
- How the Transformer Architecture Works: the architectural mechanism that makes LLMs possible, covering attention, layers, and how context gets processed
- The Training Pipeline: why the model "knows" things, covering what data it saw, how it was trained, and what that means for capabilities and gaps