Glossary

Plain-English definitions for terms used throughout this primer. Consult as needed — this is a reference page, not required reading front to back. Terms are alphabetical.

Agent — An LLM given the ability to take actions (tool calls) and receive observations in a loop, repeating until the goal is achieved. Agents can browse the web, run code, query databases, or call any API — anything expressible as a tool.
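
The act-observe loop described above can be sketched in a few lines. This is a minimal illustration, not any framework's API: `run_agent`, `scripted_model`, the action dictionary shape, and the `add` tool are all hypothetical, and the "model" is a scripted stand-in for a real LLM call.

```python
def run_agent(model, tools, goal, max_steps=5):
    """Repeat: ask the model for an action, run the tool, record the observation."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "finish":
            return action["answer"]
        # The application, not the model, executes the tool.
        observation = tools[action["tool"]](**action["args"])
        history.append(("observation", observation))
    return None  # gave up after max_steps


def scripted_model(history):
    """Hypothetical stand-in for an LLM: call a tool once, then finish."""
    kind, payload = history[-1]
    if kind == "goal":
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The sum is {payload}"}


tools = {"add": lambda a, b: a + b}
answer = run_agent(scripted_model, tools, "Add 2 and 3")
print(answer)
```

The `max_steps` cap is a design choice worth keeping in any real loop, since nothing else guarantees the model will ever decide it is finished.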

Attention mechanism — The mechanism by which each token in a sequence attends to (weighs the importance of) every other token, enabling the model to understand context and relationships. Attention is what lets "bank" mean a financial institution in one sentence and a riverbank in another.

Autoregressive — A generation style where each output token is conditioned on all previous tokens; the model generates one token at a time. All major chat models (GPT, Claude, Gemini, Llama) are autoregressive — they cannot go back and change earlier tokens once written.

Base model — A model trained only on next-token prediction, before any instruction tuning or alignment. Knows language but doesn't follow instructions well. A base model asked "What is the capital of France?" is as likely to continue with more questions as to answer the one asked.

BPE (Byte-Pair Encoding) — A tokenization algorithm that builds a vocabulary by repeatedly merging the most frequent adjacent pair of symbols in the training text, starting from individual bytes or characters. The result is a vocabulary where common words are single tokens and rare words are split into subword fragments.
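
A toy version of the merge procedure may make this concrete. The sketch below works on characters rather than raw bytes for readability, splits on whitespace, and breaks ties by first occurrence; those are simplifying assumptions, and the function names are illustrative rather than taken from any tokenizer library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus leftover characters.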

Chain-of-thought — A prompting technique that asks the model to "think step by step," improving performance on multi-step reasoning tasks. The key insight is that reasoning written out in tokens is more reliable than reasoning that must happen implicitly inside a single forward pass.

Context window — The total amount of text (measured in tokens) a model can process in a single inference call. Includes the system prompt, conversation history, retrieved context, the user message, and, for most models, the output being generated. Everything outside the context window is invisible to the model.

Decoder-only — A transformer architecture that generates tokens left-to-right using causal attention. The architecture used by GPT, Claude, Llama, and most frontier chat models. "Decoder-only" means there is no separate encoder component — the same transformer stack handles both understanding and generation.

Embedding — A dense vector representation of text. Texts with similar meanings map to vectors that are geometrically close. This enables semantic search and RAG retrieval: "automobile" and "car" end up near each other in embedding space even though the two strings look nothing alike.
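
"Geometrically close" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not output from any real model.
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

sim_close = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_far = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(sim_close > sim_far)  # True
```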

Encoder — A transformer component that processes all input tokens bidirectionally to produce rich contextual representations. Used in BERT-style models for classification and retrieval tasks. Unlike a decoder, an encoder can see the full input before producing any output.

Fine-tuning — Continued training of a pre-trained model on a specific dataset to adapt its behavior. Includes SFT (instruction tuning) and RLHF (preference alignment). Fine-tuning modifies model weights; prompting does not.

Function call — A structured request from an LLM to invoke an external capability, specifying the function name and arguments as a JSON object. The model doesn't execute the function itself — it outputs a structured description of what it wants called, and the calling application runs it.
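
The division of labor can be sketched as follows. The JSON shape (`name` and `arguments` keys), the `get_weather` function, and the `TOOLS` registry are assumptions made for this example; each provider's API defines its own exact format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry mapping names the model may request to real Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse the model's structured request and run it in the application."""
    call = json.loads(model_output)
    func = TOOLS[call["name"]]  # raises KeyError for unknown tools
    return func(**call["arguments"])


# What the model emits is only text describing the call it wants made.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch(model_output)
print(result)  # Sunny in Paris
```

Looking the name up in an explicit registry, rather than evaluating model output directly, keeps the model limited to the functions the application chose to expose.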

Hallucination — When a model generates confident-sounding but factually incorrect output. A natural consequence of next-token prediction, not a bug. The model learned to produce fluent, plausible text — it has no built-in mechanism to verify whether a generated claim is true.

In-context learning — The ability of a model to learn from examples provided in the prompt (few-shot) without any weight updates. A model shown three examples of a task will typically perform better on a fourth than it would with zero examples, purely from pattern-matching in the context window.

Instruction tuning — Fine-tuning a base model on (instruction, response) pairs to make it follow user directions. Also called SFT. The step that turns a capable-but-unruly base model into a useful assistant.

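KV cache — A performance optimization that stores the attention keys and values already computed for earlier tokens, so each newly generated token needs only its own keys and values computed. Without a KV cache, generating every new token would require recomputing keys and values for the entire preceding sequence. Reusing a cache across conversation turns is a related optimization, often called prompt or prefix caching.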
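
The saving is easy to see by counting work. In this toy sketch, `project_kv` is a made-up stand-in for the learned key and value projections; the point is only how many times it runs with and without stored results.

```python
def project_kv(token_vec):
    """Stand-in for learned key/value projections (toy arithmetic, not real weights)."""
    key = [2.0 * x for x in token_vec]
    value = [x + 1.0 for x in token_vec]
    return key, value


def generate_with_cache(token_vecs):
    """Compute each token's key/value once and keep them."""
    cache_k, cache_v, projections = [], [], 0
    for vec in token_vecs:
        k, v = project_kv(vec)
        cache_k.append(k)
        cache_v.append(v)
        projections += 1
    return cache_k, cache_v, projections


def generate_without_cache(token_vecs):
    """Recompute keys/values for the whole prefix at every step."""
    keys, values, projections = [], [], 0
    for step in range(1, len(token_vecs) + 1):
        keys, values = [], []
        for vec in token_vecs[:step]:
            k, v = project_kv(vec)
            keys.append(k)
            values.append(v)
            projections += 1
    return keys, values, projections


tokens = [[float(i)] for i in range(10)]
k_cached, v_cached, n_cached = generate_with_cache(tokens)
k_plain, v_plain, n_plain = generate_without_cache(tokens)
print(n_cached, n_plain)  # 10 55
```

Both paths end with identical keys and values; the cached path does linear work in sequence length while the uncached path does quadratic work, at the cost of the memory needed to hold the cache.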

LLM — Large Language Model. A transformer-based model trained on large text datasets to predict the next token; used for chat, code, analysis, and more. "Large" originally referred to parameter count in the billions; now it's a loose term for any frontier-class model.

Mixture of Experts (MoE) — An architecture where a router selects a small subset of specialized "expert" sub-networks for each token, reducing per-token compute. A 70B MoE model might activate only 14B parameters per token, giving large-model quality at smaller-model inference cost.
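
A minimal sketch of top-k routing, under stated assumptions: the experts are plain Python functions on a single number, the router logits are passed in directly rather than produced by a learned layer, and the weights are renormalized with a softmax over only the selected experts (one common convention, not the only one).

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy "experts"; only two run for any given input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, 2.0, -1.0]

output = moe_forward(3.0, experts, router_logits, k=2)
print(output)  # 7.5
```

Experts 1 and 2 tie for the highest score, so each gets weight 0.5 and the result is 0.5 * 6 + 0.5 * 9. The other two experts are never evaluated, which is where the compute saving comes from.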

Multimodal — A model or system that processes multiple input types (text, images, audio, video) rather than text alone. Flagship models from OpenAI (GPT), Anthropic (Claude), and Google (Gemini) are multimodal: they can reason about images attached alongside text.

Parameter — A learned numerical weight in the model. Modern LLMs have billions to hundreds of billions of parameters. Parameters are what get updated during training and stored in the model file — they encode everything the model "knows."

Perplexity — A measure of how well a model predicts a text sample. Lower perplexity = better prediction. Used as a training-time metric; not a reliable benchmark for downstream task quality, but useful for comparing models on the same dataset.
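
Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each actual token. A small sketch, where the input list stands for per-token probabilities already obtained from some model:

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) over the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that gives every correct token probability 0.25 is as
# uncertain as a uniform choice among 4 options.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])

# Higher probability on the right tokens means lower perplexity.
confident = perplexity([0.9, 0.8, 0.95])
print(confident < uniform_four)  # True
```

One way to read the number: a perplexity of roughly 4 means the model was, on average, about as unsure as if it were choosing uniformly among 4 tokens at each position.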

RAG (Retrieval-Augmented Generation) — A pattern where relevant documents are retrieved at query time and injected into the prompt, grounding the model's answer in external knowledge. RAG is how you connect an LLM to a private knowledge base without retraining the model.
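
The retrieve-then-inject pattern can be sketched without any external services. Here word overlap stands in for embedding similarity, and the documents, prompt wording, and function names are all invented for illustration; a real system would use an embedding model and a vector index.

```python
def score(query, doc):
    """Crude relevance: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query, docs, top_k=1):
    """Return the top_k documents ranked by relevance to the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_k]


def build_prompt(query, docs, top_k=1):
    """Inject the retrieved text into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, docs, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


docs = [
    "The refund window is 30 days from delivery.",
    "Support is open on weekdays from 9 to 5.",
    "Shipping takes 3 to 5 business days.",
]
query = "How many days is the refund window?"
prompt = build_prompt(query, docs)
print(prompt)
```

The model never sees the full document collection, only whatever `retrieve` selects, so retrieval quality bounds answer quality.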

RLHF — Reinforcement Learning from Human Feedback. A training technique where a reward model trained on human preference rankings is used to fine-tune the LLM. RLHF is a major reason modern chat models are more helpful and less toxic than their base model counterparts.

Scaling laws — Empirical findings showing that model loss decreases predictably as a power law of model size, data, and compute. Established by Kaplan et al. (2020) and refined by Hoffmann et al. (Chinchilla, 2022) — the Chinchilla result showed most models were undertrained relative to their parameter count.

Self-attention — The attention mechanism applied within a single sequence, allowing each token to attend to all other tokens in that same sequence. "Self" distinguishes it from cross-attention (attending to a separate sequence, as in encoder-decoder models).
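
A stripped-down sketch of scaled dot-product self-attention in plain Python. To keep it short, it assumes a single head and skips the learned query, key, and value projections (each token vector plays all three roles), which a real transformer would not do. The `causal` flag shows the left-to-right masking used by decoder-only models.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, causal=False):
    """Each output row is a weighted average of the rows of X it may attend to."""
    d = len(X[0])
    out = []
    for i, q in enumerate(X):
        # Causal attention only looks at positions 0..i.
        limit = i + 1 if causal else len(X)
        scores = [
            sum(a * b for a, b in zip(q, X[j])) / math.sqrt(d)
            for j in range(limit)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * X[j][c] for j, w in enumerate(weights)) for c in range(d)]
        )
    return out


X = [[1.0, 0.0], [0.0, 1.0]]
full = self_attention(X)
masked = self_attention(X, causal=True)
print(masked[0])  # [1.0, 0.0]
```

With the causal mask, the first token can attend only to itself, so its output equals its input; without the mask, every token's output blends information from the whole sequence.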

SFT (Supervised Fine-Tuning) — Fine-tuning on labeled (prompt, response) pairs to teach instruction-following behavior. The first step after pre-training in most modern alignment pipelines; RLHF typically follows.

Temperature — A sampling parameter that controls the "peakedness" of the token probability distribution by dividing the logits before the softmax. Low temperature (near 0) = nearly deterministic, almost always picking the highest-probability token; temperature 1 samples from the model's unmodified distribution; higher values flatten the distribution and make output more random.
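
The effect is easiest to see on a small set of logits. This sketch applies the usual formulation, dividing logits by the temperature before the softmax; the logit values are arbitrary, and treating temperature 0 as an error (rather than as greedy decoding) is a simplification made here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.1)    # almost all mass on the top token
neutral = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
flat = softmax_with_temperature(logits, 10.0)    # close to uniform
print(sharp[0] > neutral[0] > flat[0])  # True
```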

Token — The basic unit a model processes. Approximately 4 characters for English, roughly three-quarters of a word on average. LLM pricing and context limits are generally measured in tokens. Common short words are usually a single token each, while longer or rarer words split into several tokens, so token counts typically exceed word counts.

Tokenizer — The component that converts raw text into token IDs before feeding to the model, and converts output token IDs back to text. Different models use different tokenizers — the same text produces different token counts across GPT, Claude, and Llama.

Transformer — The neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that underlies virtually every modern LLM. The key innovation was replacing recurrence with self-attention, enabling much more effective parallelism during training.

Zero-shot — Asking a model to perform a task with no examples provided in the prompt. Contrasts with one-shot (a single example) and few-shot (a handful of examples). Chain-of-thought is a separate axis: it can be used zero-shot ("think step by step") or with worked reasoning examples. Modern frontier models handle many tasks zero-shot that earlier models needed examples for.