Glossary
Plain-English definitions for terms used throughout this primer. Consult as needed โ this is a reference page, not required reading front to back. Terms are alphabetical.
Agent โ An LLM given the ability to take actions (tool calls) and receive observations in a loop, repeating until the goal is achieved. Agents can browse the web, run code, query databases, or call any API โ anything expressible as a tool.
Attention mechanism โ The mechanism by which each token in a sequence attends to (weighs the importance of) every other token, enabling the model to understand context and relationships. Attention is what lets "bank" mean a financial institution in one sentence and a riverbank in another.
Autoregressive โ A generation style where each output token is conditioned on all previous tokens; the model generates one token at a time. All major chat models (GPT, Claude, Gemini, Llama) are autoregressive โ they cannot go back and change earlier tokens once written.
Base model โ A model trained only on next-token prediction, before any instruction tuning or alignment. Knows language but doesn't follow instructions well. A base model asked "What is the capital of France?" is as likely to continue with more questions as to answer the one asked.
BPE (Byte-Pair Encoding) โ A tokenization algorithm that builds a vocabulary by repeatedly merging the most frequent adjacent byte pairs in training text. The result is a vocabulary where common words are single tokens and rare words are split into subword fragments.
Chain-of-thought โ A prompting technique that asks the model to "think step by step," improving performance on multi-step reasoning tasks. The key insight is that reasoning written out in tokens is more reliable than reasoning that must happen implicitly inside a single forward pass.
Context window โ The total amount of text (measured in tokens) a model can process in a single inference call. Includes system prompt, history, retrieved context, and user message. Everything outside the context window is invisible to the model.
Decoder-only โ A transformer architecture that generates tokens left-to-right using causal attention. The architecture used by GPT, Claude, Llama, and most frontier chat models. "Decoder-only" means there is no separate encoder component โ the same transformer stack handles both understanding and generation.
Embedding โ A dense vector representation of text. Similar meanings result in vectors that are geometrically close. Enables semantic search and RAG retrieval: "automobile" and "car" end up near each other in embedding space even though they share no characters.
Encoder โ A transformer component that processes all input tokens bidirectionally to produce rich contextual representations. Used in BERT-style models for classification and retrieval tasks. Unlike a decoder, an encoder can see the full input before producing any output.
Fine-tuning โ Continued training of a pre-trained model on a specific dataset to adapt its behavior. Includes SFT (instruction tuning) and RLHF (preference alignment). Fine-tuning modifies model weights; prompting does not.
Function call โ A structured request from an LLM to invoke an external capability, specifying the function name and arguments as a JSON object. The model doesn't execute the function itself โ it outputs a structured description of what it wants called, and the calling application runs it.
Hallucination โ When a model generates confident-sounding but factually incorrect output. A natural consequence of next-token prediction, not a bug. The model learned to produce fluent, plausible text โ it has no built-in mechanism to verify whether a generated claim is true.
In-context learning โ The ability of a model to learn from examples provided in the prompt (few-shot) without any weight updates. A model shown three examples of a task will perform better on a fourth example than with zero examples, purely from pattern-matching in the context window.
Instruction tuning โ Fine-tuning a base model on (instruction, response) pairs to make it follow user directions. Also called SFT. The step that turns a capable-but-unruly base model into a useful assistant.
KV cache โ A performance optimization that stores previously computed attention keys and values, avoiding recomputation across conversation turns. Without a KV cache, processing a 10-turn conversation would require reprocessing the entire history on every new turn.
LLM โ Large Language Model. A transformer-based model trained on large text datasets to predict the next token; used for chat, code, analysis, and more. "Large" originally referred to parameter count in the billions; now it's a loose term for any frontier-class model.
Mixture of Experts (MoE) โ An architecture where a router selects a small subset of specialized "expert" sub-networks for each token, reducing per-token compute. A 70B MoE model might activate only 14B parameters per token, giving large-model quality at smaller-model inference cost.
Multimodal โ A model or system that processes multiple input types (text, images, audio, video) rather than text alone. GPT-4V, Claude 3, and Gemini Ultra are multimodal โ they can reason about images described or attached alongside text.
Parameter โ A learned numerical weight in the model. Modern LLMs have billions to hundreds of billions of parameters. Parameters are what get updated during training and stored in the model file โ they encode everything the model "knows."
Perplexity โ A measure of how well a model predicts a text sample. Lower perplexity = better prediction. Used as a training-time metric; not a reliable benchmark for downstream task quality, but useful for comparing models on the same dataset.
RAG (Retrieval-Augmented Generation) โ A pattern where relevant documents are retrieved at query time and injected into the prompt, grounding the model's answer in external knowledge. RAG is how you connect an LLM to a private knowledge base without retraining the model.
RLHF โ Reinforcement Learning from Human Feedback. A training technique where a reward model trained on human preference rankings is used to fine-tune the LLM. RLHF is the primary reason modern chat models are more helpful and less toxic than their base model counterparts.
Scaling laws โ Empirical findings showing that model loss decreases predictably as a power law of model size, data, and compute. Established by Kaplan et al. (2020) and refined by Hoffmann et al. (Chinchilla, 2022) โ the Chinchilla result showed most models were undertrained relative to their parameter count.
Self-attention โ The attention mechanism applied within a single sequence, allowing each token to attend to all other tokens in that same sequence. "Self" distinguishes it from cross-attention (attending to a separate sequence, as in encoder-decoder models).
SFT (Supervised Fine-Tuning) โ Fine-tuning on labeled (prompt, response) pairs to teach instruction-following behavior. The first step after pre-training in most modern alignment pipelines; RLHF typically follows.
Temperature โ A sampling parameter that controls the "peakedness" of the token probability distribution. Low temperature (near 0) = deterministic, picks the highest-probability token every time; high temperature (1+) = more random, samples more broadly from the distribution.
Token โ The basic unit a model processes. Approximately 4 characters for English, roughly three-quarters of a word on average. All LLM pricing and context limits are measured in tokens. "The quick brown fox" is 4 words but 5 tokens in GPT tokenization.
Tokenizer โ The component that converts raw text into token IDs before feeding to the model, and converts output token IDs back to text. Different models use different tokenizers โ the same text produces different token counts across GPT, Claude, and Llama.
Transformer โ The neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., NeurIPS 2017) that underlies virtually every modern LLM. The key innovation was replacing recurrence with self-attention, enabling much more effective parallelism during training.
Zero-shot โ Asking a model to perform a task with no examples provided in the prompt. Contrasts with few-shot (1โ5 examples) and chain-of-thought (step-by-step reasoning examples). Modern frontier models handle many tasks zero-shot that earlier models needed examples for.