Glossary

Plain-English definitions for terms used throughout this primer. Consult as needed; this is a reference page, not required reading front to back. Terms are alphabetical.


Agent: An LLM given the ability to take actions (tool calls) and receive observations in a loop, repeating until the goal is achieved. Agents can browse the web, run code, query databases, or call any API; in short, anything expressible as a tool.
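
The act-and-observe loop can be sketched in a few lines. This is a minimal illustration, not any framework's API: `scripted_model`, `TOOLS`, and the `call ...` / `final ...` text protocol are all hypothetical stand-ins for a real LLM and its tool-calling format.

```python
from typing import Callable

# Hypothetical tools: plain Python functions keyed by name.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "upper": lambda arg: arg.upper(),
}

def scripted_model(history: list[str]) -> str:
    """Stand-in for an LLM: picks the next step from the history so far."""
    if not any(line.startswith("observation:") for line in history):
        return "call add 2+3"  # no observation yet, so request a tool call
    last_observation = history[-1].removeprefix("observation: ")
    return "final The sum is " + last_observation

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        decision = scripted_model(history)
        kind, _, rest = decision.partition(" ")
        if kind == "final":
            return rest  # the model declared the goal achieved
        tool_name, _, tool_arg = rest.partition(" ")
        result = TOOLS[tool_name](tool_arg)        # act
        history.append(f"observation: {result}")   # observe
    return "stopped: step limit reached"

answer = run_agent("What is 2 + 3?")
```

The `max_steps` cap is a design choice worth keeping in real systems too: without it, a model that never emits a final answer would loop indefinitely.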

Attention mechanism: The mechanism by which each token in a sequence attends to (weighs the importance of) every other token, enabling the model to understand context and relationships. Attention is what lets "bank" mean a financial institution in one sentence and a riverbank in another.
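
As a rough sketch of the weighting step, the following computes scaled dot-product attention (the form used in the Transformer paper) with plain Python lists. The tiny 2-dimensional vectors are made up for illustration, and real models first apply learned query, key, and value projections, which are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up token vectors; here queries, keys, and values are the same vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how much the corresponding token draws on every token in the sequence.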

Autoregressive: A generation style where each output token is conditioned on all previous tokens; the model generates one token at a time. All major chat models (GPT, Claude, Gemini, Llama) are autoregressive; they cannot go back and change earlier tokens once written.
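
The one-token-at-a-time loop looks roughly like this. The lookup table `NEXT_TOKEN_PROBS` is a hypothetical toy model that only looks at the last token; a real LLM computes the distribution from the entire prefix.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(prefix):
    """A real model would condition on the whole prefix, not just prefix[-1]."""
    return NEXT_TOKEN_PROBS[prefix[-1]]

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        probs = next_token_distribution(tokens)
        choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(choice)  # appended tokens are never revised
        if choice == "</s>":
            break
    return tokens

sequence = generate()
```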

Base model: A model trained only on next-token prediction, before any instruction tuning or alignment. Knows language but doesn't follow instructions well. A base model asked "What is the capital of France?" is as likely to continue with more questions as to answer the one asked.

BPE (Byte-Pair Encoding): A tokenization algorithm that builds a vocabulary by repeatedly merging the most frequent pair of adjacent symbols in training text, starting from single bytes or characters. The result is a vocabulary where common words are single tokens and rare words are split into subword fragments.
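
A minimal sketch of the merge loop, assuming a tiny made-up word-frequency table and operating on characters rather than raw bytes for readability. Production tokenizers add pre-tokenization, byte-level handling, and special tokens that are not shown.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Made-up word counts for illustration.
merges, vocab_words = train_bpe(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4
)
```

After four merges the frequent word "low" has become a single symbol, while the rarer "lower" is still split into "low", "e", "r".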

Chain-of-thought: A prompting technique that asks the model to "think step by step," improving performance on multi-step reasoning tasks. The key insight is that reasoning written out in tokens is more reliable than reasoning that must happen implicitly inside a single forward pass.

Context window: The total amount of text (measured in tokens) a model can process in a single inference call. Includes the system prompt, conversation history, retrieved context, and user message, and in most systems the tokens the model generates as well. Everything outside the context window is invisible to the model.

Decoder-only: A transformer architecture that generates tokens left-to-right using causal attention. The architecture used by GPT, Claude, Llama, and most frontier chat models. "Decoder-only" means there is no separate encoder component; the same transformer stack handles both understanding and generation.

Embedding: A dense vector representation of text. Similar meanings result in vectors that are geometrically close. Enables semantic search and RAG retrieval: "automobile" and "car" end up near each other in embedding space even though they share no characters.
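
"Geometrically close" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, hand-picked so that related words point the same way.
EMBEDDINGS = {
    "car":        [0.90, 0.10, 0.05],
    "automobile": [0.85, 0.15, 0.10],
    "banana":     [0.05, 0.20, 0.95],
}

def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))

closest_to_car = nearest("car")
```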

Encoder: A transformer component that processes all input tokens bidirectionally to produce rich contextual representations. Used in BERT-style models for classification and retrieval tasks. Unlike a decoder, an encoder can see the full input before producing any output.

Fine-tuning: Continued training of a pre-trained model on a specific dataset to adapt its behavior. Includes SFT (instruction tuning) and RLHF (preference alignment). Fine-tuning modifies model weights; prompting does not.

Function call: A structured request from an LLM to invoke an external capability, specifying the function name and arguments as a JSON object. The model doesn't execute the function itself; it outputs a structured description of what it wants called, and the calling application runs it.
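
The division of labor can be sketched as follows. The JSON shape in `model_output`, the `get_weather` tool, and the `REGISTRY` lookup are assumptions for illustration; the exact wire format differs between providers.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return {"city": city, "temperature": 21, "unit": unit}

# The application decides which functions the model is allowed to request.
REGISTRY = {"get_weather": get_weather}

# Example of what a model might emit (assumed shape, not a specific API).
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

def dispatch(raw: str) -> str:
    """Parse the model's request, run the function, return a JSON result."""
    call = json.loads(raw)
    func = REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown function {call['name']}"})
    result = func(**call["arguments"])  # the application runs it, not the model
    return json.dumps(result)

tool_result = dispatch(model_output)
```

The returned JSON string would then be passed back to the model as an observation so it can write its final answer.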

Hallucination: When a model generates confident-sounding but factually incorrect output. A natural consequence of next-token prediction, not a bug. The model learned to produce fluent, plausible text; it has no built-in mechanism to verify whether a generated claim is true.

In-context learning: The ability of a model to learn from examples provided in the prompt (few-shot) without any weight updates. A model shown three examples of a task will typically perform better on a fourth than it would with zero examples, purely from pattern-matching in the context window.

Instruction tuning: Fine-tuning a base model on (instruction, response) pairs to make it follow user directions. Also called SFT. The step that turns a capable-but-unruly base model into a useful assistant.

KV cache: A performance optimization that stores the attention keys and values already computed for earlier tokens, so generating each new token does not require recomputing them. Without a KV cache, every generated token would require reprocessing the entire preceding sequence; some serving systems also keep the cache between conversation turns so that shared history is not reprocessed.
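
To make the saving concrete, this sketch only counts how many per-token key/value projections are needed with and without a cache. It is a simplified cost model under the assumption that one projection per position is the unit of work; it does not run a real transformer.

```python
def count_kv_projections(prompt_len, new_tokens, use_cache):
    """Count key/value projections needed to generate `new_tokens` tokens."""
    projections = 0
    cached = 0              # positions whose keys/values are already stored
    seq_len = prompt_len
    for _ in range(new_tokens):
        if use_cache:
            projections += seq_len - cached  # only positions not yet cached
            cached = seq_len
        else:
            projections += seq_len           # recompute every position
        seq_len += 1                         # one new token is generated
    return projections

without_cache = count_kv_projections(prompt_len=100, new_tokens=50, use_cache=False)
with_cache = count_kv_projections(prompt_len=100, new_tokens=50, use_cache=True)
```

With a 100-token prompt and 50 generated tokens, the uncached count grows roughly quadratically with sequence length while the cached count grows linearly. The trade-off is memory: the cache must hold keys and values for every position.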

LLM: Large Language Model. A transformer-based model trained on large text datasets to predict the next token; used for chat, code, analysis, and more. "Large" originally referred to parameter count in the billions; now it's a loose term for any frontier-class model.

Mixture of Experts (MoE): An architecture where a router selects a small subset of specialized "expert" sub-networks for each token, reducing per-token compute. A 70B MoE model might activate only 14B parameters per token, giving large-model quality at smaller-model inference cost.
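
Routing can be sketched as "pick the top-k experts, then mix their outputs." The scalar `EXPERTS` and the choice to renormalize gate weights with a softmax over only the selected experts are illustrative assumptions; real MoE layers use learned routers, neural-network experts, and varying gating details.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

# Hypothetical experts: each is just a scalar function here.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def moe_layer(x, router_logits, k=2):
    chosen = route(router_logits, k)
    # Only the chosen experts run; the others cost nothing for this token.
    output = sum(gate * EXPERTS[i](x) for i, gate in chosen)
    return output, [i for i, _ in chosen]

output, used = moe_layer(3.0, router_logits=[0.1, 2.0, -1.0, 2.0])
```

Here only two of the four experts are evaluated for the input, which is where the per-token compute saving comes from.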

Multimodal: A model or system that processes multiple input types (text, images, audio, video) rather than text alone. GPT-4V, Claude 3, and Gemini Ultra are multimodal: they can reason about images attached alongside text.

Parameter: A learned numerical weight in the model. Modern LLMs have billions to hundreds of billions of parameters. Parameters are what get updated during training and stored in the model file; they encode everything the model "knows."

Perplexity: A measure of how well a model predicts a text sample. Lower perplexity = better prediction. Used as a training-time metric; not a reliable benchmark for downstream task quality, but useful for comparing models on the same dataset.
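
Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each token that actually occurred. A minimal sketch, assuming the per-token probabilities have already been obtained from a model (the numbers below are made up):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Made-up probabilities a model might assign to each actual next token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
uncertain = perplexity([0.2, 0.1, 0.3, 0.25])
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

A useful intuition: a model that is always choosing uniformly among four options has perplexity 4, so perplexity reads as an "effective number of choices" per token.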

RAG (Retrieval-Augmented Generation): A pattern where relevant documents are retrieved at query time and injected into the prompt, grounding the model's answer in external knowledge. RAG is how you connect an LLM to a private knowledge base without retraining the model.
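
The retrieve-then-inject pattern can be sketched as below. Word overlap stands in for embedding similarity to keep the example dependency-free, and `DOCUMENTS` and the prompt wording are made up; a real system would typically use an embedding model and a vector index.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    return set(text.lower().replace("?", "").replace(".", "").split())

# Hypothetical private knowledge base.
DOCUMENTS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Premium plans include priority support.",
]

def retrieve(query, docs, top_k=1):
    """Rank documents by word overlap with the query (toy relevance score)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, docs):
    """Inject the retrieved text into the prompt sent to the model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_prompt("How long do refunds take?", DOCUMENTS)
```

The model never sees the whole knowledge base, only the retrieved snippet placed in its context window.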

RLHF: Reinforcement Learning from Human Feedback. A training technique where a reward model trained on human preference rankings is used to fine-tune the LLM. RLHF is a major reason modern chat models are more helpful and less toxic than their base model counterparts.

Scaling laws: Empirical findings showing that model loss decreases predictably as a power law of model size, data, and compute. Established by Kaplan et al. (2020) and refined by Hoffmann et al. (Chinchilla, 2022); the Chinchilla result showed most models were undertrained relative to their parameter count.

Self-attention: The attention mechanism applied within a single sequence, allowing each token to attend to all other tokens in that same sequence. "Self" distinguishes it from cross-attention (attending to a separate sequence, as in encoder-decoder models).

SFT (Supervised Fine-Tuning): Fine-tuning on labeled (prompt, response) pairs to teach instruction-following behavior. The first step after pre-training in most modern alignment pipelines; RLHF typically follows.

Temperature: A sampling parameter that controls the "peakedness" of the token probability distribution. Low temperature (near 0) is close to deterministic, almost always picking the highest-probability token; a temperature of 1 samples from the model's unmodified distribution; higher values flatten the distribution and make output more random.
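
In the usual formulation, temperature divides the logits before the softmax. A minimal sketch with made-up logits (how a given API treats a temperature of exactly 0 varies; here it is rejected and left to a greedy argmax):

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)    # nearly all mass on the top token
neutral = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
flat = softmax_with_temperature(logits, 5.0)     # closer to uniform
```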

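Token: The basic unit a model processes. Approximately 4 characters of English text, or roughly three-quarters of a word, on average. LLM pricing and context limits are typically measured in tokens. Token counts do not map one-to-one onto words: common short words are usually a single token, while long or rare words are split into several, so the exact count for a phrase depends on the tokenizer.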

Tokenizer: The component that converts raw text into token IDs before feeding to the model, and converts output token IDs back to text. Different models use different tokenizers, so the same text produces different token counts across GPT, Claude, and Llama.

Transformer: The neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., NeurIPS 2017) that underlies virtually every modern LLM. The key innovation was replacing recurrence with self-attention, enabling much more effective parallelism during training.

Zero-shot: Asking a model to perform a task with no examples provided in the prompt. Contrasts with few-shot prompting, which includes a handful of worked examples (typically 1 to 5). Chain-of-thought is a separate choice: it can be combined with either zero-shot or few-shot prompts. Modern frontier models handle many tasks zero-shot that earlier models needed examples for.