
The Training Pipeline: From Raw Text to Assistant

PM: Read in full (20 min)

The Gap Between "Predicts Text" and "Helpful Assistant"

A raw language model trained only to predict the next token is not a product. It completes text patterns in ways that are eerie but often useless. Turning a next-token predictor into a coherent assistant that follows instructions, refuses harmful requests, and produces helpful responses requires a multi-phase training pipeline.

Understanding this pipeline explains why models behave the way they do: why they're sometimes overconfident, sometimes overly hedged, and why the same base model produces vastly different products depending on how it's fine-tuned.

Phase 1: Pretraining

Goal: Teach the model language, world knowledge, and reasoning from massive unlabeled text.

The model initializes with random weights and trains to predict the next token across hundreds of billions to trillions of tokens from the internet, books, code repositories, and academic papers.

No labels, no instructions, no human feedback. The supervision signal is implicit: the ground truth for "what comes next" already exists in the training data. This makes LLM pretraining tractable at scale: no annotation workforce is required, only compute and data.
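To make the "implicit supervision" point concrete, here is a minimal sketch of how training examples fall out of raw text with no labeling step. It assumes whitespace splitting as a stand-in for a real tokenizer, and the helper name `next_token_pairs` and the `context_size` parameter are hypothetical, chosen only for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs from a token sequence.

    Each target is the token that follows its context in the original
    text, so the "label" comes for free from the data itself.
    """
    pairs = []
    for i in range(1, len(tokens)):
        start = max(0, i - context_size)  # keep at most context_size tokens
        pairs.append((tokens[start:i], tokens[i]))
    return pairs


# Assumption: whitespace splitting stands in for a real subword tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)

for context, target in pairs:
    print(context, "->", target)
```

Every position in the corpus yields one training example, which is why the amount of supervision scales with the amount of text rather than with annotation effort.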

What the model learns:

  • Syntax, grammar, and fluency
  • Factual knowledge from the training corpus (with its biases and gaps)
  • Code patterns, math procedures, domain vocabularies
  • Reasoning patterns observed in high-quality text

Pretraining is by far the most expensive phase: GPT-4-class models require thousands of GPUs running for months. This is why only a handful of labs pretrain frontier models.

Phase 2: Supervised Fine-Tuning (SFT)

Goal: Teach the model to follow the format of instructions and produce assistant-style responses.

After pretraining, the model knows language but doesn't know how to behave as an assistant. In SFT, human annotators create (instruction, ideal-response) pairs. The model trains on these pairs using standard next-token prediction, but now the text to predict is the desired assistant response.
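One way to picture "the text to predict is the desired assistant response" is as the same next-token loss, counted only on response tokens. The sketch below assumes that masking convention (common in practice, but not stated in this article) and uses made-up probabilities; `sft_loss` and its arguments are hypothetical names.

```python
import math


def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs[i]: probability the model assigned to the correct token i.
    is_response[i]: True where token i belongs to the ideal response,
                    False where it belongs to the instruction.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)


# Hypothetical example: 3 instruction tokens followed by 2 response tokens.
token_probs = [0.9, 0.2, 0.5, 0.5, 0.25]
is_response = [False, False, False, True, True]

loss = sft_loss(token_probs, is_response)
print(f"SFT loss over response tokens: {loss:.4f}")
```

The instruction tokens still condition the prediction; they just don't contribute to the loss, so the model is graded on how well it writes the response rather than on how well it predicts the prompt.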

SFT transforms a "continue this text" model into a "follow this instruction" model. It teaches the format and style of helpfulness: not the value judgment of what's harmful, just the surface pattern of a good response.

Phase 3: Alignment (RLHF / Constitutional AI)

Goal: Teach the model to be helpful, harmless, and honest in situations not fully covered by SFT.

RLHF (Reinforcement Learning from Human Feedback):

  1. The model generates several candidate responses to the same prompt.
  2. Human annotators rank the responses.
  3. A reward model trains to predict those rankings.
  4. The original model fine-tunes via reinforcement learning to maximize the reward model's scores.
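Steps 2 and 3 above can be sketched as follows: a ranking is broken into (preferred, rejected) pairs, and the reward model is penalized whenever it scores the rejected response higher. The loss shown is the standard pairwise form, -log(sigmoid(r_preferred - r_rejected)); the scores are invented for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# An annotator's ranking for one prompt, best first.
ranking = ["response_a", "response_b", "response_c"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores from a reward model partway through training.
scores = {"response_a": 2.0, "response_b": 0.5, "response_c": -1.0}

mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(pairs)
print(f"mean pairwise loss: {mean_loss:.4f}")
```

Step 4 then treats the trained reward model's score as the signal to maximize during reinforcement learning; that optimization loop is omitted here.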

Constitutional AI (Anthropic's approach): instead of relying on human rankers for every comparison, the model critiques its own outputs against a written "constitution" of principles. This scales better than pure RLHF on harmlessness while preserving helpfulness.
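The self-critique idea can be illustrated with a rough sketch. This is not Anthropic's implementation: the prompt wording, the `critique_and_revise` helper, and the stand-in `fake_model` are all assumptions made for illustration, and a real system would call an actual language model where `generate` appears.

```python
def critique_and_revise(generate, prompt, principles):
    """Draft a response, then critique and revise it once per principle.

    generate: any callable mapping a text prompt to a text completion.
    """
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle: {principle}\n\n"
            f"Response: {draft}"
        )
        draft = generate(
            "Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {draft}"
        )
    return draft


# Stand-in model so the sketch runs without any external service.
calls = []


def fake_model(text):
    calls.append(text)
    return f"output {len(calls)}"


final = critique_and_revise(fake_model, "Explain X", ["Be honest.", "Avoid harm."])
print(final, "after", len(calls), "model calls")
```

The point of the sketch is the shape of the loop: the written principles, rather than a human ranker, supply the feedback at each step.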

Why This Pipeline Shapes Model Behavior

Calibration issues come from pretraining. If the training data confidently states wrong facts, the model learns to state those facts just as confidently. Alignment can teach the model to hedge when it is uncertain, but it doesn't rewrite the model's factual knowledge.

Over-refusal comes from misapplied RLHF. If the reward model penalizes harmful content and fine-tuning overshoots, the model refuses things that aren't actually harmful. This is a known failure mode of heavy alignment.

System prompt effectiveness comes from SFT. The model learned to follow system prompt format because SFT data included many system-prompt-plus-instruction pairs. A well-written system prompt works because the model was trained on well-structured examples.

Knowledge cutoff is a pretraining artifact. The model knows only what was in the training corpus and nothing after the cutoff date. Short of retraining, retrieval-augmented generation (RAG), tool use, and prompting with current context are the only ways to provide post-cutoff information.
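As a small illustration of "prompting with current context", the sketch below assembles a prompt that carries today's date and a few retrieved snippets. The prompt layout, the `build_prompt` helper, and the snippet text are hypothetical; the retrieval step that would produce the snippets is assumed and not shown.

```python
from datetime import date


def build_prompt(question, snippets, today):
    """Prepend the current date and retrieved snippets to a question."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return (
        f"Today's date: {today.isoformat()}\n"
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder snippets standing in for the output of a search or retrieval step.
snippets = [
    "Release notes: version 3.2 shipped last week.",
    "Changelog: version 3.2 deprecates the legacy export API.",
]
prompt = build_prompt("What changed in the latest release?", snippets, date(2025, 1, 15))
print(prompt)
```

The model's weights are unchanged; it simply reads the newer information from its input, which is why this approach works without retraining.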

PM Takeaway

When a model fails, ask which phase produced the failure. Factual errors trace to pretraining. Instruction-following problems trace to SFT. Value misalignment (over-refusing reasonable requests, generating harmful content) traces to the alignment phase. Different failure modes have different fixes.

Further Reading

Ouyang, L. et al., NeurIPS 2022
Read: Abstract, Section 1, Figure 2 (the three-phase diagram). Skip: Sections 2–4 (model details, evaluation results tables).
InstructGPT showed that RLHF (having humans rank model outputs and training toward those preferences) dramatically improves helpfulness without sacrificing capability.