Paper Reference Guide

All papers cited across this primer, grouped by topic. Each entry includes a PM reading guide — what to read and what to skip — so you get the key insight without wading through methodology sections.

Papers are free on arXiv or linked directly. Reading guides are calibrated for a PM/product audience; engineers and researchers should read further.

Foundations

Attention Is All You Need

Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Venue: NeurIPS 2017
URL: https://arxiv.org/abs/1706.03762

PM reading guide: Read the abstract, Section 1 (Introduction), and Figure 1 only. Skip Sections 3–5 (the mathematical formulation of attention and training details).

Takeaway: Introduced the Transformer — the architecture underlying virtually every modern LLM. The core idea is replacing recurrent connections with self-attention, which parallelizes better and handles long-range dependencies more effectively.
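
The self-attention idea can be sketched in a few lines. This is a minimal illustration, not the paper's full architecture: it assumes the queries, keys, and values are the token vectors themselves (a real Transformer first passes them through learned projections), and the three toy token vectors are made up.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (made up): three tokens, each a 2-dimensional vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each token's output depends only on the inputs, not on another token's output, which is why the loop over queries can run in parallel, unlike a recurrent network.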

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

Authors: Mikolov, Chen, Corrado, Dean
Venue: arXiv 2013
URL: https://arxiv.org/abs/1301.3781

PM reading guide: Read the abstract only. Skip everything else — the technical contribution is the training method, which you don't need.

Takeaway: Introduced word embeddings — words as vectors that capture semantic meaning. The famous result: king - man + woman ≈ queen. This was the conceptual ancestor of all modern embedding models.
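
The analogy arithmetic can be illustrated with a toy sketch. The vectors below are hand-made 2-dimensional stand-ins chosen so the example works, not learned Word2Vec embeddings, and the `analogy` helper is hypothetical.

```python
import math

# Hand-made toy vectors (NOT learned embeddings): axes are (royalty, maleness).
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")  # "queen" with these toy vectors
```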

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Devlin, Chang, Lee, Toutanova
Venue: NAACL 2019
URL: https://arxiv.org/abs/1810.04805

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 3–5.

Takeaway: The canonical encoder-only model. Bidirectional pre-training, in which the model predicts masked words using context from both the left and the right, dramatically improves performance on language-understanding tasks. BERT-style models still power many production retrieval and classification pipelines.

Language Models are Unsupervised Multitask Learners (GPT-2)

Authors: Radford, Wu, Child, Luan, Amodei, Sutskever (OpenAI)
Venue: OpenAI Technical Report 2019
URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

PM reading guide: Read the abstract and Section 1. Skip Sections 2–4.

Takeaway: GPT-2 showed a large decoder-only model trained on raw text could perform surprisingly well on diverse tasks with no task-specific training. Scale alone was generating emergent capabilities — a signal that pointed toward GPT-3 and everything after.

Training & Alignment

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

Authors: Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe (OpenAI)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.02155

PM reading guide: Read the abstract, Section 1, and Figure 2 (the three-phase training diagram). Skip Sections 2–4.

Takeaway: RLHF — having humans rank model outputs and training toward those preferences — dramatically improves helpfulness. This paper is the direct ancestor of ChatGPT and every instruction-following model since.
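
The "training toward those preferences" step usually starts with a reward model trained on pairs of ranked outputs. A minimal sketch of the pairwise loss, assuming the reward scores are already computed (the function name and the numbers are illustrative):

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the preferred output scores higher."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))

agree = pairwise_preference_loss(2.0, 0.0)     # reward model agrees with the human ranking
tie = pairwise_preference_loss(1.0, 1.0)       # reward model cannot tell the outputs apart
disagree = pairwise_preference_loss(0.0, 2.0)  # reward model contradicts the human ranking
```

The loss falls as the reward model scores the human-preferred output further above the rejected one. The trained reward model then supplies the signal for the reinforcement learning phase.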

Constitutional AI: Harmlessness from AI Feedback

Authors: Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, Goldie, Mirhoseini, McKinnon, et al. (Anthropic)
Venue: Anthropic Technical Report 2022
URL: https://arxiv.org/abs/2212.08073

PM reading guide: Read the abstract and Section 1. Skip Sections 2–5.

Takeaway: Anthropic's alignment approach — the model critiques its own outputs against a written "constitution" of principles — scales better than pure RLHF by replacing the most expensive part (human feedback on every output) with AI feedback guided by explicit principles.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors: Bai, Jones, Ndousse, Askell, Chen, DasSarma, Drain, Fort, Ganguli, Henighan, et al. (Anthropic)
Venue: arXiv 2022
URL: https://arxiv.org/abs/2204.05862

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: Empirical study of RLHF showing the tradeoff between helpfulness and harmlessness — making a model safer often makes it less helpful, and vice versa. This tension is still unresolved and relevant to every AI product decision.

Architecture & Scaling

Scaling Laws for Neural Language Models

Authors: Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei (OpenAI)
Venue: arXiv 2020
URL: https://arxiv.org/abs/2001.08361

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.

Takeaway: Model performance improves predictably with scale (parameters, data, compute) as a power law. This finding justified the massive compute investments behind GPT-3 and the models that followed.
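
To get a feel for what "predictably, as a power law" means, here is a small sketch. It assumes loss scales as N^(-alpha) in parameter count N, with alpha set to roughly 0.076, the approximate exponent the paper reports for model size. Treat the number as illustrative rather than a constant that holds for every model family.

```python
ALPHA_N = 0.076  # approximate parameter-count exponent from the paper; illustrative only

def loss_ratio(scale_factor, alpha=ALPHA_N):
    """Factor by which loss changes when model size is multiplied by scale_factor,
    assuming loss is proportional to N ** (-alpha)."""
    return scale_factor ** (-alpha)

ratio_2x = loss_ratio(2)    # about 0.95: doubling parameters trims loss by roughly 5%
ratio_10x = loss_ratio(10)  # about 0.84: 10x parameters trims loss by roughly 16%
```

The gains per doubling are small but steady, which is why the curve could be extrapolated to justify much larger training runs.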

Training Compute-Optimal Large Language Models (Chinchilla)

Authors: Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, Sifre (DeepMind)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.15556

PM reading guide: Read the abstract only.

Takeaway: Optimal training uses roughly 20 tokens per parameter. Many large models (including GPT-3) were undertrained — they had too many parameters relative to the data they were trained on. Chinchilla, trained on more data with fewer parameters, matched or beat much larger models.
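
The 20-tokens-per-parameter rule of thumb is easy to check with arithmetic. The sketch below uses the publicly reported sizes for Chinchilla (70B parameters, 1.4T tokens) and GPT-3 (175B parameters, about 300B tokens); the helper names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the takeaway above

def compute_optimal_tokens(n_params):
    """Training tokens suggested by the rule of thumb for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

chinchilla_ratio = tokens_per_param(70e9, 1.4e12)    # 20 tokens per parameter
gpt3_ratio = tokens_per_param(175e9, 300e9)          # about 1.7 tokens per parameter
gpt3_optimal_tokens = compute_optimal_tokens(175e9)  # 3.5 trillion tokens
```

By this measure GPT-3 saw less than a tenth of the data the rule of thumb would suggest for its size.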

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Authors: Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean (Google)
Venue: ICLR 2017
URL: https://arxiv.org/abs/1701.06538

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.

Takeaway: A learned router selects which expert sub-networks to activate per token, enabling very large models at reduced per-token compute cost. The foundation for Mixtral, DeepSeek, Gemini, and numerous other MoE models — both open-weight and closed.
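
A minimal sketch of the routing step, under simplifying assumptions: the router scores are passed in directly (in a real model they come from a learned layer applied to the token), and the "experts" are toy scalar functions rather than neural sub-networks.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Pick the k experts with the highest router scores and renormalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts; the rest are never run."""
    return sum(gate * experts[i](x) for i, gate in route(router_logits, k))

# Toy experts and made-up router scores for a single token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]

top2_output = moe_layer(3.0, experts, router_logits, k=2)  # experts 1 and 3, equal weight
top1_output = moe_layer(3.0, experts, router_logits, k=1)  # a single expert per token
```

Only k of the four experts execute for this token, which is where the per-token compute saving comes from. Setting k=1 gives the simpler single-expert routing discussed in the Switch Transformers entry.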

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Authors: Fedus, Zoph, Shazeer (Google)
Venue: JMLR 2022
URL: https://arxiv.org/abs/2101.03961

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: MoE scales to trillion-parameter models with surprisingly simple top-1 routing (each token goes to exactly one expert). Showed that earlier MoE complexity was unnecessary and that sparse models could match dense quality at a fraction of the compute.

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT)

Authors: Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby (Google)
Venue: ICLR 2021
URL: https://arxiv.org/abs/2010.11929

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: Divides an image into fixed-size patches and treats each patch as a token, the approach underlying multimodal models from OpenAI, Anthropic, Google, and others. Once images can be tokenized the same way as text, a single transformer can reason over both.
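
The patch-to-token step can be sketched as follows. This is a simplified illustration that assumes a single-channel image stored as a list of rows; the real model also applies a learned projection and position embeddings to each patch, which are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A blank 224 x 224 image cut into 16 x 16 patches.
image = [[0] * 224 for _ in range(224)]
patches = patchify(image, 16)
num_tokens = len(patches)       # 14 * 14 = 196 patch tokens
patch_length = len(patches[0])  # 16 * 16 = 256 values per patch
```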

Prompting & Reasoning

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Authors: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (Google)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2201.11903

PM reading guide: Read the abstract and Figures 1–2. Skip Sections 3–5.

Takeaway: Including a few worked examples with step-by-step reasoning in the prompt dramatically improves multi-step reasoning in sufficiently large models. The mechanism: reasoning written out as tokens gives the model intermediate computation space that implicit (non-token) reasoning doesn't have.

Large Language Models are Zero-Shot Reasoners (Zero-shot CoT)

Authors: Kojima, Gu, Reid, Matsuo, Iwasawa
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2205.11916

PM reading guide: Read the abstract.

Takeaway: "Let's think step by step" — a single phrase — unlocks multi-step reasoning without any examples. The fact that this works is philosophically interesting: it suggests the model knows how to reason, but defaults to not doing so unless prompted.
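
In practice the technique is just prompt construction. A minimal sketch, assuming a two-step flow (first elicit the reasoning, then ask for the final answer). The helper names and the answer cue wording are illustrative, and no model is called.

```python
TRIGGER = "Let's think step by step."

def reasoning_prompt(question):
    """First call: append the trigger phrase so the model writes out its reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"

def answer_prompt(question, reasoning, answer_cue="Therefore, the answer is"):
    """Second call: feed the generated reasoning back and cue a short final answer."""
    return f"{reasoning_prompt(question)} {reasoning}\n{answer_cue}"

question = "A pack has 12 pens. How many pens are in 3 packs?"
first = reasoning_prompt(question)
# The reasoning string here stands in for what a model would generate.
second = answer_prompt(question, "Each pack has 12 pens, so 3 packs have 3 * 12 = 36 pens.")
```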

ReAct: Synergizing Reasoning and Acting in Language Models

Authors: Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao
Venue: ICLR 2023
URL: https://arxiv.org/abs/2210.03629

PM reading guide: Read the abstract and Figure 1. Skip Sections 3–5.

Takeaway: Interleaving reasoning ("I should look up...") with acting (tool call) is the foundation of most modern agent architectures. ReAct is why LLM agents "think before they act" — the reasoning trace isn't just logging, it improves action quality.
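
The reason-then-act loop can be sketched without a real model. Everything below is a stand-in: `scripted_model` replaces the LLM with two canned responses, and `lookup_population` is a hypothetical tool returning made-up data. The point is the control flow of thought, action, and observation.

```python
def lookup_population(city):
    # Hypothetical tool with hard-coded, made-up data.
    return {"Springfield": "120000"}.get(city, "unknown")

TOOLS = {"lookup_population": lookup_population}

def scripted_model(transcript):
    """Stand-in for an LLM: returns the next Thought/Action pair given the transcript so far."""
    if "Observation:" not in transcript:
        return "Thought: I should look up the population.\nAction: lookup_population[Springfield]"
    return "Thought: I have what I need.\nAction: finish[120000]"

def react(question, model, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        # Parse "Action: tool_name[argument]" from the model's step.
        action = step.split("Action: ", 1)[1]
        name, arg = action.rstrip("]").split("[", 1)
        if name == "finish":
            return arg, transcript
        # Run the tool and feed the result back as an observation.
        transcript += f"\nObservation: {TOOLS[name](arg)}"
    return None, transcript

answer, trace = react("What is the population of Springfield?", scripted_model)
```

The transcript grows with each thought, action, and observation, so every later step is conditioned on the reasoning and tool results that came before it.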

Vector Search & Retrieval Infrastructure

Matryoshka Representation Learning (MRL)

Authors: Kusupati, Bhatt, Rege, Wallingford, Sinha, Ramanujan, et al.
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2205.13147

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 3–6 (loss formulation, classification ablations, experimental tables).

Takeaway: Training embeddings so that any prefix of dimensions is itself a high-quality embedding. A 1536-dimensional MRL model can be truncated to 256 or 512 dimensions at query time with only modest quality loss — you trade retrieval precision for storage and compute cost without retraining. OpenAI's text-embedding-3 models use MRL; it's the mechanism behind the dimensions API parameter.
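
Truncation itself is a one-liner plus renormalization. A minimal sketch, assuming the embedding came from an MRL-trained model (truncating an ordinary embedding this way generally loses far more quality); the 8-dimensional vector is made up.

```python
import math

def truncate_embedding(embedding, dims):
    """Keep the first `dims` values and rescale to unit length so cosine search still works."""
    prefix = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]  # made-up 8-dimensional embedding
short = truncate_embedding(full, 4)               # half the storage per vector
```

Halving the dimensions halves index storage and roughly halves the cost of each similarity computation, which is the tradeoff the takeaway describes.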

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW)

Authors: Malkov, Yashunin
Venue: IEEE TPAMI 2020
URL: https://arxiv.org/abs/1603.09320

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5 (graph construction proofs, complexity analysis, benchmark tables).

Takeaway: The foundational paper for HNSW — the graph-based ANN index used by Qdrant, Weaviate, pgvector, and most production vector databases. The key idea: a multi-layer navigable graph lets you jump to the right region of the embedding space quickly, then refine at the dense bottom layer, achieving approximate nearest-neighbor search orders of magnitude faster than exhaustive scan.

Retrieval & Grounding

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)

Authors: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela (Facebook AI)
Venue: NeurIPS 2020
URL: https://arxiv.org/abs/2005.11401

PM reading guide: Read the abstract and Figure 1. Skip Sections 2–5.

Takeaway: Retrieve relevant documents at query time and inject them into the prompt — reduces hallucination on domain-specific questions by grounding generation in real sources. The paper that named and formalized the RAG pattern now used everywhere.
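
The retrieve-then-inject pattern can be sketched end to end. This toy version scores documents by word overlap purely to stay self-contained; production systems typically rank by embedding similarity instead. The documents, prompt wording, and function names are all illustrative.

```python
def score(query, document):
    """Toy relevance score: number of distinct query words that appear in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query, documents, top_k=2):
    """Return the top_k highest-scoring documents."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_k]

def build_prompt(query, documents, top_k=2):
    """Inject the retrieved sources into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Answer using only the sources below.\n\nSources:\n{context}\n\nQuestion: {query}"

documents = [
    "refunds are issued within 14 days of purchase",
    "the office is closed on public holidays",
    "refunds require the original receipt",
]
prompt = build_prompt("how do refunds work", documents)
```

The model only sees the two refund documents, so its answer is grounded in them rather than in whatever it memorized during training.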

Evaluation

Measuring Massive Multitask Language Understanding (MMLU)

Authors: Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt
Venue: ICLR 2021
URL: https://arxiv.org/abs/2009.03300

PM reading guide: Read the abstract.

Takeaway: 57-subject multiple-choice benchmark for general LLM knowledge; now largely saturated by frontier models (most score above 85%). MMLU scores still appear in model release announcements but should be interpreted cautiously — the benchmark was not designed for models this capable.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Authors: Zheng, Chiang, Sheng, Zhuang, Wu, Zhuang, Lin, Li, Li, Xing, Zhang, Gonzalez, Stoica (LMSYS)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2306.05685

PM reading guide: Read the abstract and Section 1.

Takeaway: Chatbot Arena ranks models by Elo rating from blind, crowdsourced A/B comparisons; it is widely treated as one of the most trusted real-world quality signals. The paper also validates using a capable LLM as an automated judge, enabling cheaper evaluation at scale.
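
For readers unfamiliar with Elo, here is a minimal sketch of how one head-to-head vote moves two ratings. It uses the conventional chess constants (a 400-point scale and K=32) as assumptions; the leaderboard's exact procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, a_won, k=32):
    """Shift both ratings toward the observed result; upsets move ratings more."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta

new_a, new_b = update(1000, 1000, a_won=True)          # 1016.0, 984.0
upset_a, upset_b = update(1000, 1400, a_won=True)      # the underdog gains far more than 16
```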

Agentic Tooling & Ecosystems

Model Context Protocol (MCP)

Authors: Anthropic
Venue: Anthropic Engineering Blog 2024
URL: https://www.anthropic.com/news/model-context-protocol

PM reading guide: Read the blog post in full.

Takeaway: Anthropic's open standard for connecting LLMs to external tools and data sources, often described as a "USB-C port for AI applications." MCP defines a standard protocol so tool integrations built for one model work with others. Now supported by major IDEs and AI tooling vendors.

Toolformer: Language Models Can Teach Themselves to Use Tools

Authors: Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (Meta)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2302.04761

PM reading guide: Read the abstract and Figure 1.

Takeaway: LLMs can learn to call APIs mid-generation — the conceptual foundation for modern function calling. The self-supervised training approach lets the model learn when tool use helps without requiring explicit supervision for every case.

GPT-4 Technical Report

Authors: OpenAI
Venue: arXiv 2023
URL: https://arxiv.org/abs/2303.08774

PM reading guide: Read the abstract, Section 1, and the Evals overview section.

Takeaway: Documents GPT-4's capabilities and safety evaluations; notable for what it deliberately omits about architecture (parameter count, training data, and compute are all withheld). Sets the template for frontier model releases as capability-and-safety reports rather than technical disclosures.

Claude's Model Card (Claude 3 family)

Authors: Anthropic
Venue: Anthropic 2024
URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

PM reading guide: Read all of it.

Takeaway: Transparent capabilities and safety evaluations for the Claude 3 family (Haiku, Sonnet, Opus). A benchmark for how frontier labs should communicate model properties — includes benchmark scores, safety red-teaming results, and known limitations without excessive hedging. Anthropic has continued publishing model cards for subsequent releases; this one established the template.
