Paper Reference Guide

All papers cited across this primer, grouped by topic. Each entry includes a PM reading guide (what to read and what to skip) so you get the key insight without wading through methodology sections.

Papers are free on arXiv or linked directly. Reading guides are calibrated for a PM/product audience; engineers and researchers should read further.


Foundations​

Attention Is All You Need​

Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Venue: NeurIPS 2017
URL: https://arxiv.org/abs/1706.03762

PM reading guide: Read the abstract, Section 1 (Introduction), and Figure 1 only. Skip Sections 3–5 (the mathematical formulation of attention and training details).

Takeaway: Introduced the Transformer, the architecture underlying virtually every modern LLM. The core idea is replacing recurrent connections with self-attention, which parallelizes better and handles long-range dependencies more effectively.
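A minimal sketch of the self-attention step, assuming a single head and tiny hand-written vectors. The function names and toy inputs are illustrative, not from the paper, and the learned query/key/value projections are left out for brevity.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Every token scores every other token in one pass (no recurrence).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy input: three "tokens" with 2-dimensional vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Because each output row depends only on the full input, all rows can be computed in parallel, which is the property the takeaway refers to.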


Efficient Estimation of Word Representations in Vector Space (Word2Vec)​

Authors: Mikolov, Chen, Corrado, Dean
Venue: arXiv 2013
URL: https://arxiv.org/abs/1301.3781

PM reading guide: Read the abstract only. Skip everything else: the technical contribution is the training method, which you don't need.

Takeaway: Popularized word embeddings (words as vectors that capture semantic meaning) by making them cheap to train on very large corpora. The famous result: king - man + woman ≈ queen. This was the conceptual ancestor of all modern embedding models.
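The analogy arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are invented for illustration (real Word2Vec vectors are learned and have hundreds of dimensions), and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: axis 0 loosely encodes "royalty", axis 1 "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)  # "queen" for these toy vectors
```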


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding​

Authors: Devlin, Chang, Lee, Toutanova
Venue: NAACL 2019
URL: https://arxiv.org/abs/1810.04805

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 3–5.

Takeaway: The canonical encoder-only model. Bidirectional pre-training (seeing the whole sentence before predicting any part) dramatically improves NLP task performance. BERT-style models still power many production retrieval and classification pipelines.


Language Models are Unsupervised Multitask Learners (GPT-2)​

Authors: Radford, Wu, Child, Luan, Amodei, Sutskever (OpenAI)
Venue: OpenAI Technical Report 2019
URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

PM reading guide: Read the abstract and Section 1. Skip Sections 2–4.

Takeaway: GPT-2 showed a large decoder-only model trained on raw text could perform surprisingly well on diverse tasks with no task-specific training. Scale alone was generating emergent capabilities, a signal that pointed toward GPT-3 and everything after.


Training & Alignment​

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)​

Authors: Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe (OpenAI)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.02155

PM reading guide: Read the abstract, Section 1, and Figure 2 (the three-phase training diagram). Skip Sections 2–4.

Takeaway: RLHF (having humans rank model outputs and training toward those preferences) dramatically improves helpfulness. This paper is the direct ancestor of ChatGPT and every instruction-following model since.
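To make "training toward those preferences" concrete, here is a sketch of the pairwise loss commonly used to train a reward model from human rankings. The function name and the scalar rewards are illustrative assumptions; in practice the rewards come from a neural network and the loss is averaged over many comparison pairs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))

agrees_with_human = preference_loss(2.0, 0.0)     # model ranks the pair correctly
disagrees_with_human = preference_loss(0.0, 2.0)  # model ranks the pair backwards
```

The trained reward model then stands in for the human raters during the reinforcement learning phase.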


Constitutional AI: Harmlessness from AI Feedback​

Authors: Bai, Kadavath, Kundu, Askell, Kernion, Jones, et al. (Anthropic)
Venue: Anthropic Technical Report 2022
URL: https://arxiv.org/abs/2212.08073

PM reading guide: Read the abstract and Section 1. Skip Sections 2–5.

Takeaway: In Anthropic's alignment approach, the model critiques its own outputs against a written "constitution" of principles. This scales better than pure RLHF by replacing the most expensive part (human feedback on every output) with AI feedback guided by explicit principles.


Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors: Bai, Jones, Ndousse, Askell, Chen, DasSarma, Drain, Fort, Ganguli, Henighan, et al. (Anthropic)
Venue: arXiv 2022
URL: https://arxiv.org/abs/2204.05862

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: Empirical study of RLHF showing the tradeoff between helpfulness and harmlessness: making a model safer often makes it less helpful, and vice versa. This tension is still unresolved and relevant to every AI product decision.


Architecture & Scaling​

Scaling Laws for Neural Language Models​

Authors: Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei (OpenAI)
Venue: arXiv 2020
URL: https://arxiv.org/abs/2001.08361

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.

Takeaway: Model performance improves predictably with scale (parameters, data, compute) as a power law. This finding justified the massive compute investments behind GPT-3 and the models that followed.
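A rough sketch of what "power law" means in practice. The exponent below is an approximate value for the parameter-count term and should be treated as an assumption for illustration; the exact fitted constants are in the paper.

```python
def loss_ratio(scale_factor, alpha):
    """If loss is proportional to N ** (-alpha), scaling N by `scale_factor`
    multiplies the loss by this ratio."""
    return scale_factor ** (-alpha)

# Approximate exponent for model size (assumed here, not a quoted figure).
ALPHA_N = 0.076

ratio_10x = loss_ratio(10, ALPHA_N)    # effect of 10x more parameters
ratio_100x = loss_ratio(100, ALPHA_N)  # effect of 100x more parameters
```

With an exponent this small, each 10x increase in parameters buys a steady but modest loss reduction, which is why the curve is predictable yet expensive to follow.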


Training Compute-Optimal Large Language Models (Chinchilla)​

Authors: Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, Sifre (DeepMind)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.15556

PM reading guide: Read the abstract only.

Takeaway: Optimal training uses roughly 20 tokens per parameter. Many large models (including GPT-3) were undertrained: they had too many parameters relative to the data they were trained on. Chinchilla, trained on more data with fewer parameters, matched or beat much larger models.
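The 20-tokens-per-parameter rule of thumb is easy to check with arithmetic. The sketch below uses the publicly reported sizes for GPT-3 (175B parameters, about 300B training tokens) and Chinchilla (70B parameters, about 1.4T tokens); the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the takeaway above

def compute_optimal_tokens(n_params):
    """Approximate training tokens suggested for a model of this size."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

gpt3_ratio = tokens_per_param(175e9, 300e9)          # well under 20
chinchilla_ratio = tokens_per_param(70e9, 1.4e12)    # about 20
gpt3_suggested_tokens = compute_optimal_tokens(175e9)  # 3.5 trillion
```

By this rule a 175B-parameter model would want roughly 3.5 trillion tokens, more than ten times what GPT-3 was trained on.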


Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer​

Authors: Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean (Google)
Venue: ICLR 2017
URL: https://arxiv.org/abs/1701.06538

PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.

Takeaway: A learned router selects which expert sub-networks to activate per token, enabling very large models at reduced per-token compute cost. The foundation for Mixtral, GPT-4 (reportedly), and Gemini's MoE variants.
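A simplified sketch of sparse routing, assuming the router has already produced one logit per expert. The experts here are plain functions and the logits are hand-picked; in a real layer both are learned networks, and the paper's gating also adds noise and load-balancing terms that are omitted.

```python
import math

def route(gate_logits, k):
    """Keep the top-k experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts and router output for a single token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]

selected = route(gate_logits, k=2)          # experts 1 and 3, weight 0.5 each
output = moe_layer(3.0, experts, gate_logits, k=2)
```

Only two of the four experts run for this token, which is where the per-token compute saving comes from.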


Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity​

Authors: Fedus, Zoph, Shazeer (Google)
Venue: JMLR 2022
URL: https://arxiv.org/abs/2101.03961

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: MoE scales to trillion-parameter models with surprisingly simple top-1 routing (each token goes to exactly one expert). Showed that earlier MoE complexity was unnecessary and that sparse models could match dense quality at a fraction of the compute.


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

Authors: Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby (Google)
Venue: ICLR 2021
URL: https://arxiv.org/abs/2010.11929

PM reading guide: Read the abstract and Figure 1. Skip the technical sections.

Takeaway: Images are divided into patches that are treated as tokens, the approach underlying multimodal models like GPT-4V and Claude 3. Once you can tokenize images the same way as text, a single transformer can reason over both.
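A small sketch of the patching step, using a nested list as a stand-in for an image. The function names are illustrative, and the learned linear projection and position embeddings that follow patching are omitted.

```python
def num_patches(image_height, image_width, patch_size):
    """Number of tokens produced when an image is cut into square patches."""
    if image_height % patch_size or image_width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (image_height // patch_size) * (image_width // patch_size)

def patchify(image, patch_size):
    """Split a 2-D grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches

tokens_for_224 = num_patches(224, 224, 16)  # 14 * 14 = 196 patches

# Toy 4x4 "image" with pixel values 0..15, split into 2x2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, 2)
```

A 224x224 image therefore becomes a sequence of 196 patch tokens, which the transformer processes much like a 196-word sentence.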


Prompting & Reasoning​

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models​

Authors: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (Google)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2201.11903

PM reading guide: Read the abstract and Figures 1–2. Skip Sections 3–5.

Takeaway: Including a few worked examples that spell out intermediate reasoning steps in the prompt dramatically improves multi-step reasoning. The mechanism: reasoning written out as tokens gives the model intermediate computation space that implicit (non-token) reasoning doesn't have.


Large Language Models are Zero-Shot Reasoners (Zero-shot CoT)​

Authors: Kojima, Gu, Reid, Matsuo, Iwasawa
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2205.11916

PM reading guide: Read the abstract.

Takeaway: "Let's think step by step", a single phrase, unlocks multi-step reasoning without any examples. The fact that this works is philosophically interesting: it suggests the model knows how to reason, but defaults to not doing so unless prompted.
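The technique amounts to a one-line change in how the prompt is built. This sketch shows only the prompt construction; the function name is hypothetical, and the model call itself is left out because it depends on whichever API you use.

```python
TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    """Build a prompt that ends with the reasoning trigger phrase,
    with no worked examples included."""
    return f"Q: {question}\nA: {TRIGGER}"

prompt = zero_shot_cot_prompt(
    "A shop sells pens in packs of 12. How many packs are needed for 100 pens?"
)
```

The model's continuation then contains the reasoning; a second, shorter prompt can be used to extract just the final answer from that reasoning.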


ReAct: Synergizing Reasoning and Acting in Language Models​

Authors: Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao
Venue: ICLR 2023
URL: https://arxiv.org/abs/2210.03629

PM reading guide: Read the abstract and Figure 1. Skip Sections 3–5.

Takeaway: Interleaving reasoning ("I should look up...") with acting (tool call) is the foundation of most modern agent architectures. ReAct is why LLM agents "think before they act": the reasoning trace isn't just logging, it improves action quality.
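The thought/action/observation loop can be sketched in a few lines. Everything here is a stand-in: `scripted_policy` plays the role of the LLM with hard-coded responses, and the `lookup` tool is a hypothetical dictionary search.

```python
def run_react(question, policy, tools, max_steps=5):
    """Alternate between a reasoning step and a tool call until the policy finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            transcript.append(f"Answer: {argument}")
            return argument, transcript
        observation = tools[action](argument)
        transcript.append(f"Action: {action}[{argument}]")
        # The observation is fed back so the next thought can use it.
        transcript.append(f"Observation: {observation}")
    return None, transcript

def scripted_policy(transcript):
    # Hypothetical stand-in for an LLM: look something up once, then answer.
    if not any(line.startswith("Observation:") for line in transcript):
        return ("I should look up the capital of France.", "lookup", "capital of France")
    return ("The observation answers the question.", "finish", "Paris")

tools = {"lookup": lambda query: {"capital of France": "Paris"}.get(query, "no result")}

answer, transcript = run_react("What is the capital of France?", scripted_policy, tools)
```

The transcript doubles as the model's working memory: each new thought is conditioned on every earlier thought, action, and observation.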


Retrieval & Grounding​

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)​

Authors: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela (Facebook AI)
Venue: NeurIPS 2020
URL: https://arxiv.org/abs/2005.11401

PM reading guide: Read the abstract and Figure 1. Skip Sections 2–5.

Takeaway: Retrieve relevant documents at query time and inject them into the prompt. This reduces hallucination on domain-specific questions by grounding generation in real sources. The paper that named and formalized the RAG pattern now used everywhere.
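A minimal sketch of the retrieve-then-prompt pattern described above. It uses naive keyword overlap for retrieval, which is an assumption made to keep the example self-contained; the paper itself uses a learned dense retriever paired with a sequence-to-sequence generator. The documents and function names are invented for illustration.

```python
def score(query, document):
    """Count shared lowercase words between the query and a document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query, documents, top_k=1):
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)[:top_k]

def build_prompt(query, documents, top_k=1):
    """Inject the best-matching documents into the prompt as sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical knowledge base.
docs = [
    "The refund window is 30 days from delivery.",
    "Shipping takes 3 to 5 business days.",
    "Support is available on weekdays.",
]
prompt = build_prompt("How long is the refund window?", docs)
```

Production systems typically swap the keyword scorer for embedding similarity, but the shape of the pipeline stays the same.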


Evaluation​

Measuring Massive Multitask Language Understanding (MMLU)​

Authors: Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt
Venue: ICLR 2021
URL: https://arxiv.org/abs/2009.03300

PM reading guide: Read the abstract.

Takeaway: 57-subject multiple-choice benchmark for general LLM knowledge; now largely saturated by frontier models (most score above 85%). MMLU scores still appear in model release announcements but should be interpreted cautiously: the benchmark was not designed for models this capable.


Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena​

Authors: Zheng, Chiang, Sheng, Zhuang, Wu, Zhuang, Lin, Li, Li, Xing, Zhang, Gonzalez, Stoica (LMSYS)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2306.05685

PM reading guide: Read the abstract and Section 1.

Takeaway: Chatbot Arena uses blind human A/B comparisons to rank models by Elo rating, and is widely treated as one of the most trusted real-world quality signals. The paper also validates using a capable LLM as an automated judge, enabling cheaper evaluation at scale.
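For readers unfamiliar with Elo, here is a sketch of the standard rating update after one head-to-head comparison. The K-factor of 32 and the starting ratings are assumptions chosen for illustration; the leaderboard's actual computation may differ in its details.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; a human voter prefers model A.
new_a, new_b = update(1000, 1000, 1.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, so the ranking converges as votes accumulate.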


Agentic Tooling & Ecosystems​

Model Context Protocol (MCP)​

Authors: Anthropic
Venue: Anthropic Engineering Blog 2024
URL: https://www.anthropic.com/news/model-context-protocol

PM reading guide: Read the blog post in full.

Takeaway: Anthropic's open standard for connecting LLMs to external tools, described as "USB-C for AI integrations." MCP defines a standard protocol so tool integrations built for one model work with others. Now supported by major IDEs and AI tooling vendors.


Toolformer: Language Models Can Teach Themselves to Use Tools​

Authors: Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (Meta)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2302.04761

PM reading guide: Read the abstract and Figure 1.

Takeaway: LLMs can learn to call APIs mid-generation, the conceptual foundation for modern function calling. The self-supervised training approach lets the model learn when tool use helps without requiring explicit supervision for every case.


GPT-4 Technical Report​

Authors: OpenAI
Venue: arXiv 2023
URL: https://arxiv.org/abs/2303.08774

PM reading guide: Read the abstract, Section 1, and the Evals overview section.

Takeaway: Documents GPT-4's capabilities and safety evaluations; notable for what it deliberately omits about architecture (parameter count, training data, and compute are all withheld). Sets the template for frontier model releases as capability-and-safety reports rather than technical disclosures.


Claude's Model Card (Claude 3 family)​

Authors: Anthropic
Venue: Anthropic 2024
URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

PM reading guide: Read all of it.

Takeaway: Transparent capabilities and safety evaluations for Claude 3 Haiku, Sonnet, and Opus. A model for how frontier labs should communicate model properties: it includes benchmark scores, safety red-teaming results, and known limitations without excessive hedging.